% Bachelor thesis: Data Movement in Heterogeneous Memories with Intel Data Streaming Accelerator
\chapter{Evaluation}
\label{chap:evaluation}
% Every thesis in our field includes a performance evaluation. This
% chapter should make clear which methods were applied to evaluate
% the performance and which results were achieved. It is important
% not to just present the reader with a few numbers, but to also
% provide a discussion of the results. It is recommended to first
% explain one's own expectations regarding the results and then to
% explain any deviations that were observed.
In this chapter, we define our expectations for applying the developed \texttt{Cache} to \glsentrylong{qdp} and subsequently evaluate the observed results. The code used is described in more detail in Section \ref{sec:impl:application}. \par
\section{Benchmarked Task}
\label{sec:eval:bench}
The benchmark executes a simple query as illustrated in Figure \ref{fig:qdp-simple-query}. From here on, we use the notation \(SCAN_a\) for the pipeline that scans and subsequently filters column \texttt{a}, \(SCAN_b\) for the pipeline that prefetches column \texttt{b}, and \(AGGREGATE\) for the projection and final summation step. We use a column size of 4 GiB and divide this work over 32 groups, assigning one thread to each pipeline per group \todo{maybe describe that the high group count was chosen to allow the implementation without overlapping pipeline execution to still perform well} \todo{consider moving to overlapping pipelines if enough time is found}. For configurations that do not perform prefetching, \(SCAN_b\) is not executed. We measure the time spent in each pipeline, the cache hit percentage, and the total processing time. \par
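The structure of this query can be sketched as follows. This is a minimal single-threaded illustration, not the benchmarked implementation; all identifiers and the concrete filter predicate are assumptions, as the actual code is described in Section \ref{sec:impl:application}:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the benchmarked query. SCAN_a scans column a and
// produces a filter mask; AGGREGATE projects column b through that mask
// and computes the final sum.

// SCAN_a: scan and filter column a, emitting one flag per element
// (the predicate "a < threshold" is an assumption for illustration)
std::vector<uint8_t> scan_a(const std::vector<int64_t>& a, int64_t threshold) {
    std::vector<uint8_t> mask(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        mask[i] = (a[i] < threshold) ? 1 : 0;
    return mask;
}

// AGGREGATE: sum all elements of b whose mask flag is set
int64_t aggregate(const std::vector<int64_t>& b,
                  const std::vector<uint8_t>& mask) {
    int64_t sum = 0;
    for (std::size_t i = 0; i < b.size(); ++i)
        if (mask[i]) sum += b[i];
    return sum;
}
```

In the benchmark, this work is additionally chunked and distributed over the 32 groups, with \(SCAN_b\) prefetching each chunk of \texttt{b} ahead of \(AGGREGATE\).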
Pipelines \(SCAN_a\) and \(SCAN_b\) execute concurrently, completing their workload and then signalling \(AGGREGATE\), which then finalizes the operation. With the goal of improving the cache hit rate, we decided to loosen this restriction and let \(SCAN_b\) work freely, synchronizing only \(SCAN_a\) with \(AGGREGATE\). Work is therefore submitted to the \gls{dsa} as frequently as possible, ideally completing each caching operation for a chunk of \texttt{b} before \(SCAN_a\) finishes processing the associated chunk of \texttt{a}. \par
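The loosened synchronisation can be sketched as a per-chunk signal between \(SCAN_a\) and \(AGGREGATE\); \(SCAN_b\) never participates in it. This is a hypothetical sketch (all names are assumptions, not the thesis implementation):

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Sketch of the per-chunk synchronisation: SCAN_a signals after each
// finished chunk, AGGREGATE waits only for the chunk it needs next.
// SCAN_b submits its caching work to the accelerator independently
// and is never waited on directly.
class ChunkBarrier {
public:
    // called by SCAN_a once a chunk of column a has been processed
    void signal_chunk_done() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++chunks_done_;
        }
        cv_.notify_all();
    }

    // called by AGGREGATE before touching chunk `index`
    void wait_for_chunk(std::size_t index) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return chunks_done_ > index; });
    }

    std::size_t chunks_done() {
        std::lock_guard<std::mutex> lock(mutex_);
        return chunks_done_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t chunks_done_ = 0;
};
```

Because \(AGGREGATE\) only waits on \(SCAN_a\), a slow caching operation delays nothing; it merely results in a cache miss for that chunk of \texttt{b}.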
To ensure
\section{Expectations}
\label{sec:eval:expectations}
The simple query presents a challenging scenario for the \texttt{Cache}. As the filter operation applied to column \texttt{a} is not particularly complex, its execution time can be assumed to be short. The \texttt{Cache} therefore has little time in which to prefetch, which will amplify delays caused by processing overhead in the \texttt{Cache} or during accelerator offload. Additionally, the task can be assumed to be memory bound. As the prefetching of \texttt{b} in \(SCAN_b\) and the load and subsequent filter of \texttt{a} in \(SCAN_a\) execute in parallel, caching directly reduces the memory bandwidth available to \(SCAN_a\) when both columns are located on the same \gls{numa:node}. \par
Due to the challenges posed by sharing memory bandwidth, we benchmark prefetching in two configurations. In the first, both columns \texttt{a} and \texttt{b} are located on the same \gls{numa:node}. Here, we expect the memory bottleneck to manifest as an increase in the execution time of \(SCAN_a\) by the amount of time spent prefetching in \(SCAN_b\). In the second setup, the columns are distributed over two \gls{numa:node}s, both still backed by \glsentryshort{dram}. In this configuration, \(SCAN_a\) should only suffer from the additional threads performing the prefetching. \par
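Stated as a simple additive model (assuming the memory bus is fully saturated, so the bandwidth consumed by prefetching is lost to \(SCAN_a\) one-to-one), the expectation for the first configuration is
\begin{equation*}
t_{SCAN_a}^{\mathit{shared}} \approx t_{SCAN_a}^{\mathit{baseline}} + t_{SCAN_b}
\end{equation*}
while for the second configuration \(t_{SCAN_a}^{\mathit{shared}} \approx t_{SCAN_a}^{\mathit{baseline}}\), up to the scheduling overhead of the additional prefetching threads. \par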
\section{Observations}
In this section, we present our findings from applying the \texttt{Cache} developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}. We begin by presenting the results without prefetching, representing the lower and upper boundaries, respectively. \par
\begin{figure}[h!tb]
\centering
\begin{subfigure}[t]{0.75\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-dram.pdf}
\caption{Columns \texttt{a} and \texttt{b} located on the same \glsentryshort{dram} \glsentryshort{numa:node}.}
\label{fig:timing-comparison:baseline}
\end{subfigure}
\begin{subfigure}[t]{0.75\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-hbm.pdf}
\caption{Column \texttt{a} located in \glsentryshort{dram} and \texttt{b} in \glsentryshort{hbm}.}
\label{fig:timing-comparison:upplimit}
\end{subfigure}
\caption{Time spent on functions \(SCAN_a\) and \(AGGREGATE\) without prefetching for different locations of column \texttt{b}. Figure (a) represents the lower boundary by using only \glsentryshort{dram}, while Figure (b) simulates perfect caching by storing column \texttt{b} in \glsentryshort{hbm} during benchmark setup.}
\label{fig:timing-comparison}
\end{figure}
Our baseline will be the performance achieved with both columns \texttt{a} and \texttt{b} located in \glsentryshort{dram} and no prefetching. The upper limit will be represented by measuring the scenario where \texttt{b} is already located in \gls{hbm} at the start of the benchmark, simulating prefetching with no overhead or delay. \par
\begin{figure}[h!tb]
\centering
\begin{subfigure}[t]{0.75\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-prefetch.pdf}
\caption{Prefetching with columns \texttt{a} and \texttt{b} located on the same \glsentryshort{dram} \glsentryshort{numa:node}.}
\label{fig:timing-results:prefetch}
\end{subfigure}
\begin{subfigure}[t]{0.75\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-distprefetch.pdf}
\caption{Prefetching with columns \texttt{a} and \texttt{b} located on different \glsentryshort{dram} \glsentryshort{numa:node}s.}
\label{fig:timing-results:distprefetch}
\end{subfigure}
\caption{Time spent on functions \(SCAN_a\), \(SCAN_b\) and \(AGGREGATE\) with prefetching. Operations \(SCAN_a\) and \(SCAN_b\) execute concurrently. Figure (a) shows the bandwidth limitation, as the time for \(SCAN_a\) increases drastically because the copying of column \texttt{b} to \glsentryshort{hbm} takes place in parallel. In Figure (b), the columns are located on different \glsentryshort{numa:node}s, so the \(SCAN\)-operations do not compete for bandwidth.}
\label{fig:timing-results}
\end{figure}
\begin{itemize}
\item In Figure \ref{fig:timing-results:distprefetch}, \(AGGREGATE\) should theoretically reach the level of the \glsentryshort{hbm} upper limit when prefetching, but the additional concurrent workload slows down \(AGGREGATE\) as well.
\item In Figure \ref{fig:timing-results:prefetch}, the increase for \(SCAN_a\) due to the shared bandwidth is reasonable, while the increase for \(AGGREGATE\) seems disproportionately large.
\end{itemize}
\todo{consider benchmarking with only one group and lowering the workload size accordingly, to reduce the effects of overlapping on the graphic}
\section{Discussion}
%%% Local Variables:
%%% TeX-master: "diplom"
%%% End: