% Bachelor's thesis: Data Movement in Heterogeneous Memories with Intel Data Streaming Accelerator
\chapter{Evaluation}
\label{chap:evaluation}
% Every thesis in our field includes a performance evaluation. This
% chapter should make clear which methods were used to evaluate
% performance and which results were obtained. It is important not to
% merely present the reader with a few numbers, but also to discuss
% the results. It is recommended to first explain your own
% expectations regarding the results and then to explain any
% deviations that were observed.
In this chapter we define our expectations for applying the developed \texttt{Cache} to \glsentrylong{qdp} and then evaluate the observed results. The code used is described in more detail in Section \ref{sec:impl:application}. \par
\section{Benchmarked Task}
\label{sec:eval:bench}
The benchmark executes a simple query, illustrated in Figure \ref{fig:qdp-simple-query}. From here on, we use the notation \(SCAN_a\) for the pipeline that scans and subsequently filters column \texttt{a}, \(SCAN_b\) for the pipeline that prefetches column \texttt{b}, and \(AGGREGATE\) for the projection and final summation step. We use a column size of 4 GiB. The work is divided over multiple groups, and each group spawns threads for each pipeline step. For a fair comparison, each tested configuration uses 64 threads for the first stage (\(SCAN_a\) and \(SCAN_b\)) and 32 threads for the second stage (\(AGGREGATE\)), and is restricted to \gls{numa:node} 0 through pinning. For configurations that do not perform prefetching, \(SCAN_b\) is not executed. We measure the time spent in each pipeline, the cache hit percentage and the total processing time. \par
Pipelines \(SCAN_a\) and \(SCAN_b\) execute concurrently. In the initial design, both complete their workload and then signal \(AGGREGATE\), which finalizes the operation. To improve the cache hit rate, we loosened this restriction: \(SCAN_b\) runs freely, and only \(SCAN_a\) is synchronized with \(AGGREGATE\). Work is therefore submitted to the \gls{dsa} as frequently as possible, with the aim of completing each caching operation for a chunk of \texttt{b} before \(SCAN_a\) finishes processing the associated chunk of \texttt{a}. \par
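The goal of this relaxed synchronization can be made precise with a simple model. The symbols are assumptions introduced here for illustration and do not appear in the implementation: let \(n\) be the number of chunks, \(c_i\) the point in time at which the copy of chunk \(i\) of \texttt{b} completes, and \(s_i\) the point in time at which \(SCAN_a\) finishes chunk \(i\) of \texttt{a}. Assuming that \(AGGREGATE\) accesses chunk \(i\) as soon as it is signalled, the access is a cache hit if \(c_i \le s_i\), and the hit rate \(h\) is approximately
\[
  h \approx \frac{1}{n} \, \bigl| \{\, i \in \{1, \dots, n\} : c_i \le s_i \,\} \bigr| .
\]
This is only a sketch; the hit percentage reported by the benchmark may be counted differently. \par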
\section{Expectations}
\label{sec:eval:expectations}
The simple query presents a challenging scenario for the \texttt{Cache}. As the filter operation applied to column \texttt{a} is not particularly complex, its execution time can be assumed to be short. The \texttt{Cache} therefore has little time in which to prefetch, which amplifies delays caused by processing overhead in the \texttt{Cache} or during accelerator offload. Additionally, the task can be assumed to be memory bound. As the prefetching of \texttt{b} in \(SCAN_b\) and the load and subsequent filter of \texttt{a} in \(SCAN_a\) execute in parallel, caching directly reduces the memory bandwidth available to \(SCAN_a\) when both columns are located on the same \gls{numa:node}. \par
Due to the challenges posed by sharing memory bandwidth, we benchmark prefetching in two configurations. In the first, both columns \texttt{a} and \texttt{b} are located on the same \gls{numa:node}. We expect this setup to demonstrate the memory bottleneck, with the execution time of \(SCAN_a\) rising by the amount of time spent prefetching in \(SCAN_b\). In the second, the columns are distributed over two \gls{numa:node}s, both still \glsentryshort{dram}. In this configuration, \(SCAN_a\) should only suffer from the additional threads performing the prefetching. \par
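These expectations can be summarized in a rough model, in which all symbols are assumptions introduced here for illustration. Let \(t_a\) be the time of \(SCAN_a\) without prefetching, \(t_b\) the time \(SCAN_b\) needs when running alone, and \(\delta\) the overhead caused by the additional prefetching threads. We then expect
\[
  t_a^{\mathrm{same}} \approx t_a + t_b
  \qquad \mbox{and} \qquad
  t_a^{\mathrm{dist}} \approx t_a + \delta ,
\]
for columns on the same and on different \gls{numa:node}s, respectively. The first estimate follows under the idealized assumption that the \gls{numa:node} delivers a fixed read bandwidth \(B\) which either pipeline would saturate on its own: reading two columns of size \(D\) concurrently then takes \(2D / B\), whereas each pipeline alone would take \(D / B\). \par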
\section{Observations}
In this section we present our findings from applying the \texttt{Cache} developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}. We begin with the results obtained without prefetching, shown in Figure \ref{fig:timing-comparison}, which represent the lower and upper boundaries. \par
\begin{figure}[h!tb]
\centering
\begin{subfigure}[t]{0.75\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-dram.pdf}
\caption{Columns \texttt{a} and \texttt{b} located on the same \glsentryshort{dram} \glsentryshort{numa:node}.}
\label{fig:timing-comparison:baseline}
\end{subfigure}
\begin{subfigure}[t]{0.75\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-hbm.pdf}
\caption{Column \texttt{a} located in \glsentryshort{dram} and \texttt{b} in \glsentryshort{hbm}.}
\label{fig:timing-comparison:upplimit}
\end{subfigure}
\caption{Time spent on functions \(SCAN_a\) and \(AGGREGATE\) without prefetching for different locations of column \texttt{b}. Figure (a) represents the lower boundary by using only \glsentryshort{dram}, while Figure (b) simulates perfect caching by storing column \texttt{b} in \glsentryshort{hbm} during benchmark setup.}
\label{fig:timing-comparison}
\end{figure}
Our baseline is the performance achieved with both columns \texttt{a} and \texttt{b} located in \glsentryshort{dram} and no prefetching. The upper limit is obtained by measuring the scenario in which \texttt{b} is already located in \gls{hbm} at the start of the benchmark, simulating prefetching with no overhead or delay. \par
\begin{figure}[h!tb]
\centering
\begin{subfigure}[t]{0.75\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-prefetch.pdf}
\caption{Prefetching with columns \texttt{a} and \texttt{b} located on the same \glsentryshort{dram} \glsentryshort{numa:node}.}
\label{fig:timing-results:prefetch}
\end{subfigure}
\begin{subfigure}[t]{0.75\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-distprefetch.pdf}
\caption{Prefetching with columns \texttt{a} and \texttt{b} located on different \glsentryshort{dram} \glsentryshort{numa:node}s.}
\label{fig:timing-results:distprefetch}
\end{subfigure}
\caption{Time spent on functions \(SCAN_a\), \(SCAN_b\) and \(AGGREGATE\) with prefetching. Operations \(SCAN_a\) and \(SCAN_b\) execute concurrently. Figure (a) shows the bandwidth limitation: the time for \(SCAN_a\) increases drastically because column \texttt{b} is copied to \glsentryshort{hbm} in parallel. For Figure (b), the columns are located on different \glsentryshort{numa:node}s, so the \(SCAN\) operations do not compete for bandwidth.}
\label{fig:timing-results}
\end{figure}
\begin{itemize}
\item Figure \ref{fig:timing-results:distprefetch}: with prefetching, \(AGGREGATE\) should in theory reach the level measured for \glsentryshort{hbm}. Due to the additional concurrent workload, however, \(AGGREGATE\) is slowed down as well.
\item Figure \ref{fig:timing-results:prefetch}: the increase in \(SCAN_a\) is reasonable given the shared memory bandwidth, whereas the result for \(AGGREGATE\) seems somewhat unreasonable.
\end{itemize}
\todo{Consider benchmarking with only one group and lowering the workload size accordingly to reduce the effects of overlapping on the graphic.}
\begin{table}[h!tb]
\centering
\input{tables/table-qdpspeedup.tex}
\caption{Speedup of different \glsentryshort{qdp} configurations over \glsentryshort{dram}. The result for \glsentryshort{dram} serves as the baseline, while \glsentryshort{hbm} presents the upper boundary achievable with perfect prefetching. Prefetching was performed with the same parameters and data locations as \glsentryshort{dram}, caching on Node 8 (the \glsentryshort{hbm} accessor for the executing Node 0). Prefetching with Distributed Columns had columns \texttt{a} and \texttt{b} located on different nodes.}
\label{table:qdp-speedup}
\end{table}
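For reading Table \ref{table:qdp-speedup}, we assume the common definition of speedup. The symbols below are introduced here for illustration only. With \(t_c\) the total processing time of configuration \(c\) and \(t_{\mathrm{DRAM}}\) that of the baseline,
\[
  S_c = \frac{t_{\mathrm{DRAM}}}{t_c} ,
\]
so that \(S_{\mathrm{DRAM}} = 1\) by construction, values above 1 indicate a configuration faster than the baseline, and \(S_{\mathrm{HBM}}\) marks the upper boundary for the prefetching configurations. \par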
\section{Discussion}
%%% Local Variables:
%%% TeX-master: "diplom"
%%% End: