% Bachelor's thesis with associated TeX files and code snippets. Topic: Data Movement in Heterogeneous Memories with Intel Data Streaming Accelerator
\chapter{Evaluation}
\label{chap:evaluation}
% Every thesis in our field includes a performance evaluation. This
% chapter should make clear which methods were applied to assess
% performance and which results were obtained. It is important not
% merely to present the reader with a few numbers, but also to discuss
% the results. It is recommended to first explain one's own
% expectations regarding the results and then to explain any
% deviations that were observed.
In this chapter, we state the outcomes we expect from integrating the developed \texttt{Cache} into \glsentrylong{qdp} and then assess the results actually achieved. The specifics of the benchmark are elaborated upon in Section \ref{sec:impl:application}. We conclude with a discussion of the \texttt{Cache} and the observed results. \par
\section{Benchmarked Task}
\label{sec:eval:bench}
The benchmark executes a simple query, as depicted in Figure \ref{fig:qdp-simple-query}. We will henceforth denote \(SCAN_a\) as the pipeline responsible for scanning and subsequently filtering column \texttt{a}, \(SCAN_b\) as the pipeline tasked with prefetching column \texttt{b}, and \(AGGREGATE\) as the projection and final summation step. The column size is set to \(4\,\mathrm{GiB}\). The workload is distributed across multiple groups, with each group spawning threads for every pipeline step. To ensure a fair comparison, each tested configuration employs 64 threads for the first stage (\(SCAN_a\) and \(SCAN_b\)) and 32 for the second (\(AGGREGATE\)), and is pinned to execute on \gls{numa:node} 0. For configurations without prefetching, \(SCAN_b\) is omitted. We measure the total and per-pipeline duration and, where prefetching is used, the cache hit percentage over 5 iterations preceded by 5 warm-up runs, and report the average. \par
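As a minimal sketch of how the reported values can be read, we assume that the arithmetic mean is formed over the five measured iterations only, excluding the warm-up runs. The symbols \(t_i\), \(\bar{t}\), \(n_{\mathrm{hit}}\), \(n_{\mathrm{access}}\) and \(h\) are introduced here for illustration and are not taken from the implementation:
\[
\bar{t} = \frac{1}{5} \sum_{i=1}^{5} t_i, \qquad h = \frac{n_{\mathrm{hit}}}{n_{\mathrm{access}}} \cdot 100\,\%
\]
Here, \(t_i\) denotes the duration measured in iteration \(i\) and \(h\) the cache hit percentage, under the assumption that it is the share of accesses to chunks of column \texttt{b} that were served from the \texttt{Cache}. \par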
The pipelines \(SCAN_a\) and \(SCAN_b\) execute concurrently and complete their tasks before signalling \(AGGREGATE\) for finalization. To improve the cache hit rate, we relaxed this constraint: \(SCAN_b\) operates independently, and only \(SCAN_a\) is synchronized with \(AGGREGATE\). Consequently, work is submitted to the \gls{dsa} as frequently as possible, with the aim of completing the caching operation for a chunk of \texttt{b} before \(SCAN_a\) finishes processing the corresponding part of \texttt{a}. This burst submission could cause the work queue of the \gls{dsa} to overflow, which we avoid by increasing the block size for the benchmarks utilizing \gls{qdp}. \par
\section{Expectations}
\label{sec:eval:expectations}
The simple query presents a challenging scenario for the \texttt{Cache}. The filter operation applied to column \texttt{a} is expected to execute quickly. Consequently, the \texttt{Cache} has limited time for prefetching, which may exacerbate delays caused by processing overhead in the \texttt{Cache} or during accelerator offload. Furthermore, we assume that \(SCAN_a\) is memory-bound on its own. Since the prefetching of \texttt{b} in \(SCAN_b\) and the loading and subsequent filtering of \texttt{a} occur concurrently, caching directly reduces the memory bandwidth available to \(SCAN_a\) when both columns are located on the same \gls{numa:node}. \par
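This contention can be sketched as a simple bandwidth budget. For illustration, let \(B_{\mathrm{node}}\) denote the peak memory bandwidth of the \gls{numa:node} holding both columns, \(B_{SCAN_a}\) the bandwidth consumed by the scan threads and \(B_{\mathrm{DSA}}\) the bandwidth consumed by the copy operations. All three symbols are assumptions introduced here and are not quantities measured in this work:
\[
B_{SCAN_a} + B_{\mathrm{DSA}} \leq B_{\mathrm{node}}
\]
If \(SCAN_a\) alone already saturates the node, meaning \(B_{SCAN_a} \approx B_{\mathrm{node}}\), any share taken by the \gls{dsa} reduces the bandwidth left for \(SCAN_a\) and lengthens its execution time accordingly. \par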
\section{Observations}
\label{sec:eval:observations}
In this section, we present our findings from integrating the \texttt{Cache} developed in Chapters \ref{chap:design} and \ref{chap:implementation} into \gls{qdp}. We begin with results obtained without prefetching, which serve as a reference for evaluating the effectiveness of our \texttt{Cache}. For all results presented, the number of threads per pipeline and the number of groups influence performance \cite{dimes-prefetching}; investigating this influence is out of scope for this work. Therefore, the results shown are for the best configurations measured. \par
\subsection{Benchmarks without Prefetching}
We benchmarked two methods to establish a baseline and an upper limit as reference points. In the former, all columns are located in \glsentryshort{dram}. The latter simulates perfect prefetching without delay and overhead by placing column \texttt{b} in \gls{hbm} during benchmark initialization. \par
\begin{table}[!t]
\centering
\input{tables/table-qdp-baseline.tex}
\caption{Raw timings for \gls{qdp} on \glsentryshort{dram} and \gls{hbm}. The result for \glsentryshort{dram} serves as the baseline, while \glsentryshort{hbm} represents the upper boundary achievable with perfect prefetching.}
\label{table:qdp-baseline}
\end{table}
From Table \ref{table:qdp-baseline}, it is evident that accessing column \texttt{b} through \gls{hbm} speeds up processing. To better understand how the increased bandwidth of \gls{hbm} accelerates the query, we examine the time spent in the individual pipeline stages. \par
The following plots are normalized so that the longest execution from Figures \ref{fig:timing-comparison} and \ref{fig:timing-results} fills the half-circle. Waiting times at the barriers, which can vary by workload, are not displayed, so the graphs do not fully represent the total execution time. In addition, the total runtime includes some overhead that the per-pipeline timings do not cover. Therefore, the raw runtime values in the tables may differ from those shown in the figures. \par
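One way to read this normalization is given below. It is stated as an assumption, since the plotting code is not reproduced here. With \(t_p\) as the time spent in pipeline \(p\) of a given configuration and \(t_{\mathrm{max}}\) as the longest execution across both figures (both symbols are hypothetical), the fraction of the half-circle covered by \(p\) is
\[
f_p = \frac{t_p}{t_{\mathrm{max}}}
\]
so that slices are comparable across all plots and only the longest execution fills the half-circle completely. \par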
\begin{figure}[!t]
\centering
\begin{subfigure}[!t]{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-dram.pdf}
\caption{Columns \texttt{a} and \texttt{b} located on the same \glsentryshort{dram} \glsentryshort{numa:node}.}
\label{fig:timing-comparison:baseline}
\end{subfigure}
\hspace{5mm}
\begin{subfigure}[!t]{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-hbm.pdf}
\caption{Column \texttt{a} located in \glsentryshort{dram} and \texttt{b} in \glsentryshort{hbm}.}
\label{fig:timing-comparison:upplimit}
\end{subfigure}
\caption{Time spent on functions \(SCAN_a\) and \(AGGREGATE\) without prefetching for different locations of column \texttt{b}. Figure (a) represents the lower boundary by using only \glsentryshort{dram}, while Figure (b) simulates perfect caching by storing column \texttt{b} in \glsentryshort{hbm} during benchmark setup.}
\label{fig:timing-comparison}
\end{figure}
Due to the higher bandwidth that \gls{hbm} provides to \(AGGREGATE\), the CPU waits less for data from main memory, which improves processing times. This is evident in the overall shorter time taken for \(AGGREGATE\) in Figure \ref{fig:timing-comparison:upplimit} compared to the baseline depicted in Figure \ref{fig:timing-comparison:baseline}. Consequently, more threads can be assigned to \(SCAN_a\), as \(AGGREGATE\) requires fewer resources. This explains why the \gls{hbm} results show faster processing times than \gls{dram} not only for \(AGGREGATE\) but also for \(SCAN_a\). \par
\subsection{Benchmarks using Prefetching}
To address the challenges posed by sharing memory bandwidth between both \(SCAN\)-operations, we conduct the prefetching benchmark in two configurations. In the first, both columns \texttt{a} and \texttt{b} are located on the same \gls{numa:node}. We expect this scenario to expose the memory bottleneck through an increased execution time of \(SCAN_a\). In the second, we distribute the columns across two \gls{numa:node}s, both still utilizing \glsentryshort{dram}. This configuration alleviates the memory bottleneck, so we anticipate better performance than in the first setup. \par
\begin{table}[!t]
\centering
\input{tables/table-qdp-speedup.tex}
\caption{Speedup of different \glsentryshort{qdp} configurations over \glsentryshort{dram}. The result for \glsentryshort{dram} serves as the baseline, while \glsentryshort{hbm} represents the upper boundary achievable with perfect prefetching. Prefetching was performed with the same parameters and data locations as for \gls{dram}, caching on Node 8 (\glsentryshort{hbm} accessor for the executing Node 0). Prefetching with Distributed Columns had columns \texttt{a} and \texttt{b} located on different Nodes.}
\label{table:qdp-speedup}
\end{table}
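The values in Table \ref{table:qdp-speedup} can be read with the usual definition of speedup, which we assume here. The symbols \(S_c\), \(t_{\mathrm{DRAM}}\) and \(t_c\) are introduced for illustration only:
\[
S_c = \frac{t_{\mathrm{DRAM}}}{t_c}
\]
Here, \(t_{\mathrm{DRAM}}\) is the average total runtime of the \glsentryshort{dram} baseline and \(t_c\) that of configuration \(c\). A value of \(S_c < 1\) therefore indicates a slowdown relative to the baseline, while \(S_c > 1\) indicates an improvement. \par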
We now examine Table \ref{table:qdp-speedup}, which shows a slowdown for prefetching. This drop below our baseline when utilizing the \texttt{Cache} may be surprising at first glance. However, it becomes plausible when we consider that in this scenario, the \gls{dsa}s executing the caching tasks compete for bandwidth with the \(SCAN_a\) pipeline threads, while the \texttt{Cache} and work submission add further overhead. The second measured configuration for \gls{qdp} is shown as \enquote{Prefetching, Distributed Columns} in Table \ref{table:qdp-speedup}. For this method, distributing the columns across different \gls{numa:node}s results in a noticeable performance increase compared to our baseline, although it does not reach the upper boundary set by simulating perfect prefetching (called \enquote{HBM (Upper Limit)} in the same table). This confirms our assumption that the \(SCAN_a\) pipeline itself is bandwidth-bound, as without this contention, we observe an increase in cache hit rate and a decrease in processing time. We will now examine the performance in more detail with per-pipeline timings. \par
\begin{figure}[!t]
\centering
\begin{subfigure}[!t]{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-prefetch.pdf}
\caption{Prefetching with columns \texttt{a} and \texttt{b} located on the same \glsentryshort{dram} \glsentryshort{numa:node}.}
\label{fig:timing-results:prefetch}
\end{subfigure}
\hspace{5mm}
\begin{subfigure}[!t]{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-distprefetch.pdf}
\caption{Prefetching with columns \texttt{a} and \texttt{b} located on different \glsentryshort{dram} \glsentryshort{numa:node}s.}
\label{fig:timing-results:distprefetch}
\end{subfigure}
\caption{Time spent on functions \(SCAN_a\), \(SCAN_b\) and \(AGGREGATE\) with prefetching. Operations \(SCAN_a\) and \(SCAN_b\) execute concurrently. Figure (a) shows the bandwidth limitation, as the time for \(SCAN_a\) increases drastically because column \texttt{b} is copied to \glsentryshort{hbm} in parallel. For Figure (b), the columns are located on different \glsentryshort{numa:node}s, so the \(SCAN\)-operations do not compete for bandwidth.}
\label{fig:timing-results}
\end{figure}
In Figure \ref{fig:timing-results:prefetch}, the competition for bandwidth between \(SCAN_a\) and \(SCAN_b\) is evident, with \(SCAN_a\) showing significantly longer execution times. \(SCAN_b\) is nearly unaffected, as it only performs work submission through the \texttt{Cache} and therefore does not access system memory directly. The prolonged execution of \(SCAN_a\) leads to extended overlaps between groups still processing the scan and those already engaged in \(AGGREGATE\). Consequently, despite the relatively high cache hit rate (see Table \ref{table:qdp-speedup}), minimal speed-up is observed for \(AGGREGATE\) compared to the baseline depicted in Figure \ref{fig:timing-comparison:baseline}. The extended total runtime can thus be attributed to the prolonged duration of \(SCAN_a\). \par
For the benchmark depicted in Figure \ref{fig:timing-results:distprefetch}, where we distributed columns \texttt{a} and \texttt{b} across two nodes, the parallel execution of prefetching tasks on the \gls{dsa} does not directly reduce the bandwidth available to \(SCAN_a\). However, there is a discernible overhead associated with cache utilization, as evident in the time spent in \(SCAN_b\). Consequently, both \(SCAN_a\) and \(AGGREGATE\) take slightly longer than in the upper-limit measurement shown in Figure \ref{fig:timing-comparison:upplimit}. \par
\section{Discussion}
In Section \ref{sec:eval:expectations}, we anticipated that the simple query would pose a challenging case for prefetching. This expectation proved accurate and highlights that improper data distribution can degrade performance when utilizing the \texttt{Cache}. We therefore consider the chosen scenario well-suited, as it showcases both performance gains and losses, underscoring the importance of tuning parameters and choosing scenarios carefully to achieve positive outcomes. \par
We consider the necessity to distribute data across \gls{numa:node}s to be practical, given that developers commonly apply this optimization to leverage the available memory bandwidth of \glsentrylong{numa}s. Under this condition, the \texttt{Cache} has demonstrated its effectiveness by achieving a speed-up that lies between the baseline and the theoretical upper limit (see Table \ref{table:qdp-speedup}). \par
As stated in Section \ref{sec:design:cache}, the decision to design and implement a cache instead of focusing solely on prefetching was made to enhance the usefulness of this work's contribution. Our tests were conducted on a system with \gls{hbm}; other advancements in main memory technologies, such as \gls{nvram} or \gls{remotemem}, as mentioned in Chapter \ref{chap:intro}, were not considered. Although the public functions of the \texttt{Cache} are named with cache usage in mind, its utility extends beyond this scope, as the policy functions described in Section \ref{sec:design:accel-usage} provide flexibility. Potential applications include background copying of data from remote locations into faster local memory for computation, or replication to \gls{nvram} for data loss prevention. Therefore, we consider the increase in design complexity to be a worthwhile trade-off, providing a significant contribution to the field of heterogeneous memory systems. \par
%%% Local Variables:
%%% TeX-master: "diplom"
%%% End: