\chapter{Evaluation}
\label{chap:evaluation}
% Every work in our field includes a performance evaluation. This
% chapter should make clear which methods were applied to evaluate
% performance and which results were achieved. It is important not to
% merely present the reader with a few numbers, but also to discuss the
% results. It is recommended to first explain one's own expectations
% regarding the results and then to explain any deviations that were
% observed.
In this chapter, we establish anticipated outcomes for incorporating the developed \texttt{Cache} into \glsentrylong{qdp}, followed by a comprehensive assessment of the achieved results. The specifics of the benchmark are elaborated upon in Section \ref{sec:impl:application}. We conclude with a discussion of the choices made regarding the benchmarking methodology, the suitability of the chosen scenario and the applicability of the \texttt{Cache} beyond prefetching. \par
\section{Benchmarked Task}
\label{sec:eval:bench}
The benchmark involves the execution of a simple query, as depicted in Figure \ref{fig:qdp-simple-query}. We will henceforth denote \(SCAN_a\) as the pipeline responsible for scanning and subsequently filtering column \texttt{a}, \(SCAN_b\) as the pipeline tasked with prefetching column \texttt{b} and \(AGGREGATE\) as the projection and final summation step. The column size is set to 4 GiB. The workload is distributed across multiple groups, with each group spawning threads for every pipeline step. To ensure a fair comparison, each tested configuration employs 64 threads for the initial stage (\(SCAN_a\) and \(SCAN_b\)) and 32 for the subsequent stage (\(AGGREGATE\)), while being constrained to execute on \gls{numa:node} 0 through pinning. For configurations without prefetching, \(SCAN_b\) is omitted. We measure the total runtime, the duration of each pipeline stage and, for the prefetching configurations, the cache hit rate. \par
The pipelines \(SCAN_a\) and \(SCAN_b\) execute concurrently, completing their tasks before signalling \(AGGREGATE\) for finalization. In a bid to enhance the cache hit rate, we opted to relax this constraint, allowing \(SCAN_b\) to operate independently, while only synchronizing \(SCAN_a\) with \(AGGREGATE\). Consequently, work is submitted to the \gls{dsa} as frequently as possible, aiming to complete caching operations for a chunk of \texttt{b} before \(SCAN_a\) finalizes processing the corresponding part of \texttt{a}. This relaxation has one of two side effects: either the rapid submission overflows the \gls{dsa} work queue, or larger task sizes must be chosen to prevent this overflow. \par
Each configuration is benchmarked over five iterations, preceded by a warm-up run; the raw times reported in Tables \ref{table:qdp-baseline} and \ref{table:qdp-speedup} are averaged over these iterations. \par
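To make the pipeline structure more tangible, the following sketch outlines how the stages described above might be organized in code. It is a minimal, hypothetical example: the interface \texttt{Cache::Access}, the chunk size and the filter predicate are illustrative and do not correspond to the actual benchmark implementation described in Section \ref{sec:impl:application}. \par
\begin{verbatim}
// Hypothetical sketch of the benchmarked pipeline structure; all names,
// the chunk size and the filter predicate are illustrative only.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

constexpr std::size_t CHUNK = std::size_t{1} << 20;  // elements per chunk

// Stand-in for the Cache developed in this work.
struct Cache {
    void Access(const uint64_t* /*data*/, std::size_t /*bytes*/) {
        // would submit an asynchronous copy task to the DSA here
    }
};

// SCAN_b: runs independently, submitting caching work as early as possible.
void scan_b(Cache& cache, const std::vector<uint64_t>& b) {
    for (std::size_t i = 0; i < b.size(); i += CHUNK)
        cache.Access(b.data() + i, CHUNK * sizeof(uint64_t));
}

// SCAN_a: filters column a chunk by chunk and signals AGGREGATE per chunk.
void scan_a(const std::vector<uint64_t>& a, std::vector<uint8_t>& mask,
            std::atomic<std::size_t>& chunks_ready) {
    for (std::size_t i = 0; i < a.size(); i += CHUNK) {
        for (std::size_t j = i; j < i + CHUNK && j < a.size(); ++j)
            mask[j] = (a[j] < 42);  // illustrative filter predicate
        chunks_ready.fetch_add(1, std::memory_order_release);
    }
}

// AGGREGATE: sums the qualifying entries of b; in the benchmark these
// reads would ideally be served from the cached copy in HBM.
uint64_t aggregate(const std::vector<uint64_t>& b,
                   const std::vector<uint8_t>& mask) {
    uint64_t sum = 0;
    for (std::size_t i = 0; i < b.size(); ++i)
        if (mask[i]) sum += b[i];
    return sum;
}

int main() {
    const std::size_t n = std::size_t{1} << 22;  // reduced size for the sketch
    std::vector<uint64_t> a(n, 1), b(n, 2);
    std::vector<uint8_t> mask(n, 0);
    std::atomic<std::size_t> chunks_ready{0};
    Cache cache;

    std::thread tb(scan_b, std::ref(cache), std::cref(b));
    std::thread ta(scan_a, std::cref(a), std::ref(mask), std::ref(chunks_ready));
    ta.join();
    tb.join();
    return aggregate(b, mask) > 0 ? 0 : 1;
}
\end{verbatim}
In this sketch, \(SCAN_b\) submits all caching requests without waiting for \(SCAN_a\), mirroring the relaxed synchronization described above. \par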
\section{Expectations}
\label{sec:eval:expectations}
The simple query presents a challenging scenario for the \texttt{Cache}. The execution time for the filter operation applied to column \texttt{a} is expected to be brief. Consequently, the \texttt{Cache} has limited time for prefetching, which may exacerbate delays caused by processing overhead in the \texttt{Cache} or during accelerator offload. Furthermore, it can be assumed that \(SCAN_a\) is memory-bound by itself. Since the prefetching of \texttt{b} in \(SCAN_b\) and the loading and subsequent filtering of \texttt{a} occur concurrently, caching directly reduces the memory bandwidth available to \(SCAN_a\) when both columns are located on the same \gls{numa:node}. \par
\section{Observations}
\label{sec:eval:observations}
In this section, we will present our findings from integrating the \texttt{Cache} developed in Chapters \ref{chap:design} and \ref{chap:implementation} into \gls{qdp}. We commence by presenting results obtained without prefetching, which serve as a reference for evaluating the effectiveness of our \texttt{Cache}. The number of threads per pipeline and the number of groups influence performance \cite{dimes-prefetching}; as exploring this parameter space is beyond the scope of this work, we only report the best benchmarked result for each configuration. \par
\subsection{Benchmarks without Prefetching}
We benchmarked two methods to establish a baseline and an upper limit as reference points. In the former, all columns are located in \glsentryshort{dram}. The latter method simulates perfect prefetching without delay and overhead by placing column \texttt{b} in \gls{hbm} during benchmark initialization. \par
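The following sketch illustrates how these two placements could be realized with \texttt{libnuma}. It is a simplified, hypothetical example rather than the actual benchmark initialization; node numbers follow the evaluated system, where Node 8 is the \glsentryshort{hbm} node associated with the executing Node 0 (see Table \ref{table:qdp-speedup}). \par
\begin{verbatim}
// Simplified sketch of the two placements of column b using libnuma;
// the actual benchmark initialization may differ. Compile with -lnuma.
#include <numa.h>
#include <cstddef>
#include <cstdint>

int main() {
    const std::size_t column_bytes = std::size_t{4} << 30;  // 4 GiB column

    if (numa_available() < 0) return 1;  // libnuma not usable on this system

    // Baseline: column b resides on the executing DRAM node (Node 0).
    auto* b_dram = static_cast<uint64_t*>(numa_alloc_onnode(column_bytes, 0));

    // Upper limit: column b is placed in HBM (Node 8 on the evaluated
    // system) during initialization, simulating perfect prefetching.
    auto* b_hbm = static_cast<uint64_t*>(numa_alloc_onnode(column_bytes, 8));

    numa_free(b_dram, column_bytes);
    numa_free(b_hbm, column_bytes);
    return 0;
}
\end{verbatim}
Allocating column \texttt{b} directly on the \glsentryshort{hbm} node removes any caching overhead from the measurement, which is why this configuration serves as the upper boundary. \par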
\begin{table}[!t]
\centering
\input{tables/table-qdp-baseline.tex}
\caption{Raw timings for \gls{qdp} on \glsentryshort{dram} and \gls{hbm}. The result for \glsentryshort{dram} serves as the baseline, while \glsentryshort{hbm} represents the upper boundary achievable with perfect prefetching. Raw Time is averaged over 5 iterations following a warm-up.}
\label{table:qdp-baseline}
\end{table}
From Table \ref{table:qdp-baseline}, it is evident that accessing column \texttt{b} through \gls{hbm} results in an increase in processing speed. To gain a better understanding of how the increased bandwidth of \gls{hbm} accelerates the query, we will delve deeper into the time spent in the different pipeline stages. \par
The following plots are normalized so that the longest execution from Figures \ref{fig:timing-comparison} and \ref{fig:timing-results} fills the half-circle. As waiting times at the barriers, which can vary by workload, are not displayed here, the graphs do not fully represent the total execution time. Additionally, the total runtime encompasses some overhead that the per-pipeline timings do not cover. Therefore, the raw runtime values from the Tables may deviate from the timings shown in the Figures. \par
\begin{figure}[!t]
\centering
\begin{subfigure}[!t]{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-dram.pdf}
\caption{Columns \texttt{a} and \texttt{b} located on the same \glsentryshort{dram} \glsentryshort{numa:node}.}
\label{fig:timing-comparison:baseline}
\end{subfigure}
\hspace{5mm}
\begin{subfigure}[!t]{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-hbm.pdf}
\caption{Column \texttt{a} located in \glsentryshort{dram} and \texttt{b} in \glsentryshort{hbm}.}
\label{fig:timing-comparison:upplimit}
\end{subfigure}
\caption{Time spent on functions \(SCAN_a\) and \(AGGREGATE\) without prefetching for different locations of column \texttt{b}. Figure (a) represents the lower boundary by using only \glsentryshort{dram}, while Figure (b) simulates perfect caching by storing column \texttt{b} in \glsentryshort{hbm} during benchmark setup.}
\label{fig:timing-comparison}
\end{figure}
Due to the higher bandwidth provided by \gls{hbm} for \(AGGREGATE\), the CPU spends less time waiting for data from main memory, thereby improving processing times. This is evident in the overall shorter time taken for \(AGGREGATE\) in Figure \ref{fig:timing-comparison:upplimit} compared to the baseline depicted in Figure \ref{fig:timing-comparison:baseline}. Consequently, more threads can be assigned to \(SCAN_a\), with \(AGGREGATE\) requiring fewer resources. This explains why the \gls{hbm}-results not only show faster processing times than \gls{dram} for \(AGGREGATE\) but also for \(SCAN_a\). \par
\subsection{Benchmarks using Prefetching}
To address the challenges posed by sharing memory bandwidth between both \(SCAN\)-operations, we conduct the prefetching benchmark in two configurations. In the first, both columns \texttt{a} and \texttt{b} are situated on the same \gls{numa:node}. We anticipate that this scenario will expose the memory bottleneck through an increased execution time of \(SCAN_a\). In the second, we distribute the columns across two \gls{numa:node}s, both still utilizing \glsentryshort{dram}. In this configuration, the memory bottleneck is alleviated, leading us to anticipate better performance compared to the former setup. \par
\begin{table}[!t]
\centering
\input{tables/table-qdp-speedup.tex}
\caption{Speedup of different \glsentryshort{qdp} configurations relative to \glsentryshort{dram}. The result for \glsentryshort{dram} serves as the baseline, while \glsentryshort{hbm} represents the upper boundary achievable with perfect prefetching. Prefetching was performed with the same parameters and data locations as \gls{dram}, caching onto Node 8 (the \glsentryshort{hbm} accessor for the executing Node 0). For Prefetching with Distributed Columns, columns \texttt{a} and \texttt{b} were located on different Nodes. Raw Time is averaged over 5 iterations following a warm-up.}
\label{table:qdp-speedup}
\end{table}
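The speedup reported in Table \ref{table:qdp-speedup} is assumed to follow the conventional definition, relating the runtime of the \glsentryshort{dram} baseline to the runtime of the respective configuration:
\[ S_{\text{config}} = \frac{T_{\text{DRAM}}}{T_{\text{config}}} \]
A value above one therefore indicates a configuration that is faster than the baseline, while a value below one indicates a slowdown. \par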
The slowdown below our baseline observed in Table \ref{table:qdp-speedup} when utilizing the \texttt{Cache} may be surprising at first glance. However, this result becomes reasonable when we consider that, in this scenario, the \gls{dsa}s executing the caching tasks compete for bandwidth with the \(SCAN_a\) pipeline threads, while the \texttt{Cache} adds processing overhead of its own. Distributing the columns across different \gls{numa:node}s then results in a noticeable performance increase compared to our baseline, although it does not reach the upper boundary set by simulating perfect prefetching. This confirms our assumption that the \(SCAN_a\) pipeline itself is bandwidth-bound, as without this contention we observe an increase in cache hit rate and a decrease in processing time. We will now examine the performance in more detail with per-pipeline timings. \par
\begin{figure}[!t]
\centering
\begin{subfigure}[!t]{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-prefetch.pdf}
\caption{Prefetching with columns \texttt{a} and \texttt{b} located on the same \glsentryshort{dram} \glsentryshort{numa:node}.}
\label{fig:timing-results:prefetch}
\end{subfigure}
\hspace{5mm}
\begin{subfigure}[!t]{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-distprefetch.pdf}
\caption{Prefetching with columns \texttt{a} and \texttt{b} located on different \glsentryshort{dram} \glsentryshort{numa:node}s.}
\label{fig:timing-results:distprefetch}
\end{subfigure}
\caption{Time spent on functions \(SCAN_a\), \(SCAN_b\) and \(AGGREGATE\) with prefetching. Operations \(SCAN_a\) and \(SCAN_b\) execute concurrently. Figure (a) shows the bandwidth limitation, as the time for \(SCAN_a\) increases drastically due to the copying of column \texttt{b} to \glsentryshort{hbm} taking place in parallel. For Figure (b), the columns are located on different \glsentryshort{numa:node}s, so that the \(SCAN\)-operations do not compete for bandwidth.}
\label{fig:timing-results}
\end{figure}
In Figure \ref{fig:timing-results:prefetch}, the competition for bandwidth between \(SCAN_a\) and \(SCAN_b\) is evident, with \(SCAN_a\) showing significantly longer execution times. This prolonged execution leads to extended overlaps between groups still processing \(SCAN_a\) and those already engaged in \(AGGREGATE\). Consequently, despite the relatively high cache hit rate, minimal speed-up is observed for \(AGGREGATE\) compared to the baseline depicted in Figure \ref{fig:timing-comparison:baseline}, and the extended total runtime can be attributed to the prolonged duration of \(SCAN_a\). Notably, \(SCAN_b\) takes a similar amount of time in both prefetching configurations: since the \gls{dsa} itself performs the copy, \(SCAN_b\) is not directly affected by the bandwidth contention and merely reflects the overhead of caching and work submission. \par
Regarding the benchmark depicted in Figure \ref{fig:timing-results:distprefetch}, where we distributed columns \texttt{a} and \texttt{b} across two nodes, the parallel execution of prefetching tasks on the \gls{dsa} does not directly impede the bandwidth available to \(SCAN_a\). However, there is a discernible overhead associated with cache utilization, as evident in the time spent in \(SCAN_b\). Consequently, both \(SCAN_a\) and \(AGGREGATE\) take slightly longer than the upper limit shown in Figure \ref{fig:timing-comparison:upplimit}. \par
\section{Discussion}
In Section \ref{sec:eval:expectations}, we anticipated that the simple query would pose a challenging case for prefetching. This expectation proved to be accurate, highlighting that improper data distribution can lead to adverse effects on performance when utilizing the \texttt{Cache}. Thus, we consider the chosen scenario to be well-suited, as it showcases both performance gains and losses, underscoring the importance of optimizing parameters and scenarios to achieve positive outcomes. \par
We consider the requirement to distribute data across \gls{numa:node}s to be practical, given that developers commonly apply this optimization to leverage the available memory bandwidth of \glsentrylong{numa}s. With the columns distributed, the \texttt{Cache} demonstrated its effectiveness by achieving a respectable speed-up positioned directly between the baseline and the theoretical upper limit (see Table \ref{table:qdp-speedup}). \par
As stated in Section \ref{sec:design:cache}, the decision to design and implement a cache instead of focusing solely on prefetching was made to enhance the usefulness of this work's contribution. While our tests were conducted on a system with \gls{hbm}, other advancements in main memory technologies, such as Non-Volatile or Remote Memory, as mentioned in Chapter \ref{chap:intro}, were not considered. Despite the public functions of the \texttt{Cache} being named with cache usage in mind, its utility extends beyond this scope, providing flexibility through the policy functions, described in Section \ref{sec:design:accel-usage}. Potential applications include background copying of data from remote locations into faster local memory for computation or replication to non-volatile memory for data loss prevention. Therefore, we consider the increase in design complexity to be a worthwhile trade-off, providing a significant contribution to the field of heterogeneous memory systems. \par
%%% Local Variables:
%%% TeX-master: "diplom"
%%% End: