This contains my bachelors thesis and associated tex files, code snippets and maybe more. Topic: Data Movement in Heterogeneous Memories with Intel Data Streaming Accelerator

\chapter{Evaluation}
\label{chap:evaluation}
% Every thesis in our field includes a performance evaluation. This
% chapter should make clear which methods were used to evaluate
% performance and which results were obtained. It is important not
% only to present the reader with a few numbers, but also to discuss
% the results. It is recommended to first explain one's own
% expectations regarding the results and then to explain any
% deviations observed.
In this chapter, we describe the benchmark, state our expectations for integrating the developed \texttt{Cache} into \glsentrylong{qdp}, and then assess the results achieved. The implementation of the benchmark is described in Section \ref{sec:impl:application}. We conclude with a discussion of the \texttt{Cache} and the results observed. \par
\section{Benchmarked Task}
\label{sec:eval:bench}
The benchmark executes a simple query, which, as depicted in the execution plan in Figure \ref{fig:qdp-simple-query}, is divided into multiple processing stages. We use the following notation for the three tasks performed in executing the query: (1) \(SCAN_a\), responsible for scanning and subsequently filtering column \texttt{a}; (2) \(SCAN_b\), tasked with prefetching column \texttt{b}; (3) \(AGGREGATE\), the projection and final summation step. The column size is set to \(4\ GiB\). The workload is distributed across multiple groups, with each group spawning threads for every step. To ensure a fair comparison, each tested configuration employs 64 threads for the initial stage (\(SCAN_a\) and \(SCAN_b\)) and 32 for the subsequent stage (\(AGGREGATE\)), and is constrained to execute on \gls{numa:node} 0 through pinning. For configurations without prefetching, \(SCAN_b\) is omitted. We measure the total and per-task duration and, for prefetching, the cache hit ratio over five iterations and report the average. These measurements are preceded by five warm-up runs, for which no timings are recorded. \par
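As a sketch of how the reported values are presumably formed, let \(t_i\) denote a duration measured in timed iteration \(i\), and let \(h\) and \(m\) denote the number of accesses to column \texttt{b} that were and were not served from the \texttt{Cache}. These symbols are introduced here for illustration only and are not taken from the benchmark code. The averaged duration and the cache hit ratio then read
\[
  \bar{t} = \frac{1}{5} \sum_{i=1}^{5} t_i, \qquad r_{hit} = \frac{h}{h + m}.
\]
The five warm-up runs do not enter the sum. \par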
The tasks \(SCAN_a\) and \(SCAN_b\) execute concurrently and both complete before signalling \(AGGREGATE\) for finalization. To improve the cache hit rate, we relaxed this constraint: \(SCAN_b\) operates independently, and only \(SCAN_a\) is synchronized with \(AGGREGATE\). The resulting burst submission could cause the work queue of the \gls{dsa} to overflow. For the benchmarks utilizing \gls{qdp}, we therefore increased the size of the processed chunks, which reduces the number of cached blocks. \par
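The effect of the chunk size on the number of cached blocks can be illustrated with a short calculation. The symbols \(S\) for the column size and \(c\) for the chunk size, as well as the chunk sizes used below, are chosen for illustration and are not the values used in the benchmark. Assuming one cache entry per chunk, the number of blocks submitted for column \texttt{b} is
\[
  n = \left\lceil \frac{S}{c} \right\rceil .
\]
For \(S = 4\ GiB\), a hypothetical chunk size of \(c = 1\ MiB\) yields \(n = 4096\) blocks, whereas \(c = 16\ MiB\) yields \(n = 256\), reducing the number of submissions by a factor of 16. \par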
\section{Expectations}
\label{sec:eval:expectations}
The simple query presents a challenging scenario for the \texttt{Cache}. The execution time for the filter operation applied to column \texttt{a} is expected to be brief. Consequently, the \texttt{Cache} has limited time for prefetching, which may exacerbate delays caused by processing overhead in the \texttt{Cache} or during accelerator offload. Furthermore, it can be assumed that \(SCAN_a\) is memory-bound by itself. Since the prefetching of \texttt{b} in \(SCAN_b\) and the loading and subsequent filtering of \texttt{a} occur concurrently, caching directly reduces the memory bandwidth available to \(SCAN_a\) when both columns are located on the same \gls{numa:node}. \par
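This contention can be expressed with a deliberately simplified model. The symbols are assumptions made for illustration, no measured bandwidth values are implied, and effects such as processor caches and interconnect limits are ignored. If the \gls{numa:node} holding both columns provides a peak bandwidth \(B\), of which the copy operations for column \texttt{b} consume \(B_b\), the bandwidth \(B_a\) left for \(SCAN_a\) and the resulting lower bound on its execution time \(t_a\) for a column of size \(S\) are
\[
  B_a \le B - B_b, \qquad t_a \ge \frac{S}{B - B_b}.
\]
Under this model, the bound on \(t_a\) grows as \(B_b\) increases. If column \texttt{b} resides on a different \gls{numa:node}, \(B_b\) no longer counts against \(B\), and the bound reduces to \(S / B\). \par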
\section{Observations}
\label{sec:eval:observations}
\begin{figure}[t]
\centering
\begin{subfigure}[t]{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-dram.pdf}
\caption{Columns \texttt{a} and \texttt{b} located on the same \glsentryshort{dram} \glsentryshort{numa:node}.}
\label{fig:timing-comparison:baseline}
\end{subfigure}
\hspace{5mm}
\begin{subfigure}[t]{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-hbm.pdf}
\caption{Column \texttt{a} located in \glsentryshort{dram} and \texttt{b} in \glsentryshort{hbm}.}
\label{fig:timing-comparison:upplimit}
\end{subfigure}
\caption{Time spent on functions \(SCAN_a\) and \(AGGREGATE\) without prefetching for different locations of column \texttt{b}. Figure (a) represents the lower boundary by using only \glsentryshort{dram}, while Figure (b) simulates perfect caching by storing column \texttt{b} in \glsentryshort{hbm} during benchmark setup.}
\label{fig:timing-comparison}
\end{figure}
\begin{table}[t]
\centering
\input{tables/table-qdp-baseline.tex}
\caption{Table showing raw timing for \gls{qdp} on \glsentryshort{dram} and \gls{hbm}. Result for \glsentryshort{dram} serves as baseline. \glsentryshort{hbm} presents the upper boundary achievable, as it simulates prefetching without processing overhead and delay.}
\label{table:qdp-baseline}
\end{table}
In this section, we present our findings from integrating the \texttt{Cache} developed in Chapters \ref{chap:design} and \ref{chap:implementation} into \gls{qdp}. We begin with results obtained without prefetching, which serve as a reference for evaluating the effectiveness of our \texttt{Cache}. For all results presented, the number of threads per processing stage and the number of groups influence performance \cite{dimes-prefetching}. Tuning these parameters is, however, out of scope for this work. Therefore, the results shown are for the best configurations measured. \par
The plots for detailed timing are normalized so that the longest-running configuration fills the half-circle. As waiting times at the barriers, which can vary by workload, are not displayed, the graphs do not fully represent the total execution time. Additionally, the total runtime includes some overhead that the per-task timings do not cover. Therefore, a discrepancy between the raw values in the tables and the values shown in the figures may be observed. \par
\subsection{Benchmarks without Prefetching}
We benchmarked two methods to establish a baseline and an upper limit as reference points. In the former, all columns are located in \glsentryshort{dram}. The latter method simulates perfect prefetching without delay and overhead by placing column \texttt{b} in \gls{hbm} during benchmark initialization. \par
From Table \ref{table:qdp-baseline}, it is evident that accessing column \texttt{b} through \gls{hbm} results in an increase in processing speed. To gain a better understanding of how the increased bandwidth of \gls{hbm} accelerates the query, we will delve deeper into the time spent in the different stages of the query execution plan. \par
Due to the higher bandwidth provided by accessing column \texttt{b} through \gls{hbm} in \(AGGREGATE\), processing time is shortened, as demonstrated by comparing Figure \ref{fig:timing-comparison:upplimit} to the baseline depicted in Figure \ref{fig:timing-comparison:baseline}. Consequently, more threads can be assigned to \(SCAN_a\), as \(AGGREGATE\) requires fewer resources. This explains why the \gls{hbm} results show faster processing times than \gls{dram} not only for \(AGGREGATE\) but also for \(SCAN_a\). \par
\subsection{Benchmarks using Prefetching}
To address the challenges posed by sharing memory bandwidth between both \(SCAN\)-operations, we conduct the prefetching benchmark in two configurations. In the first, both columns \texttt{a} and \texttt{b} are located on the same \gls{numa:node}. We expect this scenario to expose the memory bottleneck through an increased execution time of \(SCAN_a\). In the second, we distribute the columns across two \gls{numa:node}s, both still utilizing \glsentryshort{dram}. In this configuration, the memory bottleneck is alleviated, leading us to anticipate better performance than in the first setup. \par
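The comparison against the baseline is expressed as a speedup in Table \ref{table:qdp-speedup}. We read these values under the usual definition, which we assume matches the one used to compute the table. The symbols are introduced here for illustration. With \(T_{DRAM}\) denoting the total runtime of the baseline and \(T_c\) that of a configuration \(c\),
\[
  s_c = \frac{T_{DRAM}}{T_c},
\]
so that \(s_c > 1\) indicates a gain and \(s_c < 1\) a slowdown relative to the baseline. \par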
\begin{table}[t]
\centering
\input{tables/table-qdp-speedup.tex}
\caption{Table showing Speedup for different \glsentryshort{qdp} Configurations over \glsentryshort{dram}. Result for \glsentryshort{dram} serves as baseline while \glsentryshort{hbm} presents the upper boundary achievable with perfect prefetching. Prefetching was performed with the same parameters and data locations as \gls{dram}, caching on Node 8 (\glsentryshort{hbm} accessor for the executing Node 0). Prefetching with Distributed Columns had columns \texttt{a} and \texttt{b} located on different Nodes.}
\label{table:qdp-speedup}
\end{table}
We now examine Table \ref{table:qdp-speedup}, where a slowdown is shown for prefetching. This drop below our baseline when utilizing the \texttt{Cache} may be surprising at first glance. However, it becomes reasonable when we consider that in this scenario, the \gls{dsa}s executing the caching tasks compete for bandwidth with the threads processing \(SCAN_a\), while the \texttt{Cache} and work submission add further overhead. The second measured configuration for \gls{qdp} is shown as \enquote{Prefetching, Distributed Columns} in Table \ref{table:qdp-speedup}. For this method, distributing the columns across different \gls{numa:node}s results in a noticeable performance increase compared to our baseline, although it does not reach the upper boundary set by simulating perfect prefetching (\enquote{HBM (Upper Limit)} in Table \ref{table:qdp-speedup}). This confirms our assumption that \(SCAN_a\) itself is bandwidth-bound, as without this contention, we observe an increase in cache hit rate and a decrease in processing time. We will now examine the performance in more detail with per-task timings. \par
\begin{figure}[t]
\centering
\begin{subfigure}[t]{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-prefetch.pdf}
\caption{Prefetching with columns \texttt{a} and \texttt{b} located on the same \glsentryshort{dram} \glsentryshort{numa:node}.}
\label{fig:timing-results:prefetch}
\end{subfigure}
\hspace{5mm}
\begin{subfigure}[t]{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-distprefetch.pdf}
\caption{Prefetching with columns \texttt{a} and \texttt{b} located on different \glsentryshort{dram} \glsentryshort{numa:node}s.}
\label{fig:timing-results:distprefetch}
\end{subfigure}
\caption{Time spent on functions \(SCAN_a\), \(SCAN_b\) and \(AGGREGATE\) with prefetching. Operations \(SCAN_a\) and \(SCAN_b\) execute concurrently. Figure (a) shows the bandwidth limitation, as the time for \(SCAN_a\) increases drastically due to the copying of column \texttt{b} to \glsentryshort{hbm} taking place in parallel. For Figure (b), the columns are located on different \glsentryshort{numa:node}s, so the \(SCAN\)-operations do not compete for bandwidth.}
\label{fig:timing-results}
\end{figure}
In Figure \ref{fig:timing-results:prefetch}, the competition for bandwidth between \(SCAN_a\) and \(SCAN_b\) is evident, with \(SCAN_a\) showing significantly longer execution times. \(SCAN_b\) is nearly unaffected, as it offloads memory access to the \gls{dsa} through the \texttt{Cache}, thereby not showing extended runtime from the throughput bottleneck. This prolonged duration of execution in \(SCAN_a\) leads to extended overlaps between groups still processing the \(SCAN\)-stage and those engaged in \(AGGREGATE\). Consequently, despite the relatively high cache hit rate (see Table \ref{table:qdp-speedup}), minimal speed-up is observed for \(AGGREGATE\) compared to the baseline depicted in Figure \ref{fig:timing-comparison:baseline}. The extended runtime can be attributed to the prolonged duration of \(SCAN_a\). \par
In the benchmark depicted in Figure \ref{fig:timing-results:distprefetch}, where we distributed columns \texttt{a} and \texttt{b} across two nodes, the parallel execution of prefetching tasks on the \gls{dsa} does not directly reduce the bandwidth available to \(SCAN_a\). However, there is a discernible overhead associated with cache utilization, as evident in the time spent in \(SCAN_b\). Consequently, both \(SCAN_a\) and \(AGGREGATE\) take slightly longer than in the upper limit shown in Figure \ref{fig:timing-comparison:upplimit}. \par
\section{Discussion}
In Section \ref{sec:eval:expectations}, we anticipated that the simple query would pose a challenging case for prefetching. This expectation proved to be accurate, highlighting that improper data distribution can lead to adverse effects on performance when utilizing the \texttt{Cache}. Thus, we consider the chosen scenario to be well-suited, as it showcases both performance gains and losses, underscoring the importance of optimizing parameters and scenarios to achieve positive outcomes. \par
We consider the need to distribute data across \gls{numa:node}s to be practical, given that developers commonly apply this optimization to leverage the available memory bandwidth of \glsentrylong{numa} systems. In this setting, the \texttt{Cache} has demonstrated its effectiveness by achieving a respectable speed-up, positioned between the baseline and the theoretical upper limit (see Table \ref{table:qdp-speedup}). \par
As stated in Chapter \ref{chap:design}, the decision to design and implement a cache instead of focusing solely on prefetching was made to enhance the usefulness of this work's contribution. While our tests were conducted on a system with \gls{hbm}, other advancements in main memory technologies, such as \gls{nvram}, were not considered. Despite the methods of the \texttt{Cache}-class being named with usage as a cache in mind, its utility extends beyond this scope, providing flexibility through the policy functions, described in Section \ref{subsec:design:policy-functions}. Potential applications include replication to \gls{nvram} for data loss prevention, or restoring from \gls{nvram} for faster processing. Therefore, we consider the increase in design complexity to be a worthwhile trade-off, providing a significant contribution to the field of heterogeneous memory systems. \par
%%% Local Variables:
%%% TeX-master: "diplom"
%%% End: