\chapter{Evaluation}
\label{chap:evaluation}

% Every work in our field includes a performance evaluation. This
% chapter should make clear which methods were applied to assess
% performance and which results were achieved. It is important not to
% merely present the reader with a few numbers, but also to provide a
% discussion of the results. It is recommended to first explain one's
% own expectations regarding the results and then to explain any
% deviations that were observed.
In this chapter, we define our expectations for applying the developed \texttt{Cache} to \glsentrylong{qdp} and then evaluate the observed results. The code used is described in more detail in Section \ref{sec:impl:application}. \par

\todo{improve the sections in this chapter}
\todo{qdp bench code for now uses only the 4 on-node dsas, check (given time) whether round-robinning to all available makes a positive difference}
\section{Benchmarked Task}
\label{sec:eval:bench}
The benchmark executes a simple query as illustrated in Figure \ref{fig:qdp-simple-query}. From here on, we use the notation \(SCAN_a\) for the pipeline that scans and subsequently filters column \texttt{a}, \(SCAN_b\) for the pipeline that prefetches column \texttt{b}, and \(AGGREGATE\) for the projection and final summation step. We use a column size of 4 GiB. The work is divided over multiple groups, and each group spawns threads for each pipeline step. For a fair comparison, each tested configuration uses 64 threads for the first stage (\(SCAN_a\) and \(SCAN_b\)) and 32 for the second stage (\(AGGREGATE\)), while being restricted to run on \gls{numa:node} 0 through pinning. For configurations not performing prefetching, \(SCAN_b\) is not executed. We measure the time spent in each pipeline, the cache hit rate, and the total processing time. \par
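To make the described division of work more concrete, the following sketch outlines how a configuration without prefetching might spawn its worker threads. It is a simplified illustration only; the function names, the group count and the derived per-group thread counts are assumptions for this example rather than values taken from the benchmark code. \par

\begin{verbatim}
// Simplified sketch (not the actual benchmark code): 64 threads for
// the first stage and 32 for the second, split across worker groups.
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t GROUP_COUNT  = 8;                // illustrative
constexpr std::size_t SCAN_THREADS = 64 / GROUP_COUNT; // first stage
constexpr std::size_t AGG_THREADS  = 32 / GROUP_COUNT; // second stage

// scan + filter on the group's chunk range of column a
void scan_a(std::size_t group, std::size_t thread) {}
// projection and final summation for the group's chunks
void aggregate(std::size_t group, std::size_t thread) {}

int main() {
    std::vector<std::thread> workers;
    for (std::size_t g = 0; g < GROUP_COUNT; ++g) {
        for (std::size_t t = 0; t < SCAN_THREADS; ++t)
            workers.emplace_back(scan_a, g, t);
        for (std::size_t t = 0; t < AGG_THREADS; ++t)
            workers.emplace_back(aggregate, g, t);
    }
    for (auto& w : workers) w.join(); // wait for query completion
    return 0;
}
\end{verbatim}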
Pipelines \(SCAN_a\) and \(SCAN_b\) execute concurrently, complete their workload and then signal \(AGGREGATE\), which finalizes the operation. With the goal of improving the cache hit rate, we loosened this synchronization and let \(SCAN_b\) work freely, synchronizing only \(SCAN_a\) with \(AGGREGATE\). Work is therefore submitted to the \gls{dsa} as frequently as possible, ideally completing each caching operation for a chunk of \texttt{b} before \(SCAN_a\) finishes processing the associated chunk of \texttt{a}. \par
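The loosened synchronization can be pictured with the following sketch, which is our own illustration under assumed names and chunk counts; the actual caching call of the \texttt{Cache} is only indicated as a comment. \par

\begin{verbatim}
// Illustration only: SCAN_b submits caching work for every chunk of b
// without waiting, while SCAN_a signals AGGREGATE per finished chunk
// of a. Names and the chunk count are assumptions for this sketch.
#include <atomic>
#include <cstddef>
#include <thread>

constexpr std::size_t CHUNK_COUNT = 1024;  // illustrative value
std::atomic<std::size_t> chunks_ready{0};  // SCAN_a -> AGGREGATE

void scan_b() {
    for (std::size_t chunk = 0; chunk < CHUNK_COUNT; ++chunk) {
        // submit asynchronous caching of this chunk of b to the DSA
        // through the Cache and continue immediately (call omitted)
    }
}

void scan_a() {
    for (std::size_t chunk = 0; chunk < CHUNK_COUNT; ++chunk) {
        // scan and filter this chunk of a ...
        chunks_ready.fetch_add(1, std::memory_order_release);
    }
}

void aggregate() {
    for (std::size_t chunk = 0; chunk < CHUNK_COUNT; ++chunk) {
        while (chunks_ready.load(std::memory_order_acquire) <= chunk) {
            // wait for SCAN_a to finish this chunk (spin or yield)
        }
        // project and sum, using the cached copy of b if present ...
    }
}

int main() {
    std::thread tb(scan_b), ta(scan_a), tagg(aggregate);
    tb.join(); ta.join(); tagg.join();
    return 0;
}
\end{verbatim}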
\section{Expectations}
\label{sec:eval:expectations}
The simple query presents a challenging scenario for the \texttt{Cache}. As the filter operation applied to column \texttt{a} is not particularly complex, its execution time can be assumed to be short. The \texttt{Cache} therefore has little time in which to prefetch, which amplifies delays caused by processing overhead in the \texttt{Cache} or during accelerator offload. Additionally, the task can be assumed to be memory bound. Because the prefetching of \texttt{b} in \(SCAN_b\) and the load and subsequent filter of \texttt{a} in \(SCAN_a\) execute in parallel, caching directly reduces the memory bandwidth available to \(SCAN_a\) when both columns are located on the same \gls{numa:node}. \par
Due to the challenges posed by sharing memory bandwidth, we benchmark prefetching in two configurations. In the first, both columns \texttt{a} and \texttt{b} are located on the same \gls{numa:node}. We expect this configuration to demonstrate the memory bottleneck, with the execution time of \(SCAN_a\) rising by the amount of time spent prefetching in \(SCAN_b\). In the second configuration, the columns are distributed over two \gls{numa:node}s, both still in \glsentryshort{dram}. Here, \(SCAN_a\) should only suffer from the additional threads performing the prefetching. \par
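Stated as a rough approximation, and purely as our own formalization of this expectation rather than a relation taken from measurements, the two configurations can be summarized as
\[
T_{\mathrm{same}}(SCAN_a) \approx T_{\mathrm{base}}(SCAN_a) + T(SCAN_b),
\qquad
T_{\mathrm{dist}}(SCAN_a) \approx T_{\mathrm{base}}(SCAN_a),
\]
where \(T_{\mathrm{base}}\) denotes the execution time without prefetching, and the second relation holds only up to the overhead introduced by the additional prefetching threads. \par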
\section{Observations}
\label{sec:eval:observations}
In this section, we present our findings from applying the \texttt{Cache} developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}. We begin with the results obtained without prefetching, which serve as a reference for evaluating the results achieved with our \texttt{Cache}. Two methods were benchmarked, representing a baseline and an upper limit for what we can achieve with prefetching. For the former, all columns were located in \glsentryshort{dram}. The latter is achieved by placing column \texttt{b} in \gls{hbm} during benchmark initialization, thereby simulating perfect prefetching without delay or overhead. \par
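The upper-limit configuration only requires that column \texttt{b} already resides on the \gls{hbm} node before the query starts. The following sketch illustrates one way such a placement could be performed with \texttt{libnuma}; the target node number corresponds to the \glsentryshort{hbm} accessor node for the executing Node 0, while the function name and copy step are merely illustrative. \par

\begin{verbatim}
// Sketch (assumption, not the benchmark code): allocate column b on
// the HBM node (node 8 for executing node 0) during initialization
// and copy the data there, simulating perfect prefetching.
// Requires libnuma (link with -lnuma).
#include <cstddef>
#include <cstring>
#include <numa.h>

constexpr int    HBM_NODE    = 8;          // HBM accessor of node 0
constexpr size_t COLUMN_SIZE = 4ull << 30; // 4 GiB column

void* place_column_b_on_hbm(const void* source) {
    void* mem = numa_alloc_onnode(COLUMN_SIZE, HBM_NODE);
    if (mem == nullptr) return nullptr;    // allocation failed
    std::memcpy(mem, source, COLUMN_SIZE); // populate before the query
    return mem;
}
\end{verbatim}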
\begin{table}[h!tb]
\centering
\input{tables/table-qdp-baseline.tex}
\caption{Raw timing for \gls{qdp} on \glsentryshort{dram} and \gls{hbm}. The result for \glsentryshort{dram} serves as the baseline, while \glsentryshort{hbm} presents the upper boundary achievable with perfect prefetching. Raw time is averaged over 5 iterations after a warm-up.}
\label{table:qdp-baseline}
\end{table}
Table \ref{table:qdp-baseline} shows that accessing column \texttt{b} through \gls{hbm} yields a clear increase in processing speed. We will now take a closer look at the time spent in the different pipeline stages to gain a better understanding of how this bandwidth improvement accelerates the query. \par
\begin{figure}[h!tb]
\centering
\begin{subfigure}[t]{0.75\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-dram.pdf}
\caption{Columns \texttt{a} and \texttt{b} located on the same \glsentryshort{dram} \glsentryshort{numa:node}.}
\label{fig:timing-comparison:baseline}
\end{subfigure}
\begin{subfigure}[t]{0.75\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-hbm.pdf}
\caption{Column \texttt{a} located in \glsentryshort{dram} and \texttt{b} in \glsentryshort{hbm}.}
\label{fig:timing-comparison:upplimit}
\end{subfigure}
\caption{Time spent on \(SCAN_a\) and \(AGGREGATE\) without prefetching for different locations of column \texttt{b}. Figure (a) represents the baseline using only \glsentryshort{dram}, while Figure (b) simulates perfect caching by storing column \texttt{b} in \glsentryshort{hbm} during benchmark setup.}
\label{fig:timing-comparison}
\end{figure}
Due to the higher bandwidth that \gls{hbm} provides to \(AGGREGATE\), the CPU waits less for data from main memory, improving effective processing time. This is evident in the overall time of \(AGGREGATE\) shortening in Figure \ref{fig:timing-comparison:upplimit} compared to the baseline in Figure \ref{fig:timing-comparison:baseline}. Therefore, more threads were tasked with \(SCAN_a\), which still accesses data from \glsentryshort{dram}, and fewer performed \(AGGREGATE\), compared to the scenario with data wholly in \glsentryshort{dram}. This is why the \gls{hbm} configuration outperforms \gls{dram} not only for \(AGGREGATE\) but also for \(SCAN_a\). The raw timing values for both can be found in the legends of Figure \ref{fig:timing-comparison}. \par
With the baseline and the upper limit established, we now evaluate the results obtained with prefetching enabled. \par
\begin{table}[h!tb]
\centering
\input{tables/table-qdp-speedup.tex}
\caption{Speedup of different \glsentryshort{qdp} configurations over \glsentryshort{dram}. The result for \glsentryshort{dram} serves as the baseline, while \glsentryshort{hbm} presents the upper boundary achievable with perfect prefetching. Prefetching was performed with the same parameters and data locations as \gls{dram}, caching on Node 8 (the \glsentryshort{hbm} accessor for the executing Node 0). For prefetching with distributed columns, columns \texttt{a} and \texttt{b} were located on different Nodes. Raw time is averaged over 5 iterations after a warm-up.}
\label{table:qdp-speedup}
\end{table}
The slowdown experienced when utilizing the \texttt{Cache} may be surprising. However, the result becomes reasonable when we consider that, in this scenario, the \gls{dsa}s executing the caching tasks compete for bandwidth with the \(SCAN_a\) pipeline threads, and that the \texttt{Cache} adds additional overhead. Distributing the columns across different \gls{numa:node}s then results in a noticeable performance increase compared with our baseline, while not reaching the upper boundary set by simulating perfect prefetching. This supports our assumption that the \(SCAN_a\) pipeline itself is bandwidth bound, as without this contention we see an increased cache hit rate and better performance. We now examine the performance in more detail with per-pipeline timings. \par
\begin{figure}[h!tb]
\centering
\begin{subfigure}[t]{0.75\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-prefetch.pdf}
\caption{Prefetching with columns \texttt{a} and \texttt{b} located on the same \glsentryshort{dram} \glsentryshort{numa:node}.}
\label{fig:timing-results:prefetch}
\end{subfigure}
\begin{subfigure}[t]{0.75\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-timing-distprefetch.pdf}
\caption{Prefetching with columns \texttt{a} and \texttt{b} located on different \glsentryshort{dram} \glsentryshort{numa:node}s.}
\label{fig:timing-results:distprefetch}
\end{subfigure}
\caption{Time spent on \(SCAN_a\), \(SCAN_b\) and \(AGGREGATE\) with prefetching. Operations \(SCAN_a\) and \(SCAN_b\) execute concurrently. Figure (a) shows the bandwidth limitation, as the time for \(SCAN_a\) increases drastically because the copying of column \texttt{b} to \glsentryshort{hbm} takes place in parallel. In Figure (b), the columns are located on different \glsentryshort{numa:node}s, so the \(SCAN\)-operations do not compete for bandwidth.}
\label{fig:timing-results}
\end{figure}
Note that, for both configurations, the sum of the per-pipeline timings does not match the total runtime; the difference stems from overhead within the pipelines and in the surrounding function that is not attributed to any individual stage. \par
In Figure \ref{fig:timing-results:prefetch}, the bandwidth competition between \(SCAN_a\) and \(SCAN_b\) is evident, with \(SCAN_a\) exhibiting significantly longer execution times. This extended duration results in longer overlaps between groups still processing \(SCAN_a\) and those already engaged in \(AGGREGATE\). Consequently, despite the relatively high cache hit rate, there is minimal speedup observed for \(AGGREGATE\) compared with the baseline in Figure \ref{fig:timing-comparison:baseline}. The long total runtime is then explained by the extended duration of \(SCAN_a\). \par
For the benchmark shown in Figure \ref{fig:timing-results:distprefetch}, we distributed columns \texttt{a} and \texttt{b} across two \gls{numa:node}s, so that the parallel execution of prefetching tasks on the \gls{dsa} does not directly impede the bandwidth available to \(SCAN_a\). However, there is a discernible increase in CPU cycles due to the overhead associated with cache utilization, visible as time spent in \(SCAN_b\). As a result, both \(SCAN_a\) and \(AGGREGATE\) experience slightly longer execution times. Thread assignment and count also play a role here but are not the focus of this work, as they are analysed in detail by André Berthold et al. in \cite{dimes-prefetching}. Therefore, all displayed results stem from the best-performing configuration benchmarked \todo{maybe leave this out?}. \par
\section{Discussion}
\begin{itemize}
\item Is the test a good choice? (yes, shows worst case with memory bound application and increases effects of overhead)
\item Are there other use cases for the cache? (yes, applications that are CPU bound -> DSA frees up CPU cycles)
\item Can we assume that it is possible to distribute columns over multiple nodes?
\end{itemize}

\todo{write this section}
%%% Local Variables:
%%% TeX-master: "diplom"
%%% End: