% bezüglich der Ergebnisse zu erläutern und anschließend eventuell
% festgestellte Abweichungen zu erklären.
In this chapter we will define our expectations, applying the developed Cache to \glsentrylong{qdp}. To measure the performance, we adapted code developed by colleagues André Berthold and Anna Bartuschka for evaluating \gls{qdp} in \cite{dimes-prefetching}. \par
In this chapter we will define our expectations, applying the developed Cache to \glsentrylong{qdp}, and then evaluate the observed results. The code used is described in more detail in Section \ref{sec:impl:application}. \par
\section{Benchmarked Task}
\label{sec:eval:bench}
The benchmark executes a simple query as illustrated in Figure \ref{fig:qdp-simple-query}. We will from hereinafter use notations \(SCAN_a\) for the pipeline that performs scan and subsequently filter on column \texttt{a}, \(SCAN_b\) for the pipeline that prefetches column \texttt{b} and \(AGGREGATE\) for the projection and final summation step. We use a column size of 4 GiB and divide this work over 32 groups, assigning one thread to each pipeline per group \todo{maybe describe that the high group count was chosen to allow the implementation without overlapping pipeline execution to still perform well}\todo{consider moving to overlapping pipelines if enough time is found}. For configurations not performing prefetching, \(SCAN_b\) is not executed. We measure the times spent in each pipeline, cache hit percentage and total processing time. \par
Pipelines \(SCAN_a\) and \(SCAN_b\) execute concurrently, completing their workload and then signalling \(AGGREGATE\), which then finalizes the operation. With the goal of improving cache hit rate, we decided to loosen this restriction and let \(SCAN_b\) work freely, only synchronizing \(SCAN_a\) with \(AGGREGATE\). Work is therefore submitted to the \gls{dsa} as frequently as possible, thereby hopefully completing each caching operation for a chunk of \texttt{b} before \(SCAN_a\) finishes processing the associated chunk of \texttt{a}. \par
To ensure
\section{Expectations}
\label{sec:eval:expectations}
The simple query presents a challenging scenario to the \texttt{Cache}. As the filter operation applied to column \texttt{a} is not particularly complex, its execution time can be assumed to be short. Therefore, the \texttt{Cache} has little time during which it must prefetch, which will amplify delays caused by processing overhead in the \texttt{Cache} or during accelerator offload. Additionally, it can be assumed that the task is memory bound. As the prefetching of \texttt{b} in \(SCAN_b\) and the load and subsequent filter of \texttt{a} in \(SCAN_a\) will execute in parallel, caching therefore directly reduces the memory bandwidth available to \(SCAN_a\), when both columns are located on the same \gls{numa:node}. \par
Due to the challenges posed by sharing memory bandwidth we will benchmark prefetching in two configurations. The first will find both columns \texttt{a} and \texttt{b} located on the same \gls{numa:node}. We expect to demonstrate the memory bottleneck in this situation by execution time of \(SCAN_a\) rising by the amount of time spent prefetching in \(SCAN_b\). The second setup will see the columns distributed over two \gls{numa:node}s, still \glsentryshort{dram}, however. In this configuration \(SCAN_a\) should only suffer from the additional threads performing the prefetching. \par
\section{Observations}
In this section we will present our findings from applying the \texttt{Cache} developed in Chatpers \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}. We begin by presenting the results without prefetching, representing the upper and lower boundaries respectively. \par
\caption{Column \texttt{a} located in \glsentryshort{dram} and \texttt{b} in \glsentryshort{hbm}.}
\label{fig:timing-comparison:upplimit}
\end{subfigure}
\caption{Time spent on functions \(SCAN_a\) and \(AGGREGATE\) without prefetching for different locations of column \texttt{b}. Figure (a) represents the lower boundary by using only \glsentryshort{dram}, while Figure (b) simulates perfect caching by storing column \texttt{b} in \glsentryshort{hbm} during benchmark setup.}
\label{fig:timing-comparison}
\end{figure}
The benchmark executes a simple query as illustrated in Figure \ref{fig:eval-simple-query} which presents a challenging scenario to the cache. As the filter operation applied to \texttt{a} is not particularly complex, its execution time can be assumed to be short. Therefore, the Cache has little time during which it must prefetch, which will amplify delays caused by processing overhead in the Cache itself or from submission to the Work Queue. This makes the chosen query suited to stress test the developed solution. \par
Our baseline will be the performance achieved with both columns \texttt{a} and \texttt{b} located in \glsentryshort{dram} and no prefetching. The upper limit will be represented by measuring the scenario where \texttt{b} is already located in \gls{hbm} at the start of the benchmark, simulating prefetching with no overhead or delay. \par
With this difficult scenario, we expect to spend time analysing runtime behaviour of our benchmark in order to optimize the Cache and the way it is applied to the query. Optimizations should yield slight performance improvement over the baseline, using DRAM, and will not reach the theoretical peak, where the data for \texttt{b} resides in HBM. \par
\caption{Prefetching with columns \texttt{a} and \texttt{b} located on different \glsentryshort{dram}\glsentryshort{numa:node}s.}
\label{fig:timing-results:distprefetch}
\end{subfigure}
\caption{Time spent on functions \(SCAN_a\), \(SCAN_b\) and \(AGGREGATE\) with prefetching. Operations \(SCAN_a\) and \(SCAN_b\) execute concurrently. Figure (a) shows bandwidth limitation as time for \(SCAN_a\) increases drastically due to the copying of column \texttt{b} to \glsentryshort{hbm} taking place in parallel. For Figure (b), the columns are located on different \glsentryshort{numa:node}s, thereby the \(SCAN\)-operations do not compete for bandwidth.}
\label{fig:timing-results}
\end{figure}
Consider using parts of flamegraph. Same speed as dram, even though allocation is performed in the timed region and not before. Mention dml performs busy waiting (cite dsa-paper 4.4 for use of interrupts mentioned in arch), optimization with weak wait. Mention optimization weak access for prefetching scenario.
\begin{itemize}
\item Fig \ref{fig:timing-results:distprefetch} aggr for prefetch theoretically at level of hbm but due to more concurrent workload aggr gets slowed down too
\item Fig \ref{fig:timing-results:prefetch} scana increase due to sharing bandwidth reasonable, aggr seems a bit unreasonable
\end{itemize}
Scan A is memory bound, therefore copying B from DRAM to HBM directly cuts into available bandwidth to A when both are located on the same node. Better performance when A and B are stored on different nodes.
\todo{consider benchmarking only with one group and lowering wl size accordingly to reduce effects of overlapping on the graphic}