% bezüglich der Ergebnisse zu erläutern und anschließend eventuell
% festgestellte Abweichungen zu erklären.
In this chapter we will define our expectations, applying the developed Cache to \glsentrylong{qdp}, and then evaluate the observed results. The code used is described in more detail in Section \ref{sec:impl:application}. \par
\todo{improve the sections in this chapter}
\todo{qdp bench code for now uses only the 4 on-node dsas, check (given time) whether round-robinning to all available makes a positive difference}
In this chapter, we establish anticipated outcomes for incorporating the developed Cache into the context of \glsentrylong{qdp}, followed by a comprehensive assessment of the achieved results. The specifics of the benchmark are elaborated upon in Section \ref{sec:impl:application}. We conclude with a discussion of the choices made regarding the benchmarking methodology and their influence on our results. \par
\section{Benchmarked Task}
\label{sec:eval:bench}
The benchmark executes a simple query as illustrated in Figure \ref{fig:qdp-simple-query}. From here on, we use the notations \(SCAN_a\) for the pipeline that scans and subsequently filters column \texttt{a}, \(SCAN_b\) for the pipeline that prefetches column \texttt{b}, and \(AGGREGATE\) for the projection and final summation step. We use a column size of 4 GiB. The work is divided over multiple groups and each group spawns threads for each pipeline step. For a fair comparison, each tested configuration uses 64 threads for the first stage (\(SCAN_a\) and \(SCAN_b\)) and 32 for the second stage (\(AGGREGATE\)), while being restricted to run on \gls{numa:node} 0 through pinning. For configurations not performing prefetching, \(SCAN_b\) is not executed. We measure the time spent in each pipeline, the cache hit percentage, and the total processing time. \par
The benchmark executes a simple query, as depicted in Figure \ref{fig:qdp-simple-query}. We will henceforth denote \(SCAN_a\) as the pipeline responsible for scanning and subsequently filtering column \texttt{a}, \(SCAN_b\) as the pipeline tasked with prefetching column \texttt{b}, and \(AGGREGATE\) as the projection and final summation step. The column size is 4 GiB. The workload is distributed across multiple groups, with each group spawning threads for every pipeline step. To ensure a fair comparison, each tested configuration employs 64 threads for the first stage (\(SCAN_a\) and \(SCAN_b\)) and 32 for the second stage (\(AGGREGATE\)), while being constrained to execute on \gls{numa:node} 0 through pinning. For configurations without prefetching, \(SCAN_b\) is omitted. We measure total and per-pipeline duration and, for prefetching, the cache hit percentage. \par
Pipelines \(SCAN_a\) and \(SCAN_b\) execute concurrently, completing their workload and then signalling \(AGGREGATE\), which then finalizes the operation. With the goal of improving the cache hit rate, we decided to loosen this restriction and let \(SCAN_b\) work freely, only synchronizing \(SCAN_a\) with \(AGGREGATE\). Work is therefore submitted to the \gls{dsa} as frequently as possible, thereby hopefully completing each caching operation for a chunk of \texttt{b} before \(SCAN_a\) finishes processing the associated chunk of \texttt{a}. \par
The pipelines \(SCAN_a\) and \(SCAN_b\) execute concurrently, completing their tasks before signalling \(AGGREGATE\) for finalization. To improve the cache hit rate, we opted to relax this constraint, allowing \(SCAN_b\) to operate independently, while only synchronizing \(SCAN_a\) with \(AGGREGATE\). Consequently, work is submitted to the \gls{dsa} as frequently as possible, aiming to complete the caching operation for a chunk of \texttt{b} before \(SCAN_a\) finishes processing the corresponding chunk of \texttt{a}. \par
\section{Expectations}
\label{sec:eval:expectations}
The simple query presents a challenging scenario to the \texttt{Cache}. As the filter operation applied to column \texttt{a} is not particularly complex, its execution time can be assumed to be short. Therefore, the \texttt{Cache} has little time during which it must prefetch, which will amplify delays caused by processing overhead in the \texttt{Cache} or during accelerator offload. Additionally, it can be assumed that the task is memory bound. As the prefetching of \texttt{b} in \(SCAN_b\) and the load and subsequent filter of \texttt{a} in \(SCAN_a\) will execute in parallel, caching therefore directly reduces the memory bandwidth available to \(SCAN_a\), when both columns are located on the same \gls{numa:node}. \par
Due to the challenges posed by sharing memory bandwidth we will benchmark prefetching in two configurations. The first will find both columns \texttt{a} and \texttt{b} located on the same \gls{numa:node}. We expect to demonstrate the memory bottleneck in this situation by execution time of \(SCAN_a\) rising by the amount of time spent prefetching in \(SCAN_b\). The second setup will see the columns distributed over two \gls{numa:node}s, still \glsentryshort{dram}, however. In this configuration \(SCAN_a\) should only suffer from the additional threads performing the prefetching. \par
The simple query presents a challenging scenario for the \texttt{Cache}. The execution time for the filter operation applied to column \texttt{a} is expected to be brief. Consequently, the \texttt{Cache} has limited time for prefetching, which may exacerbate delays caused by processing overhead in the \texttt{Cache} or during accelerator offload. Furthermore, it can be assumed that \(SCAN_a\) is memory-bound by itself. Since the prefetching of \texttt{b} in \(SCAN_b\) and the loading and subsequent filtering of \texttt{a} in \(SCAN_a\) occur concurrently, caching directly diminishes the memory bandwidth available to \(SCAN_a\) when both columns are located on the same \gls{numa:node}.
\section{Observations}
\label{sec:eval:observations}
In this section we will present our findings from applying the \texttt{Cache} developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}. We begin by presenting the results without prefetching, used as reference in evaluation of the results using our \texttt{Cache}. Two methods were benchmarked, representing a baseline and upper limit for what we can achieve with prefetching. For the former, all columns were located in \glsentryshort{dram}. The latter is achieved by placing column \texttt{b} in \gls{hbm} during benchmark initialization, thereby simulating perfect prefetching without delay and overhead. \par
In this section, we will present our findings from integrating the \texttt{Cache} developed in Chapters \ref{chap:design} and \ref{chap:implementation} into \gls{qdp}. We commence by presenting results obtained without prefetching, which serve as a reference for evaluating the effectiveness of our \texttt{Cache}. For all results presented, the number of threads per pipeline and the number of groups influence performance \cite{dimes-prefetching}; analysing this influence, however, is outside the scope of this work. Therefore, we only present the best benchmarked results for each configuration. \par
\subsection{Benchmarks without Prefetching}
We benchmarked two methods to establish a baseline and an upper limit as reference points. In the former, all columns are located in \glsentryshort{dram}. The latter method simulates perfect prefetching without delay and overhead by placing column \texttt{b} in \gls{hbm} during benchmark initialization. \par
\begin{table}[h!tb]
\centering
@ -41,17 +40,20 @@ In this section we will present our findings from applying the \texttt{Cache} de
\label{table:qdp-baseline}
\end{table}
From Table \ref{table:qdp-baseline} it is obvious that accessing column \texttt{b} through \gls{hbm} yields an increase in processing speed. We will now take a closer look at the time spent in the different pipeline stages to gain a better understanding of how this bandwidth improvement accelerates the query. \par
From Table \ref{table:qdp-baseline}, it is evident that accessing column \texttt{b} through \gls{hbm} results in an increase in processing speed. To gain a better understanding of how the increased bandwidth of \gls{hbm} accelerates the query, we will delve deeper into the time spent in the different pipeline stages. \par
The following plots are normalized so that the longest execution from Figures \ref{fig:timing-comparison} and \ref{fig:timing-results} fills the half-circle. As waiting times at the barriers, which can vary by workload, are not displayed here, the graphs do not fully represent the total execution time. Additionally, the total runtime encompasses some overhead that the per-pipeline timings do not cover. Therefore, a discrepancy between the raw runtime values in the tables and those in the figures may be observed. \par
\caption{Column \texttt{a} located in \glsentryshort{dram} and \texttt{b} in \glsentryshort{hbm}.}
@ -61,9 +63,11 @@ From Table \ref{table:qdp-baseline} it is obvious, that accessing column \texttt
\label{fig:timing-comparison}
\end{figure}
Due to the higher bandwidth that \gls{hbm} provides to \(AGGREGATE\), the CPU waits less on data from main memory, improving effective processing times. This is evident from the overall time of \(AGGREGATE\) shortening in Figure \ref{fig:timing-comparison:upplimit}, compared to the baseline from Figure \ref{fig:timing-comparison:baseline}. Therefore, more threads were tasked with \(SCAN_a\), which still accesses data from \glsentryshort{dram}, and fewer with \(AGGREGATE\), compared to the scenario with all data in \glsentryshort{dram}. That is the reason why the \gls{hbm} configuration not only outperforms \gls{dram} for \(AGGREGATE\) but also for \(SCAN_a\). See raw timing values for both in the legends of Figure \ref{fig:timing-comparison}. \par
Due to the higher bandwidth provided by \gls{hbm} for \(AGGREGATE\), the CPU waits less for data from main memory, thereby improving processing times. This is evident in the overall shorter time taken for \(AGGREGATE\) in Figure \ref{fig:timing-comparison:upplimit} compared to the baseline depicted in Figure \ref{fig:timing-comparison:baseline}. Consequently, more threads can be assigned to \(SCAN_a\), with \(AGGREGATE\) requiring fewer resources. This explains why the \gls{hbm} results show faster processing times than \gls{dram} not only for \(AGGREGATE\) but also for \(SCAN_a\). \par
\subsection{Benchmarks using Prefetching}
With these comparison values we will now evaluate the results from enabling prefetching. \todo{this transition is not good}\par
To address the challenges posed by sharing memory bandwidth between both \(SCAN\) operations, we benchmark prefetching in two configurations. In the first, both columns \texttt{a} and \texttt{b} are located on the same \gls{numa:node}. We expect this scenario to demonstrate the memory bottleneck through an increased execution time of \(SCAN_a\). In the second, we distribute the columns across two \gls{numa:node}s, both still utilizing \glsentryshort{dram}. In this configuration, the memory bottleneck is alleviated, leading us to anticipate better performance compared to the former setup. \par
\begin{table}[h!tb]
\centering
@ -72,17 +76,18 @@ With these comparison values we will now evaluate the results from enabling pref
\label{table:qdp-speedup}
\end{table}
The slowdown experienced when utilizing the \texttt{Cache} may be surprising. However, the result becomes reasonable when we consider that for this scenario, the \gls{dsa}s executing the caching tasks compete for bandwidth with the \(SCAN_a\) pipeline threads, and that there is an additional overhead from the \texttt{Cache}. Distributing the columns across different \gls{numa:node}s then results in a noticeable performance increase in comparison with our baseline, while not reaching the upper boundary set by simulating perfect prefetching. This supports our assumption that the \(SCAN_a\) pipeline itself is bandwidth bound, as without this contention, we see increased cache hit rate and performance. We now again examine the performance in more detail with per-pipeline timings. \par
The slowdown below our baseline when utilizing the \texttt{Cache} may be surprising at first glance. However, this result becomes reasonable when we consider that in this scenario, the \gls{dsa}s executing the caching tasks compete for bandwidth with the \(SCAN_a\) pipeline threads, and there is additional overhead from the \texttt{Cache}. Distributing the columns across different \gls{numa:node}s then results in a noticeable performance increase compared to our baseline, although it does not reach the upper boundary set by simulating perfect prefetching. This supports our assumption that the \(SCAN_a\) pipeline itself is bandwidth-bound, as without this contention, we observe an increase in cache hit rate and a decrease in processing time. We will now examine the performance in more detail with per-pipeline timings. \par
\caption{Prefetching with columns \texttt{a} and \texttt{b} located on different \glsentryshort{dram} \glsentryshort{numa:node}s.}
@ -92,11 +97,9 @@ The slowdown experienced with utilizing the \texttt{Cache} may be a surprising.
\label{fig:timing-results}
\end{figure}
\todo{mention discrepancy between total times of per-pipeline and runtime: overhead in the pipelines and the outer function}
In Figure \ref{fig:timing-results:prefetch}, the bandwidth competition between \(SCAN_a\) and \(SCAN_b\) is evident, with \(SCAN_a\) exhibiting significantly longer execution times. This prolonged execution duration results in prolonged overlaps between groups still processing \(SCAN_a\) and those engaged in \(AGGREGATE\) operations. Consequently, despite the cache hit rate being relatively high, there is minimal speedup observed for \(AGGREGATE\), when compared with the baseline from Figure \ref{fig:timing-comparison:baseline}. The long runtime is then explained by the extended duration of \(SCAN_a\). \par
In Figure \ref{fig:timing-results:prefetch}, the competition for bandwidth between \(SCAN_a\) and \(SCAN_b\) is evident, with \(SCAN_a\) showing significantly longer execution times. This prolonged duration of execution leads to extended overlaps between groups still processing \(SCAN_a\) and those engaged in \(AGGREGATE\). Consequently, despite the relatively high cache hit rate, minimal speed-up is observed for \(AGGREGATE\) compared to the baseline depicted in Figure \ref{fig:timing-comparison:baseline}. The extended runtime can be attributed to the prolonged duration of \(SCAN_a\). \par
For the benchmark shown in Figure B, where we distributed columns \texttt{a} and \texttt{b} across two nodes, the parallel execution of prefetching tasks on \gls{dsa} does not directly impede the bandwidth available to \(SCAN_a\). However, there is a discernible increase in CPU cycles due to the overhead associated with cache utilization, visible as time spent in \(SCAN_b\). As a result, both \(SCAN_a\) and \(AGGREGATE\) operations experience slightly longer execution times. Thread assignment and count also play a role here but are not the focus of this work, as they are analysed in detail by André Berthold et al. in \cite{dimes-prefetching}. Therefore, all displayed results stem from the best-performing configuration benchmarked \todo{maybe leave this out?}. \par
Regarding the benchmark depicted in Figure \ref{fig:timing-results:prefetch}, where we distributed columns \texttt{a} and \texttt{b} across two nodes, the parallel execution of prefetching tasks on \gls{dsa} does not directly impede the bandwidth available to \(SCAN_a\). However, there is a discernible overhead associated with cache utilization, as evident in the time spent in \(SCAN_b\). Consequently, both \(SCAN_a\) and \(AGGREGATE\) operations experience slightly longer execution times than in the upper limit shown in Figure \ref{fig:timing-comparison:upplimit}. \par