\section{Benchmarked Task}
\label{sec:eval:bench}
The benchmark executes a simple query as illustrated in Figure \ref{fig:qdp-simple-query}. Hereinafter, we use the notation \(SCAN_a\) for the pipeline that scans and subsequently filters column \texttt{a}, \(SCAN_b\) for the pipeline that prefetches column \texttt{b}, and \(AGGREGATE\) for the projection and final summation step. We use a column size of 4 GiB. The work is divided over multiple groups, with each group spawning threads for each pipeline step. For a fair comparison, each tested configuration uses 64 threads for the first stage (\(SCAN_a\) and \(SCAN_b\)) and 32 for the second stage (\(AGGREGATE\)), while being restricted to run on \gls{numa:node} 0 through pinning. For configurations that do not perform prefetching, \(SCAN_b\) is not executed. We measure the time spent in each pipeline, the cache hit percentage and the total processing time. \par
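The following sketch outlines this thread layout. It is illustrative rather than an excerpt of the benchmark code: the group count of 32 (one thread per pipeline per group, which yields the 64/32 split), the pinning via \texttt{numa\_run\_on\_node} from \texttt{libnuma} and the stubbed pipeline functions are assumptions made for this sketch. Signalling between the pipelines is omitted here and discussed in the following paragraph.
\begin{verbatim}
// Sketch of the assumed thread layout: one thread per pipeline per group,
// i.e. 64 first-stage threads (SCAN_a, SCAN_b) and 32 second-stage threads
// (AGGREGATE) for 32 groups, all pinned to NUMA node 0 via libnuma.
// Inter-pipeline signalling is omitted here and sketched further below.

#include <numa.h>    // libnuma: numa_available(), numa_run_on_node()
#include <thread>
#include <vector>
#include <cstddef>

constexpr std::size_t GROUP_COUNT    = 32;  // assumed group count (64/32 split)
constexpr int         EXECUTING_NODE = 0;   // every thread is pinned here

// Stand-ins for the pipeline steps implemented by the benchmark.
void scan_a(std::size_t)    { /* scan + filter a chunk range of column a  */ }
void scan_b(std::size_t)    { /* prefetch a chunk range of column b       */ }
void aggregate(std::size_t) { /* projection and final summation           */ }

// Spawns a worker that first restricts itself to the executing NUMA node.
template <typename Fn>
std::thread spawn_pinned(Fn fn, std::size_t group) {
  return std::thread([fn, group] {
    numa_run_on_node(EXECUTING_NODE);
    fn(group);
  });
}

int main() {
  if (numa_available() < 0) return 1;  // libnuma unusable on this system

  std::vector<std::thread> workers;
  for (std::size_t g = 0; g < GROUP_COUNT; ++g) {
    workers.push_back(spawn_pinned(scan_a, g));     // first stage
    workers.push_back(spawn_pinned(scan_b, g));     // first stage
    workers.push_back(spawn_pinned(aggregate, g));  // second stage
  }
  for (auto& t : workers) t.join();
}
\end{verbatim}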
In a straightforward implementation, pipelines \(SCAN_a\) and \(SCAN_b\) execute concurrently, complete their workload and then signal \(AGGREGATE\), which finalizes the operation. With the goal of improving the cache hit rate, we decided to loosen this restriction and let \(SCAN_b\) work freely, synchronizing only \(SCAN_a\) with \(AGGREGATE\). Work is therefore submitted to the \gls{dsa} as frequently as possible, ideally completing the caching operation for each chunk of \texttt{b} before \(SCAN_a\) finishes processing the associated chunk of \texttt{a}. \par
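The sketch below makes this chunk-wise interplay concrete. The \texttt{Cache} type with its \texttt{submit} and \texttt{await} methods, the per-chunk flags and the chunk count are hypothetical placeholders and not the interface of our implementation; the sketch only illustrates that \(SCAN_b\) never waits, while \(AGGREGATE\) synchronizes with \(SCAN_a\) per chunk and then obtains the cached chunk of \texttt{b}.
\begin{verbatim}
// Sketch of the loosened synchronization within one group: SCAN_b submits
// caching work chunk by chunk without ever waiting, while AGGREGATE waits
// only for the per-chunk signal from SCAN_a (and for the cached data).
// Cache, submit() and await() are hypothetical placeholders.

#include <array>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>

constexpr std::size_t CHUNK_COUNT = 64;  // assumed number of chunks per group

// Per-chunk flags, set by SCAN_a once the filter mask for a chunk is ready.
std::array<std::atomic<bool>, CHUNK_COUNT> mask_ready{};

struct Cache {  // stand-in for the DSA-backed cache
  void submit(std::size_t) { /* enqueue asynchronous copy of the chunk */ }
  const void* await(std::size_t) { /* block until cached */ return nullptr; }
};

void scan_b(Cache& cache) {
  // Submit caching requests for every chunk of b as early as possible.
  for (std::size_t c = 0; c < CHUNK_COUNT; ++c) cache.submit(c);
}

void scan_a() {
  for (std::size_t c = 0; c < CHUNK_COUNT; ++c) {
    // ... scan and filter chunk c of column a, producing the bitmask ...
    mask_ready[c].store(true, std::memory_order_release);  // signal AGGREGATE
  }
}

void aggregate(Cache& cache) {
  for (std::size_t c = 0; c < CHUNK_COUNT; ++c) {
    while (!mask_ready[c].load(std::memory_order_acquire))
      std::this_thread::yield();             // synchronize with SCAN_a only
    const void* b_chunk = cache.await(c);    // ideally already resident
    // ... project and sum over b_chunk using the bitmask of chunk c ...
    (void)b_chunk;
  }
}

int main() {
  Cache cache;
  std::thread tb(scan_b, std::ref(cache));   // runs freely
  std::thread ta(scan_a);
  std::thread tg(aggregate, std::ref(cache));
  tb.join(); ta.join(); tg.join();
}
\end{verbatim}
If the caching request for a chunk was submitted early enough, the call blocking on the cached data returns immediately, which is exactly the situation the loosened synchronization aims to create. \par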
\section{Expectations}
\label{sec:eval:expectations}
Our baseline will be the performance achieved with both columns \texttt{a} and \texttt{b} located in \glsentryshort{dram}.
\todo{consider benchmarking only with one group and lowering wl size accordingly to reduce effects of overlapping on the graphic}
\begin{table}[h!tb]
\centering
\input{tables/table-qdpspeedup.tex}
\caption{Speedup of different \glsentryshort{qdp} configurations over \glsentryshort{dram}. The result for \glsentryshort{dram} serves as the baseline, while \glsentryshort{hbm} represents the upper boundary achievable with perfect prefetching. Prefetching was performed with the same parameters and data locations as for \gls{dram}, caching on Node 8 (the \glsentryshort{hbm} accessor for the executing Node 0). Prefetching with Distributed Columns had columns \texttt{a} and \texttt{b} located on different nodes.}