
add the speedup table from qdp bench and include it in chapter 6

master
Constantin Fürst 11 months ago
parent
commit faa5114ba8
  1. thesis/content/60_evaluation.tex (11 changed lines)
  2. thesis/tables/table-qdpspeedup.tex (10 changed lines)

thesis/content/60_evaluation.tex (11 changed lines)

@@ -15,12 +15,10 @@ In this chapter we will define our expectations, applying the developed Cache to
\section{Benchmarked Task}
\label{sec:eval:bench}
The benchmark executes a simple query as illustrated in Figure \ref{fig:qdp-simple-query}. We will from hereinafter use notations \(SCAN_a\) for the pipeline that performs scan and subsequently filter on column \texttt{a}, \(SCAN_b\) for the pipeline that prefetches column \texttt{b} and \(AGGREGATE\) for the projection and final summation step. We use a column size of 4 GiB and divide this work over 32 groups, assigning one thread to each pipeline per group \todo{maybe describe that the high group count was chosen to allow the implementation without overlapping pipeline execution to still perform well} \todo{consider moving to overlapping pipelines if enough time is found}. For configurations not performing prefetching, \(SCAN_b\) is not executed. We measure the times spent in each pipeline, cache hit percentage and total processing time. \par
The benchmark executes a simple query as illustrated in Figure \ref{fig:qdp-simple-query}. We will hereinafter use the notations \(SCAN_a\) for the pipeline that performs scan and subsequently filter on column \texttt{a}, \(SCAN_b\) for the pipeline that prefetches column \texttt{b}, and \(AGGREGATE\) for the projection and final summation step. We use a column size of 4 GiB. The work is divided over multiple groups, and each group spawns threads for each pipeline step. For a fair comparison, each tested configuration uses 64 threads for the first stage (\(SCAN_a\) and \(SCAN_b\)) and 32 for the second stage (\(AGGREGATE\)), while being restricted to run on \gls{numa:node} 0 through pinning. For configurations not performing prefetching, \(SCAN_b\) is not executed. We measure the time spent in each pipeline, the cache hit percentage and the total processing time. \par
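As a rough illustration of this setup, the following C++ sketch spawns the described thread counts and pins them to \gls{numa:node} 0 using libnuma. It is a minimal sketch, not the benchmark implementation: the group count of 32 is assumed from the stated thread totals (two first-stage threads and one \(AGGREGATE\) thread per group), and the pipeline bodies are empty placeholders.

// Minimal sketch of the thread layout described above (not the thesis code).
// Assumes 32 groups: 2 first-stage threads per group = 64, 1 AGGREGATE per group = 32.
// Build with: g++ -O2 sketch.cpp -lnuma
#include <numa.h>
#include <cstddef>
#include <thread>
#include <vector>

void scan_a(std::size_t)    { /* placeholder: scan and filter column a */ }
void scan_b(std::size_t)    { /* placeholder: prefetch column b through the cache */ }
void aggregate(std::size_t) { /* placeholder: projection and summation */ }

int main() {
    if (numa_available() < 0) return 1;   // libnuma not usable on this system
    constexpr std::size_t groups = 32;    // assumed group count
    std::vector<std::thread> workers;

    for (std::size_t g = 0; g < groups; ++g) {
        workers.emplace_back([g] { numa_run_on_node(0); scan_a(g); });     // first stage
        workers.emplace_back([g] { numa_run_on_node(0); scan_b(g); });     // first stage
        workers.emplace_back([g] { numa_run_on_node(0); aggregate(g); });  // second stage
    }
    for (auto& t : workers) t.join();
}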
Pipelines \(SCAN_a\) and \(SCAN_b\) execute concurrently, completing their workload and then signalling \(AGGREGATE\), which finalizes the operation. With the goal of improving cache hit rate, we loosen this synchronization and let \(SCAN_b\) work freely, only synchronizing \(SCAN_a\) with \(AGGREGATE\). Work is therefore submitted to the \gls{dsa} as frequently as possible, hopefully completing each caching operation for a chunk of \texttt{b} before \(SCAN_a\) finishes processing the associated chunk of \texttt{a}. \par
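To make the loosened synchronization more concrete, the following C++ sketch shows one possible structure. It assumes a chunked workload and a simple atomic progress counter; the chunk count and the helpers submit_chunk_to_cache, scan_and_filter_a and aggregate_chunk are hypothetical placeholders, not the actual Cache or benchmark interface.

// Sketch of the relaxed synchronization: SCAN_b submits caching work for every
// chunk of b without waiting for anyone, while AGGREGATE waits only on SCAN_a's
// per-chunk progress counter.
#include <atomic>
#include <cstddef>
#include <thread>

constexpr std::size_t num_chunks = 1024;     // assumed chunk count
std::atomic<std::size_t> scanned_a{0};       // chunks of a completed by SCAN_a

void submit_chunk_to_cache(std::size_t) { /* placeholder: enqueue a DSA copy of b[chunk] */ }
void scan_and_filter_a(std::size_t)     { /* placeholder: scan and filter a[chunk] */ }
void aggregate_chunk(std::size_t)       { /* placeholder: project and sum one chunk */ }

void scan_b_pipeline() {                     // runs freely, never waits
    for (std::size_t c = 0; c < num_chunks; ++c) submit_chunk_to_cache(c);
}

void scan_a_pipeline() {
    for (std::size_t c = 0; c < num_chunks; ++c) {
        scan_and_filter_a(c);
        scanned_a.fetch_add(1, std::memory_order_release);   // signal AGGREGATE
    }
}

void aggregate_pipeline() {
    for (std::size_t c = 0; c < num_chunks; ++c) {
        // wait only for SCAN_a; SCAN_b is deliberately not part of this synchronization
        while (scanned_a.load(std::memory_order_acquire) <= c) std::this_thread::yield();
        aggregate_chunk(c);
    }
}

int main() {
    std::thread a(scan_a_pipeline), b(scan_b_pipeline), agg(aggregate_pipeline);
    a.join(); b.join(); agg.join();
}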
To ensure
\section{Expectations}
\label{sec:eval:expectations}
@@ -77,6 +75,13 @@ Our baseline will be the performance achieved with both columns \texttt{a} and \
\todo{consider benchmarking with only one group and lowering the workload size accordingly to reduce effects of overlapping on the graphic}
\begin{table}[h!tb]
\centering
\input{tables/table-qdpspeedup.tex}
\caption{Speedup of different \glsentryshort{qdp} configurations over \glsentryshort{dram}. The result for \glsentryshort{dram} serves as the baseline, while \glsentryshort{hbm} represents the upper boundary achievable with perfect prefetching. Prefetching was performed with the same parameters and data locations as for \gls{dram}, caching on Node 8 (the \glsentryshort{hbm} accessor for the executing Node 0). For Prefetching with Distributed Columns, columns \texttt{a} and \texttt{b} were located on different nodes.}
\label{table:qdp-speedup}
\end{table}
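Since the table reports speedup relative to \glsentryshort{dram}, a short definition helps with reading the values; we assume here the usual ratio of total processing times, which matches the baseline description in the caption:

\[
S_{\text{config}} = \frac{T_{\text{DRAM}}}{T_{\text{config}}}
\]

Under this assumed definition, the speedup of \(1.41\) for \glsentryshort{hbm} corresponds to a total processing time of roughly 71\,\% of the \glsentryshort{dram} baseline, while the \(0.82\) observed for plain prefetching corresponds to an increase to about 122\,\%. \par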
\section{Discussion}
%%% Local Variables:

thesis/tables/table-qdpspeedup.tex (10 changed lines)

@@ -0,0 +1,10 @@
\begin{tabular}{lll}
\toprule
Configuration & Speedup & Cache Hit Rate \\
\midrule
DDR-SDRAM (Baseline) & \(1.00\times\) & \textemdash \\
HBM (Upper Limit) & \(1.41\times\) & \textemdash \\
Prefetching & \(0.82\times\) & 89.38\,\% \\
Prefetching, Distributed Columns & \(1.23\times\) & 93.20\,\% \\
\bottomrule
\end{tabular}