This chapter introduces the relevant technologies and concepts for this thesis.
\glsentrylong{hbm} is an emerging memory technology that promises an increase in peak bandwidth. It consists of stacked \glsentryshort{dram} dies \cite[p. 1]{hbm-arch-paper} and is gradually being integrated into server processors, with the Intel® Xeon® Max Series \cite{intel:xeonmaxbrief} being one recent example. \gls{hbm} on these systems can be configured in different memory modes, most notably, HBM Flat Mode and HBM Cache Mode \cite{intel:xeonmaxbrief}. The former gives applications direct control, requiring code changes, while the latter utilizes the \gls{hbm} as a cache for the system's \glsentryshort{dram}-based main memory \cite{intel:xeonmaxbrief}. \par
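On Linux, \gls{hbm} configured in Flat Mode is typically exposed as additional memory-only \glsentryshort{numa} nodes. As a hedged illustration (the node id is system-specific; node 8 is assumed here, matching the \glsentryshort{hbm} node used later in our evaluation), whole processes can be bound to such a node with \texttt{numactl}, while finer-grained, per-allocation placement requires code changes, e.g. via \texttt{libnuma}:

```shell
# Flat Mode exposes HBM as separate memory-only NUMA nodes.
# Inspect the topology to locate them (HBM nodes list no CPUs):
numactl --hardware

# Bind all allocations of a process to a (hypothetical) HBM node 8:
numactl --membind=8 ./benchmark
```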
\todo{find a nice graphic for hbm}
\section{\glsentrylong{qdp}}
\begin{figure}[h!tb]
Introduced with the \(4^{th}\) generation of Intel Xeon Scalable Processors, the \gls{dsa} aims to relieve the CPU from \enquote{common storage functions and operations such as data integrity checks and deduplication}\cite[p. 4]{intel:xeonbrief}. To fully utilize the hardware, a thorough understanding of its workings is essential. Therefore, we present an overview of the architecture, software, and the interaction of these two components, delving into the architectural details of the \gls{dsa} itself. All statements are based on Chapter 3 of the Architecture Specification by Intel. \par
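Conceptually, offloading a copy to the \gls{dsa} follows a submit-then-poll pattern: the CPU enqueues a work descriptor, continues with other work, and later checks a completion record. As a minimal stand-in sketch (using \texttt{std::async} in place of actual descriptor submission, e.g. via Intel DML; \texttt{submit\_copy} is an illustrative name, not an API of the \gls{dsa}), the control flow might look like:

```cpp
#include <cstring>
#include <future>
#include <vector>

// Stand-in for asynchronous DSA offload: submission returns immediately,
// the copy proceeds in the background, and completion is checked later.
// With real hardware, submission would enqueue a descriptor and waiting
// would poll a completion record; the control flow stays the same.
std::future<void> submit_copy(const std::vector<char>& src, std::vector<char>& dst) {
    return std::async(std::launch::async, [&src, &dst] {
        std::memcpy(dst.data(), src.data(), src.size());
    });
}
```

Between submission and completion, the CPU is free to run pipeline work, which is precisely the benefit we hope to gain from the \gls{dsa}.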
\todo{comment by andre: This seems a bit brief; perhaps explain more. E.g., the problem of the effort required to make copying fast enough (many threads) -> this then provides the starting point to state what we hope to gain from the \gls{dsa} (context for why we introduce QdP in the first place). And e.g., which trick we applied to keep the copying overhead low.}
We discovered an anomaly for \gls{numa:node} 12 for which we did not find an explanation. Scaling was found to be less than linear; we could not conclusively explain this either, but assume it results from increased congestion of the interconnect. \par
\todo{mention measures undertaken to find an explanation here}
Even though we could not find an explanation for all measurements, this chapter still gives insight into the performance of the \gls{dsa}, its strengths and its weaknesses. It provides data-driven guidance on a complex architecture, helping to find the optimal configuration for applying the \gls{dsa} to our expected, and possibly differing, workloads. \par
In this chapter we define our expectations for applying the developed \texttt{Cache} to \glsentrylong{qdp} and then evaluate the observed results. The code used is described in more detail in Section \ref{sec:impl:application}. \par
\todo{improve the sections in this chapter}
\section{Benchmarked Task}
\label{sec:eval:bench}
Due to the challenges posed by sharing memory bandwidth, we will benchmark prefetching.
\section{Observations}
In this section we present our findings from applying the \texttt{Cache} developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}. We begin with the results without prefetching, which serve as reference points for evaluating the results obtained with our \texttt{Cache}. Two methods were benchmarked, representing a baseline and an upper limit for what we can achieve with prefetching. For the former, all columns were located in \glsentryshort{dram}. The latter was achieved by placing column \texttt{b} in \gls{hbm} during benchmark initialization, thereby simulating perfect prefetching without delay or overhead. \par
\begin{table}[h!tb]
\centering
\input{tables/table-qdp-baseline.tex}
\caption{Table showing raw timing for \gls{qdp} on \glsentryshort{dram} and \gls{hbm}. The result for \glsentryshort{dram} serves as the baseline, while \glsentryshort{hbm} represents the upper boundary achievable with perfect prefetching. Raw Time is averaged over 5 iterations with prior warm-up.}
\label{table:qdp-baseline}
\end{table}
From Table \ref{table:qdp-baseline} it is obvious that accessing column \texttt{b} through \gls{hbm} yields an increase in processing speed. We will now take a closer look at the time spent in the different pipeline stages to gain a better understanding of how this bandwidth improvement accelerates the query. \par
\begin{figure}[h!tb]
\centering
\label{fig:timing-comparison}
\end{figure}
Due to the higher bandwidth with which \gls{hbm} serves \(AGGREGATE\), the CPU spends less time waiting on data from main memory, improving effective processing time. This is evident from the overall time of \(AGGREGATE\) shortening in Figure \ref{fig:timing-comparison:upplimit} compared to the baseline in Figure \ref{fig:timing-comparison:baseline}. As a consequence, more threads were tasked with \(SCAN_a\), which still accesses data from \glsentryshort{dram}, and fewer performed \(AGGREGATE\), compared to the scenario with data wholly in \glsentryshort{dram}. This is why the former method outperforms \gls{dram} not only for \(AGGREGATE\) but also for \(SCAN_a\). The raw timing values for both can be found in the legends of Figure \ref{fig:timing-comparison}. \par
With these comparison values we will now evaluate the results from enabling prefetching. \par
\begin{table}[h!tb]
\centering
\input{tables/table-qdp-speedup.tex}
\caption{Table showing Speedup for different \glsentryshort{qdp} Configurations over \glsentryshort{dram}. The result for \glsentryshort{dram} serves as the baseline, while \glsentryshort{hbm} represents the upper boundary achievable with perfect prefetching. Prefetching was performed with the same parameters and data locations as \gls{dram}, caching on Node 8 (\glsentryshort{hbm} accessor for the executing Node 0). Prefetching with Distributed Columns had columns \texttt{a} and \texttt{b} located on different Nodes. Raw Time is averaged over 5 iterations with prior warm-up.}
\label{table:qdp-speedup}
\end{table}
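The speedup values reported above are derived from raw times averaged over the repeated iterations, with warm-up runs excluded. A minimal sketch of this computation (function names and timing values are illustrative, not taken from our benchmark code) might look like:

```cpp
#include <numeric>
#include <vector>

// Average of repeated benchmark iterations (warm-up runs already excluded).
double average(const std::vector<double>& runs) {
    return std::accumulate(runs.begin(), runs.end(), 0.0)
           / static_cast<double>(runs.size());
}

// Speedup of a configuration relative to the DRAM baseline;
// values above 1.0 indicate an improvement over the baseline.
double speedup(double baseline_time, double config_time) {
    return baseline_time / config_time;
}
```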
The slowdown experienced when utilizing the \texttt{Cache} may be surprising. However, the result becomes reasonable when we consider that, in this scenario, the \gls{dsa}s executing the caching tasks compete for bandwidth with the \(SCAN_a\) pipeline threads, and that the \texttt{Cache} adds additional overhead. Distributing the columns across different \gls{numa:node}s then results in a noticeable performance increase over our baseline, while not reaching the upper boundary set by simulating perfect prefetching. This supports our assumption that the \(SCAN_a\) pipeline itself is bandwidth-bound: without the contention, we observe both an increased cache hit rate and improved performance. We now examine the performance in more detail with per-pipeline timings. \par
\begin{figure}[h!tb]
\centering
\label{fig:timing-results}
\end{figure}
\begin{itemize}
    \item In Figure \ref{fig:timing-results:distprefetch}, \(AGGREGATE\) with prefetching should theoretically reach the level of \gls{hbm}, but the additional concurrent workload slows it down as well.
    \item In Figure \ref{fig:timing-results:prefetch}, the increase for \(SCAN_a\) is plausible due to the shared bandwidth, while the increase for \(AGGREGATE\) seems unreasonable.
\end{itemize}
\todo{mention discrepancy between total times of per-pipeline and runtime: overhead in the pipelines and the outer function}
\todo{consider benchmarking only with one group and lowering wl size accordingly to reduce effects of overlapping on the graphic}
In Figure \ref{fig:timing-results:prefetch}, the bandwidth competition between \(SCAN_a\) and \(SCAN_b\) is evident, with \(SCAN_a\) exhibiting significantly longer execution times. This results in extended overlaps between groups still processing \(SCAN_a\) and those already engaged in \(AGGREGATE\). Consequently, despite a relatively high cache hit rate, we observe minimal speedup for \(AGGREGATE\) compared with the baseline from Figure \ref{fig:timing-comparison:baseline}. The long total runtime is then explained by the extended duration of \(SCAN_a\). \par
For the benchmark shown in Figure \ref{fig:timing-results:distprefetch}, we distributed columns \texttt{a} and \texttt{b} across two nodes, so the parallel execution of prefetching tasks on the \gls{dsa} does not directly impede the bandwidth available to \(SCAN_a\). However, there is a discernible increase in CPU cycles due to the overhead associated with cache utilization, visible as time spent in \(SCAN_b\). As a result, both \(SCAN_a\) and \(AGGREGATE\) experience slightly longer execution times. Thread assignment and count also play a role here, but are not the focus of this work, as they are analysed in detail by André Berthold et al. in \cite{dimes-prefetching}. Therefore, all displayed results stem from the best-performing configuration benchmarked \todo{maybe leave this out?}. \par
\section{Discussion}
\begin{itemize}
    \item Is the benchmark a good choice? (Yes: it represents a worst case, with a memory-bound application, and amplifies the effects of overhead.)
    \item Are there other use cases for the cache? (Yes: CPU-bound applications, since offloading to the \gls{dsa} frees up CPU cycles.)
\item Can we assume that it is possible to distribute columns over multiple nodes?