% Kapitel wird in der Regel zuerst geschrieben und ist das Einfachste
% (oder das Schwerste weil erste).
\todo{write introductory paragraph}
This chapter introduces the relevant technologies and concepts for this thesis. The goal of this thesis is to apply the \glsentrylong{dsa} to the concept of \glsentrylong{qdp}, therefore we will familiarize ourselves with both of these terms. We also give background on \glsentrylong{hbm}, which is a secondary memory technology to the \glsentryshort{dram} used in current computers.
\section{\glsentrylong{hbm}}
\label{sec:state:hbm}
\glsentrylong{hbm} is a novel memory technology that promises an increase in peak bandwidth. It consists of stacked \glsentryshort{dram} dies \cite[p. 1]{hbm-arch-paper} and is gradually being integrated into server processors, with the Intel® Xeon® Max Series \cite{intel:xeonmaxbrief} being one recent example. \gls{hbm} on these systems can be configured in different memory modes, most notably, HBM Flat Mode and HBM Cache Mode \cite{intel:xeonmaxbrief}. The former gives applications direct control, requiring code changes, while the latter utilizes the \gls{hbm} as a cache for the system's \glsentryshort{dram}-based main memory \cite{intel:xeonmaxbrief}. \par
\glsentrylong{hbm} is an emerging memory technology that promises an increase in peak bandwidth. It consists of stacked \glsentryshort{dram} dies \cite[p. 1]{hbm-arch-paper} and is gradually being integrated into server processors, with the Intel® Xeon® Max Series \cite{intel:xeonmaxbrief} being one recent example. \gls{hbm} on these systems can be configured in different memory modes, most notably, HBM Flat Mode and HBM Cache Mode \cite{intel:xeonmaxbrief}. The former gives applications direct control, requiring code changes, while the latter utilizes the \gls{hbm} as a cache for the system's \glsentryshort{dram}-based main memory \cite{intel:xeonmaxbrief}. \par
\caption{Illustration of a simple query in (a) and the corresponding pipeline in (b). Taken from \cite[Fig. 1]{dimes-prefetching}.}
\label{fig:qdp-simple-query}
\end{figure}
\gls{qdp} introduces a targeted strategy for optimizing database performance by intelligently prefetching relevant data. To achieve this, \gls{qdp} analyses queries, splitting them into distinct sub-tasks, resulting in the so-called query execution plan. An example of a query and a corresponding plan is depicted in Figure \ref{fig:qdp-simple-query}. From this plan, \gls{qdp} determines columns in the database used in subsequent tasks. Once identified, the system proactively copies these columns into faster memory during the execution of the pipeline. For the example (Figure \ref{fig:qdp-simple-query}), column \texttt{b} is accessed in \(SCAN_b\) and \(PROJECT_{b<-a}\) and column \texttt{a} is only accessed for \(SCAN_a\). Therefore, only column \texttt{b} will be chosen for prefetching in this scenario. \cite{dimes-prefetching}\par
\section{\glsentrylong{dsa}}
@ -48,7 +55,7 @@ Introduced with the \(4^{th}\) generation of Intel Xeon Scalable Processors, the
\caption{\glsentrylong{dsa} Internal Architecture \cite[Fig. 1 (a)]{intel:analysis}. Shows the components that the chip is made up of, how they are connected and which outside components the \glsentryshort{dsa} communicates with.}
@ -59,11 +66,11 @@ The \gls{dsa} chip is directly integrated into the processor and attaches via th
\subsubsection{Architectural Components}
\textsc{Component \rom{1}\glsentrylong{dsa:wq}:}\gls{dsa:wq}s provide the means to submit tasks to the device and will be described in more detail shortly. They are marked yellow in Figure \ref{fig:dsa-internal-block}. A \gls{dsa:wq} is accessible through so-called portals, light blue in Figure \ref{fig:dsa-internal-block}, which are mapped memory regions. Submission of work is done by writing a descriptor to one of these. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue. There are two possible queue types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes and each group may only have one attached. A \gls{pcie-dmr}, which guarantees implicit synchronization, is generated via \gls{x86:enqcmd} and communicates with the device before writing \cite[Sec. 3.3.1]{intel:dsaspec}. This may result in higher submission cost, compared to the \gls{dsa:dwq} to which a descriptor is submitted via \gls{x86:movdir64b}\cite[Sec. 3.3.2]{intel:dsaspec}. \par
\textsc{Component \rom{1},\glsentrylong{dsa:wq}:}\gls{dsa:wq}s provide the means to submit tasks to the device and will be described in more detail shortly. They are marked yellow in Figure \ref{fig:dsa-internal-block}. A \gls{dsa:wq} is accessible through so-called portals, light blue in Figure \ref{fig:dsa-internal-block}, which are mapped memory regions. Submission of work is done by writing a descriptor to one of these. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue. There are two possible queue types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes and each group may only have one attached. A \gls{pcie-dmr}, which guarantees implicit synchronization, is generated via \gls{x86:enqcmd} and communicates with the device before writing \cite[Sec. 3.3.1]{intel:dsaspec}. This may result in higher submission cost, compared to the \gls{dsa:dwq} to which a descriptor is submitted via \gls{x86:movdir64b}\cite[Sec. 3.3.2]{intel:dsaspec}. \par
\textsc{Component \rom{2} Engine:} An Engine is the processing-block that connects to memory and performs the described task. To handle the different descriptors, each Engine has two internal execution paths. One for a task and the other for a batch descriptor. Processing a task descriptor is straightforward, as all information required to complete the operation are contained within \todo{cite this}. For a batch, the \gls{dsa} reads the batch descriptor, then fetches all task descriptors from memory and processes them \cite[Sec. 3.8]{intel:dsaspec}. An Engine can coordinate with the operating system in case it encounters a page fault, waiting on its resolution, if configured to do so, while otherwise, an error will be generated in this scenario \cite[Sec. 2.2, Block on Fault]{intel:dsaspec}. \par
\textsc{Component \rom{2}, Engine:} An Engine is the processing-block that connects to memory and performs the described task. To handle the different descriptors, each Engine has two internal execution paths. One for a task and the other for a batch descriptor. Processing a task descriptor is straightforward, as all information required to complete the operation are contained within \todo{cite this}. For a batch, the \gls{dsa} reads the batch descriptor, then fetches all task descriptors from memory and processes them \cite[Sec. 3.8]{intel:dsaspec}. An Engine can coordinate with the operating system in case it encounters a page fault, waiting on its resolution, if configured to do so, while otherwise, an error will be generated in this scenario \cite[Sec. 2.2, Block on Fault]{intel:dsaspec}. \par
\textsc{Component \rom{3} Groups:} Groups tie Engines and \glsentrylong{dsa:wq}s together, indicated by the dotted blue line around the components of Group 0 in Figure \ref{fig:dsa-internal-block}. This means, that tasks from one \gls{dsa:wq} may be processed from multiple Engines and vice-versa, depending on the configuration. This flexibility is achieved through the Group Arbiter, represented by the orange block in Figure \ref{fig:dsa-internal-block}, which connects the two components according to the user-defined configuration. \par
\textsc{Component \rom{3}, Groups:} Groups tie Engines and \glsentrylong{dsa:wq}s together, indicated by the dotted blue line around the components of Group 0 in Figure \ref{fig:dsa-internal-block}. This means, that tasks from one \gls{dsa:wq} may be processed from multiple Engines and vice-versa, depending on the configuration. This flexibility is achieved through the Group Arbiter, represented by the orange block in Figure \ref{fig:dsa-internal-block}, which connects the two components according to the user-defined configuration. \par
\subsubsection{Virtual Address Resolution}
@ -81,7 +88,7 @@ Ordering of operations is only guaranteed for a configuration with one \gls{dsa:
\caption{\glsentrylong{dsa} Software View \cite[Fig. 1 (b)]{intel:analysis}. Illustrating the software stack and internal interactions from user applications, through the driver to the portal for work submission.}
@ -99,7 +106,7 @@ With some limitations, like lacking support for \gls{dsa:dwq} submission, this l
As mentioned in Subsection \ref{subsec:state:dsa-software-view}, \gls{intel:dml} offers a high level interface for interacting with the hardware accelerator, specifically Intel \gls{dsa}. Opting for the C++ interface, we will now demonstrate its usage by example of a simple memcopy implementation for the \gls{dsa}. \par
\caption{\glsentrylong{dml} Memcpy Implementation Pseudocode. Performs copy operation of a block of memory from source to destination. The \glsentryshort{dsa} executing this copy can be selected with the parameter \texttt{node}, and the template parameter \texttt{path} elects whether to run on hardware (Intel \glsentryshort{dsa}) or software (CPU).}