% Bachelor's thesis. Topic: Data Movement in Heterogeneous Memories with Intel Data Streaming Accelerator
\chapter{Technical Background}
\label{chap:state}
% Two essential tasks are accomplished here:
% 1. The reader must be taught everything needed to understand the
% later chapters. In particular, in our field the system prerequisites
% that are used later have to be clarified. It is also acceptable to
% refer to tutorials or similar material that is available on the
% net.
% 2. It must become clear what work is being done elsewhere on this
% problem. In particular, the gaps left by others should of course
% become clear. Why is one's own work, one's own approach, important for
% advancing the state of the art? This chapter is skipped by many
% readers (but not by the reviewer ;-), and in later publications
% "Related Work" is an important matter as well.
% Many readers later find that they do need some of the fundamentals
% and flip back. It is therefore good to have backward references in
% later chapters, written such that the sections referred to can also
% be read on their own. This chapter can become relatively long; the
% larger the context of the work, the longer. It is worth it! The text
% can possibly be reused by making it available on the net as a
% "tutorial" on a field.
% This sometimes earns valuable hints from colleagues. This chapter is
% usually written first and is the easiest (or the hardest, because it
% is the first).
This chapter introduces the technologies and concepts relevant to this thesis. The goal of this thesis is to apply the \glsentrylong{dsa} to the concept of \glsentrylong{qdp}; therefore, we introduce both. We also give background on \glsentrylong{hbm}, a memory technology that complements the \glsentryshort{dram} used as main memory in current computers.
\section{\glsentrylong{hbm}}
\label{sec:state:hbm}
\begin{figure}[!b]
\centering
\includegraphics[width=0.9\textwidth]{images/hbm-design-layout.png}
\caption{\glsentrylong{hbm} Design Layout. Shows that an \glsentryshort{hbm}-module consists of stacked DRAM and a logic die. \cite{amd:hbmoverview}}
\label{fig:hbm-layout}
\end{figure}
\glsentrylong{hbm} is an emerging memory technology that promises an increase in peak bandwidth. As visible in Figure \ref{fig:hbm-layout}, it consists of stacked \glsentryshort{dram} dies \cite[p. 1]{hbm-arch-paper} and is gradually being integrated into server processors, with the Intel® Xeon® Max Series \cite{intel:xeonmaxbrief} being one recent example. \gls{hbm} on these systems can be configured in different memory modes, most notably, \enquote{HBM Flat Mode} and \enquote{HBM Cache Mode} \cite{intel:xeonmaxbrief}. The former gives applications direct control, requiring code changes, while the latter utilizes the \gls{hbm} as a cache for the system's \glsentryshort{dram}-based main memory \cite{intel:xeonmaxbrief}. \par
\section{\glsentrylong{qdp}}
\label{sec:state:qdp}
\begin{figure}[!t]
\centering
\includegraphics[width=0.7\textwidth]{images/simple-query-graphic.pdf}
\caption{Illustration of a simple query in (a) and the corresponding pipeline in (b). \cite[Fig. 1]{dimes-prefetching}}
\label{fig:qdp-simple-query}
\end{figure}
\gls{qdp} introduces a targeted strategy for optimizing database performance by intelligently prefetching relevant data. To achieve this, \gls{qdp} analyses queries, splitting them into distinct sub-tasks, resulting in the so-called query execution plan. An example of a query and a corresponding plan is depicted in Figure \ref{fig:qdp-simple-query}. From this plan, \gls{qdp} determines columns in the database used in subsequent tasks. Once identified, the system proactively copies these columns into faster memory. For the example (Figure \ref{fig:qdp-simple-query}), column \texttt{b} is accessed in \(SCAN_b\) and \(G_{sum(b)}\) and column \texttt{a} is only accessed for \(SCAN_a\). Therefore, only column \texttt{b} will be chosen for prefetching in this scenario. \cite{dimes-prefetching} \par
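The column selection described above can be illustrated with a minimal C++ sketch. It assumes that a plan is available as a list of operators together with the columns each of them reads, and it approximates \enquote{used in subsequent tasks} as \enquote{read by more than one operator}. The type \texttt{PlanOperator} and the functions \texttt{select\_prefetch\_columns}, \texttt{example\_plan} and \texttt{check\_example\_plan} are hypothetical names introduced for illustration only; they are not taken from \cite{dimes-prefetching}, whose actual selection heuristics may differ. \par

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// One operator of a query execution plan together with the columns it reads.
// The type and its fields are hypothetical and only serve this sketch.
struct PlanOperator {
    std::string name;
    std::vector<std::string> columns;
};

// Returns the columns that are read by more than one operator of the plan.
// Assumption: "used in subsequent tasks" is approximated by "accessed by at
// least two distinct operators"; the real heuristics may differ.
std::set<std::string> select_prefetch_columns(const std::vector<PlanOperator>& plan) {
    std::map<std::string, std::set<std::string>> readers;  // column -> operators
    for (const PlanOperator& op : plan) {
        for (const std::string& column : op.columns) {
            readers[column].insert(op.name);
        }
    }
    std::set<std::string> selected;
    for (const auto& entry : readers) {
        if (entry.second.size() > 1) {
            selected.insert(entry.first);
        }
    }
    return selected;
}

// The plan from the example: column a is only scanned, while column b is
// scanned and aggregated.
std::vector<PlanOperator> example_plan() {
    return {
        {"SCAN_a", {"a"}},
        {"SCAN_b", {"b"}},
        {"G_sum(b)", {"b"}},
    };
}

// Usage: for the example plan, only column b is selected for prefetching.
inline void check_example_plan() {
    const std::set<std::string> selected = select_prefetch_columns(example_plan());
    assert(selected == std::set<std::string>{"b"});
}
```

For the plan from Figure \ref{fig:qdp-simple-query}, the sketch selects only column \texttt{b}, which matches the reasoning given above. \par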
Applying pipelining, \gls{qdp} processes tasks in parallel and in chunks. Therefore, a high degree of concurrency may be observed, resulting in demand for CPU cycles and memory bandwidth. As prefetching takes place in parallel with query processing, it creates additional CPU load, potentially diminishing gains from the acceleration of subsequent steps through the cached data. For this reason, we intend to offload copy operations to the \gls{dsa} in this work, reducing the CPU impact and thereby increasing the performance gains offered by prefetching. \cite{dimes-prefetching} \par
\section{\glsentrylong{dsa}}
\label{sec:state:dsa}
Introduced with the 4\textsuperscript{th} generation of Intel Xeon Scalable Processors, the \gls{dsa} aims to relieve the CPU from \enquote{common storage functions and operations such as data integrity checks and deduplication} \cite[p. 4]{intel:xeonbrief}. To fully utilize the hardware, a thorough understanding of its workings is essential. Therefore, we present an overview of the hardware architecture, the software stack, and the interaction between the two, beginning with the architectural details of the \gls{dsa} itself. Unless cited otherwise, all statements are based on Chapter 3 of the Architecture Specification by Intel \cite{intel:dsaspec}. \par
\subsection{Hardware Architecture}
\label{subsection:dsa-hwarch}
\begin{figure}[!t]
\centering
\includegraphics[width=0.9\textwidth]{images/block-dsa-hwarch.pdf}
\caption{\glsentrylong{dsa} Internal Architecture. Shows the components that the chip is made up of, how they are connected and which outside components the \glsentryshort{dsa} communicates with. \cite[Fig. 1 (a)]{intel:analysis}}
\label{fig:dsa-internal-block}
\end{figure}
The \gls{dsa} chip is directly integrated into the processor and attaches via the I/O fabric interface, which serves as the conduit for all communication. Through this interface, the \gls{dsa} is accessible as a PCIe device. Consequently, configuration utilizes memory-mapped registers set in the device's \gls{bar}. In a system with multiple processing nodes, there may also be one \gls{dsa} per node, resulting in up to four \gls{dsa} devices per socket in 4\textsuperscript{th} generation Intel Xeon Processors \cite[Sec. 3.1.1]{intel:dsaguide}. To accommodate various use cases, the layout of the \gls{dsa} is software-defined. The structure comprises three components, which we describe in detail. We also briefly explain how the \gls{dsa} resolves virtual addresses and signals operation completion. Finally, we detail the ordering of operation execution. \par
\subsubsection{Architectural Components}
\label{subsec:state:dsa-arch-comp}
\textsc{Component \rom{1}, \glsentrylong{dsa:wq}:} \glsentryshort{dsa:wq}s provide the means to submit tasks to the device. They are marked yellow in Figure \ref{fig:dsa-internal-block}. A \gls{dsa:wq} is accessible through so-called portals, light blue in Figure \ref{fig:dsa-internal-block}, which are memory-mapped regions. Work is submitted by writing a descriptor to one of these portals. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through the portal, the submitted descriptor reaches a queue. There are two queue types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes, and each Group may have only one attached. The method used to achieve this guarantee may result in a higher submission cost \cite[Sec. 3.3.1]{intel:dsaspec} compared to the \gls{dsa:dwq}, to which a descriptor is submitted via a regular write \cite[Sec. 3.3.2]{intel:dsaspec}. \par
\textsc{Component \rom{2}, Engine:} An Engine is the processing block that connects to memory and performs the described task. To handle the different descriptors, each Engine has two internal execution paths: one for task descriptors and one for batch descriptors. Processing a task descriptor is straightforward, as all information required to complete the operation is contained within it \cite[Sec. 3.2]{intel:dsaspec}. For a batch, the \gls{dsa} reads the batch descriptor, then fetches all task descriptors from memory and processes them \cite[Sec. 3.8]{intel:dsaspec}. If configured to do so, an Engine can coordinate with the operating system when it encounters a page fault and wait for its resolution; otherwise, an error is generated in this scenario \cite[Sec. 2.2, Block on Fault]{intel:dsaspec}. \par
\textsc{Component \rom{3}, Groups:} Groups tie Engines and \glsentrylong{dsa:wq}s together, indicated by the dotted blue line around the components of Group 0 in Figure \ref{fig:dsa-internal-block}. This means that tasks from one \gls{dsa:wq} may be processed by multiple Engines and vice versa, depending on the configuration. This flexibility is achieved through the Group Arbiter, represented by the orange block in Figure \ref{fig:dsa-internal-block}, which connects the two components according to the user-defined configuration. \par
\subsubsection{Virtual Address Resolution}
\label{subsubsec:state:dsa-vaddr}
An important aspect of computer systems is the abstraction of physical memory addresses through virtual memory \cite{virtual-memory}. A process submitting a task does not know the physical location of its data in memory, so the descriptor contains virtual addresses, which the \gls{dsa} must translate. To resolve these to physical addresses, the Engine communicates with the \gls{iommu}, as visible in the outward connections at the top of Figure \ref{fig:dsa-internal-block}. This resolution requires knowledge about the submitting process. Therefore, each task descriptor has a field for the \gls{x86:pasid}, which is filled by the instruction used for \gls{dsa:swq} submission \cite[Sec. 3.3.1]{intel:dsaspec} or set statically after a process is attached to a \gls{dsa:dwq} \cite[Sec. 3.3.2]{intel:dsaspec}. \par
\subsubsection{Completion Signalling}
\label{subsubsec:state:completion-signal}
The status of an operation on the \gls{dsa} is available in the form of a record, which is written to a memory location specified in the task descriptor. Applications can check for a change in value in this record to determine completion. Additionally, completion may be signalled by an interrupt. To facilitate this, the \gls{dsa} \enquote{provides two types of interrupt message storage: (1) an MSI-X table, enumerated through the MSI-X capability; and (2) a device-specific Interrupt Message Storage (IMS) table} \cite[Sec. 3.7]{intel:dsaspec}. \par
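Polling a completion record can be sketched as follows. This is a simplified C++ model and not the record layout of the specification: the type \texttt{CompletionRecord} contains only a status byte, and the encoding used here (0 for \enquote{not yet written}, 1 for success, any other value for an error) is an assumption made for this sketch. The names \texttt{poll\_completion} and \texttt{poll\_after\_device\_write} are hypothetical, and the latter merely simulates the device writing the record. \par

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Simplified stand-in for a completion record. The real record holds more
// fields; only a status byte is modelled here.
struct CompletionRecord {
    std::atomic<std::uint8_t> status{0};  // assumption: 0 means "not yet written"
};

enum class PollResult { Pending, Success, Failure };

// Assumption for this sketch: status 1 denotes success, every other
// non-zero value denotes an error.
constexpr std::uint8_t kStatusSuccess = 1;

// Polls the record until its status changes or max_polls attempts were made.
PollResult poll_completion(const CompletionRecord& record, std::uint64_t max_polls) {
    assert(max_polls > 0);
    for (std::uint64_t i = 0; i < max_polls; ++i) {
        const std::uint8_t status = record.status.load(std::memory_order_acquire);
        if (status == 0) {
            continue;  // the device has not written the record yet
        }
        return status == kStatusSuccess ? PollResult::Success : PollResult::Failure;
    }
    return PollResult::Pending;
}

// Simulates a device that wrote written_status into the record before
// polling starts, then polls the record.
PollResult poll_after_device_write(std::uint8_t written_status) {
    CompletionRecord record;
    record.status.store(written_status, std::memory_order_release);
    return poll_completion(record, 1000);
}
```

Bounding the number of polls keeps the caller from spinning indefinitely if the record is never written; a caller that prefers not to spin at all would rely on the interrupt mechanism instead. \par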
\subsubsection{Ordering Guarantees}
\label{subsubsec:state:ordering-guarantees}
Ordering of operations is only guaranteed for a configuration with one \gls{dsa:wq} and one Engine in a Group, and only when exclusively submitting either batch or task descriptors, not a mixture of both. Even in such cases, only write-ordering is guaranteed, implying that \enquote{reads by a subsequent descriptor can pass writes from a previous descriptor}. Challenges arise when an operation fails, as the \gls{dsa} will continue to process the following descriptors from the queue. Consequently, caution is necessary in read-after-write scenarios. This can be addressed by waiting for successful completion before submitting the dependent descriptor, inserting a drain descriptor for tasks, or setting the fence flag for a batch. The latter two methods inform the processing Engine that all writes must be committed and, in the case of the fence in a batch, that it must abort on a previous error. \cite[Sec. 3.9]{intel:dsaspec} \par
\subsection{Software View}
\label{subsec:state:dsa-software-view}
\begin{figure}[!t]
\centering
\includegraphics[width=0.5\textwidth]{images/block-dsa-swarch.pdf}
\caption{\glsentrylong{dsa} Software View. Illustrating the software stack and internal interactions from user applications, through the driver to the portal for work submission. \cite[Fig. 1 (b)]{intel:analysis}}
\label{fig:dsa-software-arch}
\end{figure}
Since Linux kernel version 5.10, a driver for the \gls{dsa} has been available, which currently lacks a counterpart on Windows operating systems \cite[Sec. Installation]{intel:dmldoc}. As a result, accessing the \gls{dsa} is only possible under Linux. To interact with the driver and perform configuration operations, Intel provides the accel-config user-space application \cite{intel:libaccel-config-repo}. This toolset offers a command-line interface and can read configuration files to set up the software-defined device layout mentioned in Section \ref{subsection:dsa-hwarch}. The interaction is illustrated in the upper block labelled \enquote{User space} in Figure \ref{fig:dsa-software-arch}, where accel-config communicates with the kernel driver, depicted in light green and labelled \enquote{IDXD}. Once successfully configured, each \gls{dsa:wq} is exposed as a character device, whose associated portal a process can map using \texttt{mmap} \cite[Sec. 3.3]{intel:analysis}. \par
While a process could theoretically submit work to the \gls{dsa} by manually preparing descriptors and submitting them via special instructions, this approach can be cumbersome. Hence, \gls{intel:dml} exists to streamline this process. Despite some limitations, such as the lack of support for \gls{dsa:dwq} submission, this library offers an interface that manages the creation and submission of descriptors, as well as error handling and reporting. The high-level abstraction also enables compatibility measures, allowing code developed for the \gls{dsa} to execute on machines without the required hardware \cite[Sec. High-level C++ API, Advanced usage]{intel:dmldoc}. \par
\section{Programming Interface for \glsentrylong{dsa}}
\label{sec:state:dml}
As mentioned in Section \ref{subsec:state:dsa-software-view}, \gls{intel:dml} offers a high-level interface for interacting with the hardware accelerator, specifically Intel \gls{dsa}. Opting for the C++ interface, we now demonstrate its usage with a simple memcpy implementation for the \gls{dsa}. \par
\begin{figure}[!t]
\centering
\includegraphics[width=0.9\textwidth]{images/nsd-dsamemcpy.pdf}
\caption{\glsentrylong{dml} Memcpy Implementation Pseudocode. Performs copy operation of a block of memory from source to destination. The \glsentryshort{dsa} executing this copy can be selected with the parameter \texttt{node}, and the template parameter \texttt{path} selects whether to run on hardware (Intel \glsentryshort{dsa}) or software (CPU).}
\label{fig:dml-memcpy}
\end{figure}
In the function header of Figure \ref{fig:dml-memcpy} two differences from standard memcpy are notable. Firstly, there is the template parameter named \texttt{path}, and secondly, an additional parameter \texttt{int node}. Both will be discussed in the following paragraphs. \par
The \texttt{path} parameter allows the selection of the executing device, which can be either the CPU or \gls{dsa}. The options include \texttt{dml::software} (CPU), \texttt{dml::hardware} (\gls{dsa}), and \texttt{dml::automatic}, where the latter dynamically selects the device at runtime, favoring \gls{dsa} over CPU execution \cite[Sec. Quick Start]{intel:dmldoc}. \par
Choosing the engine which carries out the copy might be advantageous for performance, as we will see in Section \ref{subsec:perf:datacopy}. This can be achieved either by pinning the current thread to the \gls{numa:node} that the device is located on, or by using optional parameters of \texttt{dml::submit} \cite[Sec. High-level C++ API, NUMA support]{intel:dmldoc}. As evident from Figure \ref{fig:dml-memcpy}, we chose the former option for this example, using \texttt{numa\_run\_on\_node} to restrict the current thread to run on the given node. As this is only an example, potential side effects of calling this pseudocode, which arise from modifying the \glsentryshort{numa} assignment of the thread, are not considered. \par
\gls{intel:dml} operates on data views, which we create from the given pointers to source and destination and size. This is done using \texttt{dml::make\_view(uint8\_t* ptr, size\_t size)}, visible in Figure \ref{fig:dml-memcpy}, where these views are labelled \texttt{src\_view} and \texttt{dst\_view}. \cite[Sec. High-level C++ API, Make view]{intel:dmldoc} \par
In Figure \ref{fig:dml-memcpy}, we submit a single descriptor using the asynchronous operation from \gls{intel:dml}. This uses the function \texttt{dml::submit<path>}, which takes an operation type and parameters specific to the selected type and returns a handler to the submitted task. For the copy operation, we pass the two views created previously. The provided handler can later be queried for the completion of the operation. After submission, we poll for the task completion with \texttt{handler.get()} and check whether the operation completed successfully. \par
A noteworthy addition to the submission call is the use of \texttt{.block\_on\_fault()}, which enables the \gls{dsa} to manage a page fault by coordinating with the operating system. Note that this functionality only works if the device is configured to accept this flag \cite[Sec. High-level C++ API, How to Use the Library]{intel:dmldoc} \cite[Sec. High-level C++ API, Page Fault handling]{intel:dmldoc}. \par
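The steps discussed in this section can be put together in a hedged C++ sketch. It is a reading of the description above and not a copy of the pseudocode in Figure \ref{fig:dml-memcpy}; the function name \texttt{dsa\_memcpy}, its boolean return value, the alias \texttt{sketch\_path} and the helper \texttt{copy\_roundtrip\_ok} are assumptions. So that the sketch also compiles on machines without \gls{intel:dml} or libnuma, it checks for the headers with \texttt{\_\_has\_include} and otherwise falls back to \texttt{std::memcpy}; this fallback is an addition of the sketch. When the libraries are present, the program has to be linked against them. \par

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__has_include)
#if __has_include(<dml/dml.hpp>) && __has_include(<numa.h>)
#define SKETCH_HAS_DML 1
#endif
#endif

#ifdef SKETCH_HAS_DML
#include <dml/dml.hpp>
#include <numa.h>
using sketch_path = dml::software;  // dml::hardware would select the DSA
#else
struct sketch_path {};  // placeholder tag when Intel DML is not installed
#endif

// Copies size bytes from src to dst and returns true on success.
// The parameter node selects the NUMA node the calling thread is pinned to.
template <typename path>
bool dsa_memcpy(void* dst, const void* src, std::size_t size, int node) {
    assert(dst != nullptr && src != nullptr);
#ifdef SKETCH_HAS_DML
    // Pin the calling thread to the node whose device should run the copy.
    if (numa_available() < 0 || numa_run_on_node(node) != 0) {
        return false;
    }

    // The library operates on views of the source and destination buffers.
    auto src_view = dml::make_view(static_cast<const std::uint8_t*>(src), size);
    auto dst_view = dml::make_view(static_cast<std::uint8_t*>(dst), size);

    // Submit one copy descriptor asynchronously, then wait for its result.
    auto handler = dml::submit<path>(dml::mem_copy.block_on_fault(), src_view, dst_view);
    auto result = handler.get();
    return result.status == dml::status_code::ok;
#else
    // Fallback of this sketch only: plain CPU copy without Intel DML.
    (void)node;
    std::memcpy(dst, src, size);
    return true;
#endif
}

// Usage: copies a small buffer on node 0 and verifies the destination.
inline bool copy_roundtrip_ok() {
    const char src[] = "data movement";
    char dst[sizeof(src)] = {};
    const bool ok = dsa_memcpy<sketch_path>(dst, src, sizeof(src), 0);
    return ok && std::memcmp(dst, src, sizeof(src)) == 0;
}
```

Pinning the thread changes its scheduling for the remainder of its lifetime, so a caller that depends on its previous placement would have to restore it afterwards. \par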
\section{System Setup and Configuration} \label{sec:state:setup-and-config}
In this section we provide a step-by-step guide to replicate the configuration being used for benchmarks and testing purposes in the following chapters. While Intel's guide on \gls{dsa} usage was a useful resource, we also consulted articles for setup on Lenovo ThinkSystem Servers for crucial information not present in the former. It is important to note that instructions for configuring the HBM access mode, as mentioned in Section \ref{sec:state:hbm}, may vary from system to system and can require extra steps not covered in the list below. \par
\begin{enumerate}
\item Set \enquote{Memory Hierarchy} to Flat \cite[Sec. Configuring HBM, Configuring Flat Mode]{lenovo:hbm}, \enquote{VT-d} to Enabled in BIOS \cite[Sec. 2.1]{intel:dsaguide} and, if available, \enquote{Limit CPU PA to 46 bits} to Disabled in BIOS \cite[p. 5]{lenovo:dsa}
\item Use a kernel with IDXD driver support, available since Linux 5.10 \cite[Sec. Installation]{intel:dmldoc}, and append the following to the kernel boot parameters in the GRUB configuration: \texttt{intel\_iommu=on,sm\_on} \cite[p. 5]{lenovo:dsa}
\item Verify correct detection of the \gls{dsa} devices using \texttt{dmesg | grep idxd}, which should list as many devices as there are NUMA nodes on the system \cite[p. 5]{lenovo:dsa}
\item Install \texttt{accel-config} from repository \cite{intel:libaccel-config-repo} or system package manager and inspect the detection of \gls{dsa} devices through the driver using \texttt{accel-config list -i} \cite[p. 6]{lenovo:dsa}
\item Create a \gls{dsa} configuration file, for which we provide an example under \texttt{benchmarks/configuration-files/8n1d1e1w.conf} in the accompanying repository \cite{thesis-repo} that is also applied for the benchmarks. Then apply the configuration using \texttt{accel-config load-config -c [filename] -e} \cite[Fig. 3-9]{intel:dsaguide}
\item Inspect the now configured \gls{dsa} devices using \texttt{accel-config list} \cite[p. 7]{lenovo:dsa}; the output should match the desired configuration set in the file used
\end{enumerate}
%%% Local Variables:
%%% TeX-master: "diplom"
%%% End: