\glsentrylong{hbm} is a novel memory technology that promises an increase in peak bandwidth. It consists of stacked \acrshort{dram} dies \cite[p. 1]{hbm-arch-paper} and is gradually being integrated into server processors, with the Intel® Xeon® Max Series \cite{intel:xeonmaxbrief} being one recent example. \gls{hbm} on these systems can be configured in different memory modes, most notably, HBM Flat Mode and HBM Cache Mode \cite{intel:xeonmaxbrief}. The former gives applications direct control, requiring code changes, while the latter utilizes the \gls{hbm} as a cache for the system's \gls{dram}-based main memory \cite{intel:xeonmaxbrief}. \par
\section{\glsentrylong{qdp}}
\blockquote{Intel \gls{dsa} is a high-performance data copy and transformation accelerator that will be integrated in future Intel® processors, targeted for optimizing streaming data movement and transformation operations common with applications for high-performance storage, networking, persistent memory, and various data processing applications. \cite[Ch. 1]{intel:dsaspec}}
Introduced with the \(4^{th}\) generation of Intel Xeon Scalable Processors, the \gls{dsa} aims to relieve the CPU from \enquote{common storage functions and operations such as data integrity checks and deduplication} \cite[p. 4]{intel:xeonbrief}. To fully utilize the hardware, a thorough understanding of its workings is essential. Therefore, we present an overview of the architecture, software, and the interaction of these two components, delving into the architectural details of the \gls{dsa} itself. All statements are based on Chapter 3 of the Architecture Specification by Intel. \par
\label{fig:dsa-internal-block}
\end{figure}
The \gls{dsa} chip is directly integrated into the processor and attaches via the I/O fabric interface, serving as the conduit for all communication. Through this interface, the \gls{dsa} is accessible as a PCIe device. Consequently, configuration utilizes memory-mapped registers set in the device's \gls{bar}. Through these registers, the device's layout is defined and memory pages for work submission are set. In a system with multiple processing nodes, there may also be one \gls{dsa} per node, resulting in up to four DSA devices per socket in \(4^{th}\) generation Intel Xeon Processors \cite[Sec. 3.1.1]{intel:dsaguide}. To accommodate various use cases, the layout of the \gls{dsa} is software-defined. The structure comprises three components, which we will describe in detail. We also briefly explain how the \gls{dsa} resolves virtual addresses and signals operation completion. Lastly, we detail the ordering guarantees for operation execution. \par
\subsubsection{Architectural Components}
\textrm{Component \rom{1} \glsentrylong{dsa:wq}:} \gls{dsa:wq}s provide the means to submit tasks to the device and are marked yellow in Figure \ref{fig:dsa-internal-block}. A \gls{dsa:wq} is accessible through so-called portals, light blue in Figure \ref{fig:dsa-internal-block}, which are mapped memory regions. Submission of work is done by writing a descriptor to one of these. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue. There are two possible queue types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes, and each group may only have one attached. A \gls{pcie-dmr}, which guarantees implicit synchronization, is generated via \gls{x86:enqcmd} and communicates with the device before writing \cite[Sec. 3.3.1]{intel:dsaspec}. This may result in a higher submission cost compared to the \gls{dsa:dwq}, to which a descriptor is submitted via \gls{x86:movdir64b} \cite[Sec. 3.3.2]{intel:dsaspec}. A code sketch of descriptor preparation and submission is given after this component overview. \par
\textrm{Component \rom{2} Engine:} An Engine is the processing block that connects to memory and performs the described task. To handle the different descriptors, each Engine has two internal execution paths: one for task and one for batch descriptors. Processing a task descriptor is straightforward, as all information required to complete the operation is contained within it \todo{cite this}. For a batch, the \gls{dsa} reads the batch descriptor, then fetches all task descriptors from memory and processes them \cite[Sec. 3.8]{intel:dsaspec}. If configured to do so, an Engine can coordinate with the operating system in case it encounters a page fault, waiting on its resolution; otherwise, an error is generated in this scenario \cite[Sec. 2.2, Block on Fault]{intel:dsaspec}. \par
\textrm{Component \rom{3} Groups:} Groups tie Engines and \glsentrylong{dsa:wq}s together, indicated by the dotted blue line around the components of Group 0 in Figure \ref{fig:dsa-internal-block}. This means that tasks from one \gls{dsa:wq} may be processed by multiple Engines and vice-versa, depending on the configuration. This flexibility is achieved through the Group Arbiter, represented by the orange block in Figure \ref{fig:dsa-internal-block}, which connects the two components according to the user-defined configuration. \par
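To make the submission path more tangible, the following minimal sketch prepares a memory-move task descriptor and writes it to a portal. It is not part of \gls{intel:dml}: the structure and constant names stem from the Linux uapi header \texttt{linux/idxd.h}, the portal is assumed to be already mapped (see Subsection \ref{subsec:state:dsa-software-view}), and the intrinsics require compiling with \texttt{-mmovdir64b} and \texttt{-menqcmd}.
\begin{verbatim}
#include <linux/idxd.h>  // struct dsa_hw_desc, DSA_OPCODE_*, IDXD_OP_FLAG_*
#include <immintrin.h>   // _movdir64b, _enqcmd
#include <cstdint>

// Sketch: build a 64-byte memory-move task descriptor and submit it
// to a mapped WQ portal. Source and destination are virtual addresses,
// resolved by the device via the PASID.
void submit_memmove(void* portal, bool shared_wq, void* src, void* dst,
                    uint32_t size, volatile dsa_completion_record* comp) {
    dsa_hw_desc desc = {};
    desc.opcode          = DSA_OPCODE_MEMMOVE;
    desc.flags           = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
    desc.src_addr        = reinterpret_cast<uintptr_t>(src);
    desc.dst_addr        = reinterpret_cast<uintptr_t>(dst);
    desc.xfer_size       = size;
    desc.completion_addr = reinterpret_cast<uintptr_t>(comp); // 32-byte aligned

    if (shared_wq) {
        // ENQCMD is non-posted: a non-zero return signals a full queue.
        while (_enqcmd(portal, &desc) != 0) { /* retry */ }
    } else {
        // MOVDIR64B is posted: no acceptance status is returned.
        _movdir64b(portal, &desc);
    }
}

// A batch descriptor instead references an array of task descriptors:
//   desc.opcode         = DSA_OPCODE_BATCH;
//   desc.desc_list_addr = reinterpret_cast<uintptr_t>(task_array);
//   desc.desc_count     = n_tasks;
\end{verbatim}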
\subsubsection{Virtual Address Resolution}
An important aspect of modern computer systems is the separation of address spaces through virtual memory. Therefore, the \gls{dsa} must handle address translation: a process submitting a task will not know the physical location of its data in memory, so the descriptor contains virtual addresses. To resolve these to physical addresses, the Engine communicates with the \gls{iommu}, as visible in the outward connections at the top of Figure \ref{fig:dsa-internal-block}. Knowledge about the submitting process is required for this resolution. Therefore, each task descriptor has a field for the \gls{x86:pasid}, which is filled by the \gls{x86:enqcmd} instruction for a \gls{dsa:swq} \cite[Sec. 3.3.1]{intel:dsaspec} or set statically after a process is attached to a \gls{dsa:dwq} \cite[Sec. 3.3.2]{intel:dsaspec}. \par
\subsubsection{Completion Signalling}
\label{subsubsec:state:completion-signal}
The status of an operation on the \gls{dsa} is available in the form of a record, which is written to a memory location specified in the task descriptor. Applications can check for a change in value in this record to determine completion. Additionally, completion may be signalled by an interrupt. To facilitate this, the \gls{dsa} \enquote{provides two types of interrupt message storage: (1) an MSI-X table, enumerated through the MSI-X capability; and (2) a device-specific Interrupt Message Storage (IMS) table} \cite[Sec. 3.7]{intel:dsaspec}. \par
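A minimal polling sketch for such a completion record is given below, again using the type and constant from the Linux uapi header \texttt{linux/idxd.h} rather than the \gls{intel:dml} API. It assumes that the status byte was zero-initialized before submission.
\begin{verbatim}
#include <linux/idxd.h>  // struct dsa_completion_record, DSA_COMP_SUCCESS
#include <immintrin.h>   // _mm_pause

// Sketch: busy-wait until the device writes a non-zero status into
// the completion record, then report success or failure.
bool wait_for_completion(volatile dsa_completion_record* comp) {
    while (comp->status == 0) {
        _mm_pause();  // reduce core pressure while spinning
    }
    // DSA_COMP_SUCCESS (0x01); other non-zero values encode errors.
    return comp->status == DSA_COMP_SUCCESS;
}
\end{verbatim}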
\subsubsection{Ordering Guarantees}
Ordering of operations is only guaranteed for a configuration with one \gls{dsa:wq} and one Engine in a Group when exclusively submitting either batch or task descriptors, but not a mixture of both. Even in such cases, only write-ordering is guaranteed, implying that \enquote{reads by a subsequent descriptor can pass writes from a previous descriptor}. Challenges arise when an operation fails, as the \gls{dsa} will continue to process the following descriptors from the queue. Consequently, caution is necessary in read-after-write scenarios. This can be addressed by either waiting for successful completion before submitting the dependent descriptor, inserting a drain descriptor for tasks, or setting the fence flag for a batch. The latter two methods inform the processing engine that all writes must be committed and, in the case of the fence in a batch, to abort on a previous error. \cite[Sec. 3.9]{intel:dsaspec} \par
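As a brief illustration using the descriptor sketch from the previous subsection, the fence flag would be set on the descriptor inside a batch's task array whose reads depend on earlier writes (a hypothetical \texttt{task\_array}):
\begin{verbatim}
// Task i of a batch: all writes of tasks 0..i-1 must be committed
// before task i executes; abort the batch on a previous error.
task_array[i].flags |= IDXD_OP_FLAG_FENCE;
\end{verbatim}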
\subsection{Software View}
\label{subsec:state:dsa-software-view}
\label{fig:dsa-software-arch}
\end{figure}
Since Linux Kernel 5.10, there exists a driver for the \gls{dsa} which has no counterpart in the Windows OS family \cite[Sec. Installation]{intel:dmldoc} or any other operating system. Therefore, accessing the \gls{dsa} is only possible under Linux. To interact with the driver and perform configuration operations, Intel's accel-config \cite{intel:libaccel-config-repo} user-space toolset can be utilized. This application provides a command-line interface and can read configuration files to set up the device. The interaction is depicted in the upper block titled \enquote{User space} in Figure \ref{fig:dsa-software-arch}. It interacts with the kernel driver, visible in light green and labelled \enquote{IDXD} in Figure \ref{fig:dsa-software-arch}. After successful configuration, each \gls{dsa:wq} is exposed as a character device, and its associated portal is accessed through \texttt{mmap} \cite[Sec. 3.3]{intel:analysis}. \par
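The following sketch illustrates this last step, assuming a \gls{dsa:wq} that was configured and enabled under the system-dependent path \texttt{/dev/dsa/wq0.0}:
\begin{verbatim}
#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap
#include <unistd.h>    // close

// Sketch: obtain a submission portal for a configured WQ by mapping
// the character device exposed by the IDXD driver.
void* map_wq_portal(const char* path /* e.g. "/dev/dsa/wq0.0" */) {
    int fd = open(path, O_RDWR);
    if (fd < 0) return nullptr;

    // Map one page; descriptors are written to this region.
    void* portal = mmap(nullptr, 0x1000, PROT_WRITE,
                        MAP_SHARED | MAP_POPULATE, fd, 0);
    close(fd);  // the mapping remains valid after close
    return (portal == MAP_FAILED) ? nullptr : portal;
}
\end{verbatim}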
With the appropriate file permissions, a process could submit work to the \gls{dsa} using either the \gls{x86:movdir64b} or \gls{x86:enqcmd} instruction, providing the descriptors by manual configuration. However, this process can be cumbersome, which is why \gls{intel:dml} exists. \par
With some limitations, such as lacking support for \gls{dsa:dwq} submission, this library presents an interface that takes care of the creation and submission of descriptors as well as error handling and reporting. Thanks to this high-level view, code using the library may choose a different execution path at runtime, allowing memory operations to be executed either in hardware on an accelerator or in software using equivalent instructions provided by the library. This makes such code automatically compatible with systems that do not provide hardware support. \cite[Sec. Introduction]{intel:dmldoc} \par
\section{Programming Interface}
\label{sec:state:dml}
As mentioned in Subsection \ref{subsec:state:dsa-software-view}, \gls{intel:dml} offers a high-level interface for interacting with the hardware accelerator, specifically Intel \gls{dsa}. Opting for the C++ interface, we will now demonstrate its usage with the example of a simple memcopy implementation for the \gls{dsa}. \par
\begin{figure}[h]
\centering
\label{fig:dml-memcpy}
\end{figure}
In the function header of Figure \ref{fig:dml-memcpy}, two differences from the standard memcpy signature are notable. The first is the template parameter named \texttt{path}; the second is the additional parameter \texttt{int node}. Both will be discussed in the following paragraphs. \par
The \texttt{path} parameter allows the selection of the executing device, which can be either the CPU or \gls{dsa}. The options include \texttt{dml::software} (CPU), \texttt{dml::hardware} (\gls{dsa}), and \texttt{dml::automatic}, where the latter dynamically selects the device at runtime, favoring \gls{dsa} over CPU execution \cite[Sec. Quick Start]{intel:dmldoc}. \par
Choosing the engine that carries out the copy might be advantageous for performance, as we will see in Subsection \ref{subsec:perf:datacopy}. With the engine directly tied to the processing node, as observed in Subsection \ref{subsection:dsa-hwarch}, the node ID is equivalent to the ID of the \gls{dsa}. As the library has limited NUMA support and only utilizes the \gls{dsa} on the node to which the current thread is assigned, we must pin the executing thread to the node on which the desired \gls{dsa} resides. This is the purpose of the parameter \texttt{int node}, which is used in the first step of Figure \ref{fig:dml-memcpy}, where we set the node assignment accordingly using \texttt{numa\_run\_on\_node(node)}; further information can be found in the respective manpage of libnuma \cite{man-libnuma}. \par
\gls{intel:dml} operates on data views, which we create from the given pointers to source and destination and the size. This is done using \texttt{dml::make\_view(uint8\_t* ptr, size\_t size)}, visible in Figure \ref{fig:dml-memcpy}, where these views are labelled \texttt{src\_view} and \texttt{dst\_view}. \cite[Sec. High-level C++ API, Make view]{intel:dmldoc} \par
In Figure \ref{fig:dml-memcpy}, we submit a single descriptor using the asynchronous operation from \gls{intel:dml}. This uses the function \texttt{dml::submit<path>}, which takes an operation type and parameters specific to the selected type and returns a handler to the submitted task. For the copy operation, we pass the two views created previously. After submission, we poll for task completion with \texttt{handler.get()} and check whether the operation completed successfully. \par
A noteworthy addition to the submission call is the use of \texttt{.block\_on\_fault()}, enabling the \gls{dsa} to manage a page fault by coordinating with the operating system. It is essential to highlight that this functionality only operates if the device is configured to accept this flag. \cite[Sec. High-level C++ API, How to Use the Library]{intel:dmldoc} \cite[Sec. High-level C++ API, Page Fault handling]{intel:dmldoc} \par
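Putting these steps together, an implementation along the lines of Figure \ref{fig:dml-memcpy} could look as follows. This is a sketch following the description above, not a verbatim reproduction of the figure:
\begin{verbatim}
#include <dml/dml.hpp>
#include <numa.h>      // numa_run_on_node, link with -lnuma
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Sketch of the memcopy from Figure dml-memcpy: pin the calling
// thread to the node of the desired DSA, then copy via Intel DML.
template <typename path>
void dml_memcpy(uint8_t* dst, uint8_t* src, size_t size, int node) {
    // Step 1: run on the node whose DSA shall execute the copy.
    if (numa_run_on_node(node) != 0)
        throw std::runtime_error("numa_run_on_node failed");

    // Step 2: wrap the raw pointers in DML data views.
    auto src_view = dml::make_view(src, size);
    auto dst_view = dml::make_view(dst, size);

    // Step 3: submit asynchronously; block_on_fault() lets the DSA
    // resolve page faults with the OS (requires matching WQ config).
    auto handler = dml::submit<path>(dml::mem_copy.block_on_fault(),
                                     src_view, dst_view);

    // Step 4: wait for completion and check the status code.
    if (handler.get().status != dml::status_code::ok)
        throw std::runtime_error("DSA memcopy failed");
}

// Usage: dml_memcpy<dml::hardware>(dst, src, size, 0);
\end{verbatim}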
\section{System Setup and Configuration}\label{sec:state:setup-and-config}
In this section, we provide a step-by-step guide to replicate the configuration used for benchmarks and testing in the following chapters. While Intel's guide on \gls{dsa} usage was a useful resource, we also consulted articles on setup for Lenovo ThinkSystem Servers for crucial information not present in the former. Note that instructions for configuring the HBM access mode, as mentioned in Section \ref{sec:state:hbm}, may vary from system to system and can require extra steps not covered in the list below. \par
\begin{enumerate}
\item Set \enquote{Memory Hierarchy} to Flat \cite[Sec. Configuring HBM, Configuring Flat Mode]{lenovo:hbm}, \enquote{VT-d} to Enabled in BIOS \cite[Sec. 2.1]{intel:dsaguide} and, if available, \enquote{Limit CPU PA to 46 bits} to Disabled in BIOS \cite[p. 5]{lenovo:dsa}