
rework chapter 2 with suggestions after review by andre

master
Constantin Fürst 11 months ago
commit 97acc7403c
  1. BIN thesis/bachelor.pdf
  2. 36 thesis/content/20_state.tex


thesis/content/20_state.tex

@@ -41,55 +41,47 @@
\section{Intel Data Streaming Accelerator}
\blockquote{Intel \gls{dsa} is a high-performance data copy and transformation accelerator that will be integrated in future Intel® processors, targeted for optimizing streaming data movement and transformation operations common with applications for high-performance storage, networking, persistent memory, and various data processing applications. \cite[15]{intel:dsaspec}}
\blockquote{Intel \gls{dsa} is a high-performance data copy and transformation accelerator that will be integrated in future Intel® processors, targeted for optimizing streaming data movement and transformation operations common with applications for high-performance storage, networking, persistent memory, and various data processing applications. \cite[Ch. 1]{intel:dsaspec}}
Introduced with the 4th generation of Intel Xeon Scalable Processors \cite{intel:xeonbrief}, the \gls{dsa} promises to alleviate the CPU from \enquote{common storage functions and operations such as data integrity checks and deduplication} \cite{intel:xeonbrief}. This chapter will give an overview of the architecture, software and the interaction of these two components. The reader will be familiarized with the setup and equipped with the knowledge to configure the system for a specific use case.
\todo{consider adding projected use cases as in the architecture specification here}
To be able to optimally utilize the Hardware, knowledge of its workings is required to make educated decisions. Therefore, this section describes both the workings of the \gls{dsa} engine itself and the view that is presented through software interfaces. All statements are based on Chapter 3 of the Architecture Specification by Intel \cite{intel:dsaspec}. \par
Introduced with the \(4^{th}\) generation of Intel Xeon Scalable Processors, the \gls{dsa} promises to relieve the CPU of \enquote{common storage functions and operations such as data integrity checks and deduplication} \cite[p. 4]{intel:xeonbrief}. To utilize the hardware optimally, knowledge of its workings is required. Therefore, we present an overview of the architecture, the software, and the interaction of these two components, going into detail on the workings of the \gls{dsa} engine itself. All statements in this section are based on Chapter 3 of the Architecture Specification by Intel \cite{intel:dsaspec}. \par
\subsection{Hardware Architecture} \label{subsection:dsa-hwarch}
\begin{figure}[H]
\centering
\includegraphics[width=0.9\textwidth]{images/dsa-internal-block-diagram.png}
\caption{\\ \acrshort{dsa} Internal Architecture Block Diagram \\ Taken from Figure 1a of \cite{intel:analysis}}
\caption{DSA Internal Architecture \cite[Fig. 1 (a)]{intel:analysis}}
\label{fig:dsa-internal-block}
\end{figure}
The accelerator is directly integrated into the Processor and attaches via the I/O fabric interface over which all communication is conducted. Over this interface, it is accessible as a PCIe device. Configuration therefore is done through memory-mapped registers set in the devices \gls{bar}. Through these, the devices' layout is defined and memory pages for work submission are set. In a system with multiple processing nodes, there may also be one \gls{dsa} per node.\par
The \gls{dsa} chip is directly integrated into the processor and attaches via the I/O fabric interface, over which all communication is conducted. Through this interface, it is accessible as a PCIe device. Configuration therefore utilizes memory-mapped registers set in the device's \gls{bar}. Through these, the device's layout is defined and memory pages for work submission are set. In a system with multiple processing nodes, there may also be one \gls{dsa} per node, resulting in four being present on the previously mentioned Xeon Max CPU. \par
To satisfy different use cases, as already mentioned, the layout of the \gls{dsa} may be software-defined. The structure is made up of three components, namely \gls{dsa:wq}s, \gls{dsa:engine}s and \gls{dsa:group}s. \gls{dsa:wq}s provide the means to submit tasks to the device and will be described in more detail shortly. An \gls{dsa:engine} is the processing-block that connects to memory and performs the described task. Using \gls{dsa:group}s, \gls{dsa:engine}s and \gls{dsa:wq}s are tied together. This means, that tasks from one \gls{dsa:wq} may be processed from multiple \gls{dsa:engine}s and that vice-versa, depending on the configuration. This flexibility is achieved through the Group Arbiter which connects the two components and acts according to the setup. \par
To satisfy different use cases, the layout of the \gls{dsa} may be software-defined. The structure is made up of three components, namely \gls{dsa:wq}, \gls{dsa:engine} and \gls{dsa:group}. \gls{dsa:wq}s provide the means to submit tasks to the device and will be described in more detail shortly. They are marked yellow in Figure \ref{fig:dsa-internal-block}. An \gls{dsa:engine} is the processing block that connects to memory and performs the described task. The grey block in Figure \ref{fig:dsa-internal-block} shows the subcomponents that make up an engine and the different internal paths for a batch or task descriptor. Using \gls{dsa:group}s, \gls{dsa:engine}s and \gls{dsa:wq}s are tied together, as indicated by the dotted blue line around the components of Group 0 in Figure \ref{fig:dsa-internal-block}. This means that tasks from one \gls{dsa:wq} may be processed by multiple \gls{dsa:engine}s and vice versa, depending on the configuration. This flexibility is achieved through the Group Arbiter, represented by the orange block in Figure \ref{fig:dsa-internal-block}, which connects the two components according to the user-defined configuration. \par
A \gls{dsa:wq} is accessible through so-called portals, which are mapped memory regions. Submission of work is done by writing a descriptor to one of these portals. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue of which there are two types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes and each group may only have one attached. A \gls{pcie-dmr}, which guarantees implicit synchronization, is generated via \gls{x86:enqcmd} and communicates with the device before writing. This results in higher submission cost, compared to the \gls{dsa:dwq} to which a descriptor is submitted via \gls{x86:movdir64b}. The \gls{dsa:dwq} is therefore more performant but may require access control mechanisms and may only be accessed by one process at a time. \par
A \gls{dsa:wq} is accessible through so-called portals, light blue in Figure \ref{fig:dsa-internal-block}, which are mapped memory regions. Work is submitted by writing a descriptor to one of these portals. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue, of which there are two types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes, and each group may have only one attached. A \gls{pcie-dmr}, which guarantees implicit synchronization, is generated via \gls{x86:enqcmd} and communicates with the device before writing \cite[Sec. 3.3.1]{intel:dsaspec}. This may result in a higher submission cost compared to the \gls{dsa:dwq}, to which a descriptor is submitted via \gls{x86:movdir64b} \cite[Sec. 3.3.2]{intel:dsaspec}. \par
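To illustrate the submission path, the following sketch builds a single memory-move task descriptor and writes it to an already mapped \gls{dsa:dwq} portal using \gls{x86:movdir64b}. It is a minimal example and not taken from the specification: the descriptor and flag names follow the Linux uapi header \texttt{linux/idxd.h}, the portal pointer is assumed to have been obtained as described in the software view below, and for a \gls{dsa:swq} the final write would use \gls{x86:enqcmd} and check its return status instead. \par
\begin{verbatim}
#include <cstdint>
#include <immintrin.h>   // _movdir64b, _mm_pause; compile with -mmovdir64b
#include <linux/idxd.h>  // struct dsa_hw_desc, struct dsa_completion_record

// Sketch: submit one memory-move descriptor to a mapped dedicated WQ portal.
// 'portal' is the mmap'ed portal page, 'src'/'dst' are process-virtual addresses.
int dsa_memmove(void *portal, void *dst, const void *src, uint32_t size)
{
    alignas(32) struct dsa_completion_record comp = {};  // record must be 32-byte aligned
    struct dsa_hw_desc desc = {};

    desc.opcode          = DSA_OPCODE_MEMMOVE;
    desc.flags           = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;  // request a completion record
    desc.src_addr        = reinterpret_cast<uintptr_t>(src);
    desc.dst_addr        = reinterpret_cast<uintptr_t>(dst);
    desc.xfer_size       = size;
    desc.completion_addr = reinterpret_cast<uintptr_t>(&comp);

    _movdir64b(portal, &desc);           // 64-byte store of the descriptor to the portal

    volatile uint8_t *status = &comp.status;
    while (*status == 0)                 // poll until the engine writes the record
        _mm_pause();

    return *status == DSA_COMP_SUCCESS ? 0 : -1;
}
\end{verbatim}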
To handle the different descriptors, each \gls{dsa:engine} has two internal execution paths. One for a task and the other for a batch descriptor. Processing a task descriptor is straightforward, as all information required to complete the operation are contained within. For a batch, the \gls{dsa} first reads the batch descriptor, then fetches all task descriptors for the batch from memory and processes them. An \gls{dsa:engine} can also trigger a page fault when trying to access an unloaded page and wait on its completion, if configured to do so. Otherwise, an error will be generated in this scenario. \par
To handle the different descriptors, each \gls{dsa:engine} has two internal execution paths: one for task and one for batch descriptors. Processing a task descriptor is straightforward, as all information required to complete the operation is contained within it. For a batch, the \gls{dsa} reads the batch descriptor, then fetches all task descriptors for the batch from memory and processes them \cite[Sec. 3.8]{intel:dsaspec}. An \gls{dsa:engine} can coordinate with the operating system in case it encounters a page fault, waiting for its resolution if configured to do so; otherwise, an error is generated in this scenario \cite[Sec. 2.2, Block on Fault]{intel:dsaspec}. \par
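As a sketch of the batch path, the snippet below points a batch descriptor at an in-memory array of task descriptors; the field names again follow the Linux uapi header \texttt{linux/idxd.h} (and may differ between kernel versions), while submission and completion handling are the same as for a single task descriptor. \par
\begin{verbatim}
#include <cstdint>
#include <linux/idxd.h>  // struct dsa_hw_desc, DSA_OPCODE_BATCH

// Sketch: a batch descriptor referencing 'count' task descriptors in 'tasks'.
// Only the batch descriptor is written to the portal; the DSA then fetches
// and processes the task descriptors from memory.
struct dsa_hw_desc make_batch(struct dsa_hw_desc *tasks, uint32_t count,
                              struct dsa_completion_record *comp)
{
    struct dsa_hw_desc batch = {};
    batch.opcode          = DSA_OPCODE_BATCH;
    batch.flags           = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
    batch.desc_list_addr  = reinterpret_cast<uintptr_t>(tasks);  // address of the task array
    batch.desc_count      = count;                               // number of task descriptors
    batch.completion_addr = reinterpret_cast<uintptr_t>(comp);
    return batch;
}
\end{verbatim}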
Ordering of operations is only guaranteed for a configuration with one \gls{dsa:wq} and one \gls{dsa:engine} in a \gls{dsa:group} when submitting exclusively batch or task descriptors but no mixture. Even then, only write-ordering is guaranteed, meaning that \enquote{reads by a subsequent descriptor can pass writes from a previous descriptor} \cite[30]{intel:dsaspec}. A different issue arises, should an operation fail: the \gls{dsa} will continue to process the following descriptors. Care must therefore be taken with read-after-write scenarios, either by waiting for a successful completion before submitting the dependant, inserting a drain descriptor for tasks or setting the fence flag for a batch. The latter two methods tell the processing engine that all writes must be committed and, in case of the fence in a batch, abort on previous error. \par
Ordering of operations is only guaranteed for a configuration with one \gls{dsa:wq} and one \gls{dsa:engine} in a \gls{dsa:group}, and only when submitting exclusively batch or exclusively task descriptors, but not a mixture. Even then, only write-ordering is guaranteed, meaning that \enquote{reads by a subsequent descriptor can pass writes from a previous descriptor}. A different issue arises when an operation fails, as the \gls{dsa} will continue to process the following descriptors from the queue. Care must therefore be taken in read-after-write scenarios, either by waiting for successful completion before submitting the dependent descriptor, by inserting a drain descriptor for individually submitted tasks, or by setting the fence flag within a batch. The latter two methods tell the processing engine that all previous writes must be committed and, in the case of the fence in a batch, to abort on a previous error. \cite[Sec. 3.9]{intel:dsaspec} \par
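The following fragment sketches these two options; the flag and opcode names are taken from the Linux uapi header \texttt{linux/idxd.h}, and the surrounding submission code is omitted. \par
\begin{verbatim}
#include <linux/idxd.h>  // struct dsa_hw_desc, IDXD_OP_FLAG_FENCE, DSA_OPCODE_DRAIN

// Sketch: read-after-write dependency inside a batch. The fence flag makes
// the engine commit all writes of the preceding descriptors in the batch
// (and abort on a previous error) before starting this descriptor.
void mark_dependent(struct dsa_hw_desc *batch_element)
{
    batch_element->flags |= IDXD_OP_FLAG_FENCE;
}

// For individually submitted task descriptors, the equivalent is a drain
// descriptor (opcode DSA_OPCODE_DRAIN) submitted after the writes; the
// specification recommends an sfence/mfence on the CPU side before pushing it.
\end{verbatim}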
An important aspect of modern computer systems is the separation of address spaces through virtual memory. The \gls{dsa} must therefore handle address translation, as a process submitting a task will not know the physical location in memory which causes the descriptor to contain virtual values. For this, the \gls{dsa:engine} communicates with the \gls{iommu} and \gls{atc} to perform this operation. For this, knowledge about the submitting processes is required, and therefore each task descriptor has a field for the \gls{x86:pasid} which is filled by the \gls{x86:enqcmd} instruction for a \gls{dsa:swq} or set statically after a process is attached to a \gls{dsa:dwq}. \par
An important aspect of modern computer systems is the separation of address spaces through virtual memory. Therefore, the \gls{dsa} must handle address translation, as a process submitting a task will not know the physical location in memory, which causes the descriptor to contain virtual addresses. To resolve these, the \gls{dsa:engine} communicates with the \gls{iommu} and \gls{atc}, visible in the outward connections at the top of Figure \ref{fig:dsa-internal-block}. This requires knowledge about the submitting process, and therefore each task descriptor has a field for the \gls{x86:pasid}, which is filled by the \gls{x86:enqcmd} instruction for a \gls{dsa:swq} \cite[Sec. 3.3.1]{intel:dsaspec} or set statically after a process is attached to a \gls{dsa:dwq} \cite[Sec. 3.3.2]{intel:dsaspec}. \par
The completion of a descriptor may be signalled through a completion record and interrupt, if configured so. For this, the \gls{dsa} \enquote{provides two types of interrupt message storage: (1) an MSI-X table, enumerated through the MSI-X capability; and (2) a device-specific Interrupt Message Storage (IMS) table} \cite[27]{intel:dsaspec}. \par
The completion of a descriptor may be signalled through a completion record and an interrupt, if configured to do so. For this, the \gls{dsa} \enquote{provides two types of interrupt message storage: (1) an MSI-X table, enumerated through the MSI-X capability; and (2) a device-specific Interrupt Message Storage (IMS) table} \cite[Sec. 3.7]{intel:dsaspec}. \par
\subsection{Software View}
\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth]{images/dsa-software-architecture.png}
\caption{\\ \acrshort{dsa} Software View Block Diagram \\ Taken from Figure 1b of \cite{intel:analysis}}
\caption{DSA Software View \cite[Fig. 1 (b)]{intel:analysis}}
\label{fig:dsa-software-arch}
\end{figure}
Due to efforts by Intel programmers, since Linux Kernel 5.10 \cite[Installation Instructions]{intel:dmldoc}, there exists a driver for the \gls{dsa} \cite{intel:idxd-driver-repo} which has no counterpart in the Windows OS-Family \cite[Installation Instructions]{intel:dmldoc}, meaning code developed without an alternative path will not work there. To interface with the driver and perform configuration operations, Intel's libaccel-config \cite{intel:libaccel-config-repo} user space toolset may be used which provides a command-line interface and can read configuration files to set up the device as described previously. After successful configuration, each \gls{dsa:wq} is exposed as a character device by \texttt{mmap} of the associated portal \cite[3]{intel:analysis}. \par
Given the file permissions, it would now be possible for a process to submit work to the \gls{dsa} via either \gls{x86:movdir64b} or \gls{x86:enqcmd} instructions, providing the descriptors by manually configuring them. This, however, is quite cumbersome, which is why Intel's Data Mover Library \cite{intel:dmldoc} exists. With some limitations (like lacking support for \gls{dsa:dwq}s) this library presents a high-level interface that takes care of creation and submission of descriptors, some error handling and reporting. Thanks to the high-level-view the code may choose a different execution path at runtime which allows the memory operations to either be executed in hardware or software. The former on an accelerator or the latter using equivalent instructions provided by the library. This makes code based upon it automatically compatible with systems that do not provide hardware support. \par
Since Linux Kernel 5.10, there exists a driver for the \gls{dsa} \cite{intel:idxd-driver-repo} which has no counterpart in the Windows OS-Family \cite[Sec. Installation]{intel:dmldoc}, meaning that the \gls{dsa} is not accessible to user space applications under Windows. To interface with the driver and perform configuration operations, Intel's accel-config \cite{intel:libaccel-config-repo} user space toolset may be used, which provides a command-line interface and can read configuration files to set up the device as described previously; this is shown in the upper block titled \enquote{User space} in Figure \ref{fig:dsa-software-arch}. It interacts with the kernel driver, light green and labelled \enquote{IDXD} in Figure \ref{fig:dsa-software-arch}, to achieve this. After successful configuration, each \gls{dsa:wq} is exposed as a character device through which the associated portal can be mapped into user space via \texttt{mmap} \cite[Sec. 3.3]{intel:analysis}. \par
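A minimal sketch of this last step, assuming a \gls{dsa:wq} configured for user space access and exposed as \texttt{/dev/dsa/wq0.0} (the actual path depends on the chosen configuration): \par
\begin{verbatim}
#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap
#include <unistd.h>    // close, sysconf

// Sketch: map the portal of a configured work queue into the process.
void *map_wq_portal(const char *path)   // e.g. "/dev/dsa/wq0.0"
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return nullptr;

    // The portal occupies one page; descriptors are later written to this
    // mapping via ENQCMD (shared WQ) or MOVDIR64B (dedicated WQ).
    void *portal = mmap(nullptr, sysconf(_SC_PAGESIZE), PROT_WRITE,
                        MAP_SHARED | MAP_POPULATE, fd, 0);
    close(fd);
    return portal == MAP_FAILED ? nullptr : portal;
}
\end{verbatim}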
\todo{finish this section}
Given the file permissions, it would now be possible for a process to submit work to the \gls{dsa} via either the \gls{x86:movdir64b} or \gls{x86:enqcmd} instruction, constructing and providing the descriptors manually. This, however, is quite cumbersome, which is why Intel's Data Mover Library \cite{intel:dmldoc} exists. \par
\begin{itemize}
\item drain descriptor / drain command signals completion of preceding descriptors for fencing in non-batch submissions, in batches the ``fence flag'' can be used to ensure ordering, failures before a fence will lead to the following descriptors being aborted \cite[30]{intel:dsaspec}, \texttt{sfence} or \texttt{mfence} should be executed before pushing drain descriptor \cite[32]{intel:dsaspec}
\item cache control flag in descriptor controls whether writes are directed to cache or to memory \cite[31]{intel:dsaspec} effects on copy from DRAM > HBM unknown
\end{itemize}
With some limitations, such as lacking support for \gls{dsa:dwq} submission, this library presents a high-level interface that takes care of the creation and submission of descriptors, as well as some error handling and reporting. Thanks to this high-level view, the code may choose a different execution path at runtime, allowing the memory operations to be executed either in hardware on an accelerator or in software using equivalent instructions provided by the library. This makes code based upon it automatically compatible with systems that do not provide hardware support. \cite[Sec. Introduction]{intel:dmldoc} \par
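A minimal sketch of a copy through this high-level interface, assuming the C++ API described in the library documentation \cite{intel:dmldoc}; the execution path template parameter selects between hardware offload and the software fallback, and exact names may differ between library versions. \par
\begin{verbatim}
#include <cstdint>
#include <vector>
#include <dml/dml.hpp>  // Intel Data Mover Library, high-level C++ API

// Sketch: copy a buffer either on the DSA (dml::hardware) or with the
// library's software fallback (dml::software); dml::automatic would let
// the library decide at runtime.
template <typename path>
bool copy_buffer(std::vector<std::uint8_t> &src, std::vector<std::uint8_t> &dst)
{
    auto result = dml::execute<path>(
        dml::mem_move,
        dml::make_view(src.data(), src.size()),
        dml::make_view(dst.data(), dst.size()));

    return result.status == dml::status_code::ok;
}

// Usage: copy_buffer<dml::hardware>(src, dst) or copy_buffer<dml::software>(src, dst)
\end{verbatim}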
\subsection{Programming Interface}
