fix missing detail on the first processor with dsa, add todos and clear the bullet points of stuff that has already been added to the text - all in technical background chapter
% This chapter is usually written first and is the easiest
% (or the hardest, because it comes first).
\section{Intel DSA Background}
\blockquote{Intel \gls{dsa} is a high-performance data copy and transformation accelerator that will be integrated in future Intel® processors, targeted for optimizing streaming data movement and transformation operations common with applications for high-performance storage, networking, persistent memory, and various data processing applications. \cite[15]{intel:dsaspec}}
Introduced with the 4th generation of Intel Xeon Scalable Processors \cite{intel:xeonbrief}, the \gls{dsa} promises to relieve the CPU of \enquote{common storage functions and operations such as data integrity checks and deduplication} \cite{intel:xeonbrief}. This chapter gives an overview of the architecture, the software, and the interaction between the two. It familiarizes the reader with the setup and provides the knowledge needed to configure the system for a specific use case.
\todo{consider adding projected use cases as in the architecture specification here}
\section{Architecture}
Making informed decisions about how to use the hardware optimally requires knowledge of its workings. Therefore, this section describes both the workings of the \gls{dsa} engine itself (referred to as internal architecture) and the way it integrates with the rest of the processor (external architecture). All statements are based on Chapter 3 of the Architecture Specification by Intel \cite{intel:dsaspec}.\par
As the accelerator is integrated directly into the CPU, a system with multiple processors, as is common in servers, will also have multiple \glspl{dsa}. These engines are accessible via the CPU's I/O fabric as a PCIe device and submit memory requests through this bus directly to the \gls{iommu}. Low-level configuration of the device is done through memory-mapped I/O registers whose location is set through the \gls{bar}, which also sets the location of the work submission portals. Through these portals, so-called work descriptors are handed over to the device for processing.\par
\begin{itemize}
	\item \gls{dsa} is directly linked to the \gls{iommu} to submit memory requests \cite[22]{intel:dsaspec}
	\item \gls{dsa} is controlled through commands sent over the I/O fabric and is exposed as a PCIe-like device \cite[21]{intel:dsaspec}
	\item configuration is done via memory-mapped I/O registers configured through the PCIe BAR registers \cite[21]{intel:dsaspec}
	\item multiple engines per group (with a single WQ) may yield more performance by hiding the high latency of address translation \cite[25]{intel:dsaspec}
	\item a drain descriptor or drain command signals completion of all preceding descriptors and serves as a fence for non-batch submissions; within a batch, the \enquote{fence flag} can be used to ensure ordering, and a failure before a fence causes the following descriptors to be aborted \cite[30]{intel:dsaspec}; \texttt{sfence} or \texttt{mfence} should be executed before submitting a drain descriptor \cite[32]{intel:dsaspec}
	\item the cache control flag in the descriptor controls whether writes are directed to cache or to memory \cite[31]{intel:dsaspec}; the effect on copies from DRAM to HBM is unknown
	\item a shared WQ receives work via a \enquote{PCIe deferrable memory write request} to the portal, which removes the need to synchronize submissions but can cost more due to the communication overhead of posting the write request and waiting for it to be signalled as completed \cite[23]{intel:dsaspec}
	\item a dedicated WQ receives work via an atomic 64-byte write to the portal \cite[24]{intel:dsaspec}
	\item a dedicated WQ is configured by the driver with a specified PASID for address translation and cannot be shared by multiple clients \cite[24]{intel:dsaspec}
\end{itemize}
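
The fence rule for batches noted above can be illustrated with a small software model. The following is only a sketch under stated assumptions: it does not talk to a \gls{dsa}, and the names \texttt{ToyDescriptor}, \texttt{Status} and \texttt{run\_batch} are hypothetical and do not appear in the specification or in any driver. It processes the entries strictly in order, which the real device is not required to do, and captures only one aspect: once an earlier entry has failed, an entry carrying the fence flag and every entry after it is aborted.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    SUCCESS = "success"
    FAILED = "failed"
    ABORTED = "aborted"


@dataclass
class ToyDescriptor:
    """Hypothetical stand-in for one batch entry, not the real descriptor layout."""
    fence: bool = False      # models the fence flag of the entry
    will_fail: bool = False  # simulated outcome of the operation


def run_batch(batch):
    """Return one Status per entry, applying the abort-after-fence rule."""
    statuses = []
    failure_seen = False  # has any earlier entry failed?
    aborting = False      # set once a fence is reached after a failure
    for desc in batch:
        if desc.fence and failure_seen:
            aborting = True
        if aborting:
            statuses.append(Status.ABORTED)
            continue
        if desc.will_fail:
            failure_seen = True
            statuses.append(Status.FAILED)
        else:
            statuses.append(Status.SUCCESS)
    return statuses


# Assumed example batch: entry 1 fails, entry 3 carries the fence flag.
batch = [
    ToyDescriptor(),
    ToyDescriptor(will_fail=True),
    ToyDescriptor(),
    ToyDescriptor(fence=True),
    ToyDescriptor(),
]
for index, status in enumerate(run_batch(batch)):
    print(index, status.value)
```

In this model, entry 2 still completes although entry 1 failed, because no fence separates them; only the fenced entry 3 and its successor are aborted.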
\section{HW/SW Setup}
Give the reader the tools to replicate the setup.
Also explain why the BIOS configuration options are required.
@ -76,7 +70,13 @@ Software Access:
Explain how a piece of software may access the \gls{dsa}/WQ, how the drivers and dsa libraries enable this
and also how access policies are enforced.
\section{Microbenchmarks}
\todo{provide microbenchmarks with multiple configurations and for many use cases}
\section{Evaluation}
\todo{evaluate the benchmarks and conclude with projected use cases - may use the cases from dsaspec/guide}