\glsentrylong{hbm} is a novel memory technology promising an increase in peak bandwidth. It is composed of stacked DRAM dies \cite[p. 1]{hbm-arch-paper} and is slowly being integrated into server processors, notably the Intel® Xeon® Max Series \cite{intel:xeonmaxbrief}. \gls{hbm} on these systems may be configured in different memory modes, most notably HBM Flat Mode and HBM Cache Mode \cite{intel:xeonmaxbrief}. The former gives applications direct control and requires code changes, while the latter utilizes the \gls{hbm} as a cache for the system's DDR-based main memory \cite{intel:xeonmaxbrief}. \par
With some limitations, such as lacking support for \gls{dsa:dwq} submission, this library presents a high-level interface that takes care of the creation and submission of descriptors as well as some error handling and reporting. Thanks to this high-level view, the library may choose a different execution path at runtime, allowing the memory operations to be executed either in hardware, on an accelerator, or in software, using equivalent instructions provided by the library. This makes code based upon it automatically compatible with systems that do not provide hardware support. \cite[Sec. Introduction]{intel:dmldoc}\par
\section{Programming Interface}
As mentioned in Subsection \ref{subsec:state:dsa-software-view}, \gls{intel:dml} provides a high-level interface for interacting with the hardware accelerator, namely Intel \gls{dsa}. We choose the C++ interface and will now demonstrate its usage by the example of a simple memcpy implementation for the \gls{dsa}. \par
In the function header of Figure \ref{fig:dml-memcpy} we notice two differences when comparing with the standard memcpy. The first is the template parameter named \texttt{path} and the second is the additional parameter \texttt{int node}, which we will discuss later. \texttt{path} selects the executing device, which can be the CPU or \gls{dsa}, giving the choice between \texttt{dml::software} (CPU), \texttt{dml::hardware} (\gls{dsa}) and \texttt{dml::automatic}, where the latter dynamically selects the device at runtime, preferring \gls{dsa} over CPU execution. \cite[Sec. Quick Start]{intel:dmldoc}\par
Choosing the engine which carries out the copy may be advantageous for performance, as we will see in Subsection \ref{sec:perf:datacopy}. With the engine directly tied to a CPU node, as observed in Subsection \ref{subsection:dsa-hwarch}, the CPU Node ID is equivalent to the ID of the \gls{dsa}. As the library has limited NUMA support and therefore only utilizes the \gls{dsa} device on the node to which the current thread is assigned, we must assign the currently running thread to the node on which the desired \gls{dsa} resides. This is the reason for adding the parameter \texttt{int node}, which is used in the first step of Figure \ref{fig:dml-memcpy}, where we manually set the node assignment accordingly using \texttt{numa\_run\_on\_node(node)}; more information may be obtained from the respective manpage of libnuma \cite{man-libnuma}. \par
\gls{intel:dml} operates on so-called data views, which we must create from the given pointers and size in order to indicate data locations to the library. This is done using \texttt{dml::make\_view(uint8\_t* ptr, size\_t size)}, with which we create views for both source and destination, labelled \texttt{src\_view} and \texttt{dst\_view} in Figure \ref{fig:dml-memcpy}. \cite[Sec. High-level C++ API, Make view]{intel:dmldoc}\par
For submission, we chose the asynchronous style with a single descriptor in Figure \ref{fig:dml-memcpy}. This uses the function \texttt{dml::submit<path>}, which takes an operation and operation-specific parameters and returns a handler to the submitted task, which can later be queried for completion of the operation. Passing the source and destination views together with the operation \texttt{dml::mem\_copy}, one aspect of the call stands out: the addition of \texttt{.block\_on\_fault()} to the operation specifier, which causes the submitted operation to handle a page fault by coordinating with the operating system. This only works if the device is configured to accept this flag. \cite[Sec. High-level C++ API, How to Use the Library]{intel:dmldoc}\cite[Sec. High-level C++ API, Page Fault handling]{intel:dmldoc}\par
After submission, we poll for task completion with \texttt{handler.get()} in Figure \ref{fig:dml-memcpy} and check whether the operation completed successfully. \par
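To summarize the steps discussed above, the following sketch assembles them into a single function. It is merely illustrative: the function name \texttt{dml\_memcpy} and the error handling are our own choices, and the exact code shown in Figure \ref{fig:dml-memcpy} may differ in detail.

\begin{verbatim}
#include <cstdint>
#include <cstddef>
#include <stdexcept>
#include <numa.h>      // libnuma, provides numa_run_on_node
#include <dml/dml.hpp> // Intel DML high-level C++ API

template <typename path>
void dml_memcpy(uint8_t* dst, uint8_t* src, size_t size, int node) {
  // Step 1: pin the calling thread to the node of the desired
  // DSA, as the library only uses the device local to the
  // executing thread.
  if (numa_run_on_node(node) != 0)
    throw std::runtime_error("numa_run_on_node failed");

  // Step 2: wrap the raw pointers into data views.
  auto src_view = dml::make_view(src, size);
  auto dst_view = dml::make_view(dst, size);

  // Step 3: asynchronously submit a single descriptor;
  // block_on_fault() lets the operation resolve page faults
  // in coordination with the operating system.
  auto handler = dml::submit<path>(
      dml::mem_copy.block_on_fault(), src_view, dst_view);

  // Step 4: poll for completion and verify success.
  auto result = handler.get();
  if (result.status != dml::status_code::ok)
    throw std::runtime_error("dml::mem_copy failed");
}
\end{verbatim}

A call such as \texttt{dml\_memcpy<dml::hardware>(dst, src, size, 0)} would then execute the copy on the \gls{dsa} of node 0; linking against libnuma and the \gls{intel:dml} library is required. \par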
\section{System Setup and Configuration}\label{sec:state:setup-and-config}
In this section, we provide a brief step-by-step list of setup instructions for replicating the configuration used for benchmarking and testing in the following chapters. We found Intel's guide on \gls{dsa} usage useful, but consulted articles on setup for Lenovo ThinkSystem servers for crucial information not present in the former. Instructions for configuring the HBM access mode, as mentioned in Section \ref{sec:state:hbm}, may vary from system to system and can require extra steps not found in the list below. \par
\begin{enumerate}
\item Set \enquote{Memory Hierarchy} to Flat in BIOS \\\cite[Sec. Configuring HBM, Configuring Flat Mode]{lenovo:hbm}
\item Set \enquote{VT-d} to Enabled in BIOS \cite[Sec. 2.1]{intel:dsaguide}
\item If available, set the option \enquote{Limit CPU PA to 46 bits} to Disabled in BIOS \\\cite[p. 5]{lenovo:dsa}
\item Use a kernel with IDXD driver support, available from Linux 5.10 or later \\\cite[Sec. Installation]{intel:dmldoc}
\item Append the following to the kernel boot parameters in grub config: \\\texttt{intel\_iommu=on,sm\_on}\cite[p. 5]{lenovo:dsa}
\item Evaluate correct detection of the \gls{dsa} devices using \texttt{dmesg | grep idxd}, which should list as many devices as there are NUMA nodes on the system \cite[p. 5]{lenovo:dsa}
\item Install \texttt{accel-config} from its repository \cite{intel:libaccel-config-repo} or through a package manager
\item Inspect the not yet configured \gls{dsa} devices using \texttt{accel-config list -i}\\\cite[p. 6]{lenovo:dsa}
\item Create a \gls{dsa} configuration file, which may be based upon the one used for most benchmarks, available under \texttt{benchmarks/configuration-files/8n1d1e1w.conf} in the accompanying repository \cite{thesis-repo}; a sketch of the general shape of such a file follows this list
\item Apply the configuration file from the previous step using \texttt{accel-config load-config -c [filename] -e} \cite[Fig. 3-9]{intel:dsaguide}
\item Inspect the now configured \gls{dsa} devices using \texttt{accel-config list} \cite[p. 7]{lenovo:dsa}; the output should match the desired configuration set in the file provided to \texttt{accel-config load-config}
\end{enumerate}
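For orientation, the following sketch shows the general shape of such a configuration file for a single device, with one group containing one dedicated workqueue and one engine. It follows the JSON format consumed by \texttt{accel-config load-config}; the exact field names and values here are assumptions and should be taken from the file in the accompanying repository \cite{thesis-repo} or from the output of \texttt{accel-config save-config} on a configured system.

\begin{verbatim}
[
  {
    "dev": "dsa0",
    "groups": [
      {
        "dev": "group0.0",
        "grouped_workqueues": [
          {
            "dev": "wq0.0",
            "mode": "dedicated",
            "size": 16,
            "group_id": 0,
            "priority": 10,
            "block_on_fault": 1,
            "type": "user",
            "name": "benchmark",
            "driver_name": "user"
          }
        ],
        "grouped_engines": [
          { "dev": "engine0.0", "group_id": 0 }
        ]
      }
    ]
  }
]
\end{verbatim}

Note that \texttt{block\_on\_fault} must be enabled here for the \texttt{.block\_on\_fault()} submission flag from Section \ref{sec:state:setup-and-config} onwards to take effect. \par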