diff --git a/thesis/content/20_state.tex b/thesis/content/20_state.tex index e125adc..48fb42a 100644 --- a/thesis/content/20_state.tex +++ b/thesis/content/20_state.tex @@ -47,7 +47,7 @@ Introduced with the \(4^{th}\) generation of Intel Xeon Scalable Processors, the \subsection{Hardware Architecture} \label{subsection:dsa-hwarch} -\begin{figure}[H] +\begin{figure}[h] \centering \includegraphics[width=0.9\textwidth]{images/dsa-internal-block-diagram.png} \caption{DSA Internal Architecture \cite[Fig. 1 (a)]{intel:analysis}} @@ -69,8 +69,9 @@ An important aspect of modern computer systems is the separation of address spac The completion of a descriptor may be signalled through a completion record and interrupt, if so configured. For this, the \gls{dsa} \enquote{provides two types of interrupt message storage: (1) an MSI-X table, enumerated through the MSI-X capability; and (2) a device-specific Interrupt Message Storage (IMS) table} \cite[Sec. 3.7]{intel:dsaspec}. \par \subsection{Software View} +\label{subsec:state:dsa-software-view} -\begin{figure}[H] +\begin{figure}[h] \centering \includegraphics[width=0.5\textwidth]{images/dsa-software-architecture.png} \caption{DSA Software View \cite[Fig. 1 (b)]{intel:analysis}} @@ -79,19 +80,30 @@ The completion of a descriptor may be signalled through a completion record and Since Linux Kernel 5.10, there exists a driver for the \gls{dsa} which has no counterpart in the Windows OS-Family \cite[Sec. Installation]{intel:dmldoc}, meaning that, under Windows, the \gls{dsa} is not accessible to user space applications. To interface with the driver and perform configuration operations, Intel's accel-config \cite{intel:libaccel-config-repo} user space toolset may be used, which provides a command-line interface and can read configuration files to set up the device as described previously. This can be seen in the upper block titled \enquote{User space} in Figure \ref{fig:dsa-software-arch}. 
It interacts with the kernel driver, light green and labelled \enquote{IDXD} in Figure \ref{fig:dsa-software-arch}, to achieve this. After successful configuration, each \gls{dsa:wq} is exposed as a character device, through which the associated portal can be mapped into user space via \texttt{mmap} \cite[Sec. 3.3]{intel:analysis}. \par -Given the file permissions, it would now be possible for a process to submit work to the \gls{dsa} via either \gls{x86:movdir64b} or \gls{x86:enqcmd} instructions, providing the descriptors by manually configuring them. This, however, is quite cumbersome, which is why Intel's Data Mover Library exists. \par +Given the file permissions, it would now be possible for a process to submit work to the \gls{dsa} via either \gls{x86:movdir64b} or \gls{x86:enqcmd} instructions, providing the descriptors by configuring them manually. This, however, is quite cumbersome, which is why \gls{intel:dml} exists. \par With some limitations, such as lacking support for \gls{dsa:dwq} submission, this library presents a high-level interface that takes care of creation and submission of descriptors, as well as some error handling and reporting. Thanks to this high-level view, the code may choose a different execution path at runtime, which allows the memory operations to be executed either in hardware on an accelerator or in software using equivalent instructions provided by the library. This makes code based upon it automatically compatible with systems that do not provide hardware support. \cite[Sec. Introduction]{intel:dmldoc} \par \subsection{Programming Interface} -\todo{write this section} +As mentioned in Subsection \ref{subsec:state:dsa-software-view}, \gls{intel:dml} provides a high-level interface for interacting with the hardware accelerator, namely the Intel \gls{dsa}. We choose the C++ interface and will now demonstrate its usage with the example of a simple \texttt{memcpy} implementation for the \gls{dsa}. 
\par -\begin{itemize} - \item choice is intel data mover library - \item two concepts, state-based for c-api and operation-based c++ - \item just explain the basics (no code) and refer to dml documentation -\end{itemize} +\begin{figure}[h] + \centering + \includegraphics[width=0.9\textwidth]{images/structo-dmlmemcpy.png} + \caption{DML Memcpy Implementation} + \label{fig:dml-memcpy} +\end{figure} + +In the function header shown in Figure \ref{fig:dml-memcpy} we notice two differences compared with the standard \texttt{memcpy} signature. The first is the template parameter \texttt{path}, the second the additional parameter \texttt{int node}, which we will discuss later. \texttt{path} selects the executing device, which can be the CPU or the \gls{dsa}, giving the choice between \texttt{dml::software} (CPU), \texttt{dml::hardware} (\gls{dsa}) and \texttt{dml::automatic}, where the latter dynamically selects the device at runtime, preferring \gls{dsa} over CPU execution. \cite[Sec. Quick Start]{intel:dmldoc} \par + +Choosing the engine which carries out the copy can be advantageous for performance, as we will see in Section \ref{sec:perf:datacopy}. Since each engine is directly tied to a CPU node, as observed in Subsection \ref{subsection:dsa-hwarch}, the CPU node ID is equivalent to the ID of the \gls{dsa}. As the library has limited NUMA support and therefore only utilizes the \gls{dsa} device on the node to which the current thread is assigned, we must assign the currently running thread to the node on which the desired \gls{dsa} resides. This is the reason for the parameter \texttt{int node}, which is used in the first step of Figure \ref{fig:dml-memcpy}: we pin the current thread to the given node using \texttt{numa\_run\_on\_node(node)}; further information may be obtained from the libnuma manpage \cite{man-libnuma}. 
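The node pinning step just described can be sketched in plain C; this is a minimal sketch assuming a Linux system with libnuma installed (link with -lnuma), and the helper name pin_to_node is our own:

```c
/* Hedged sketch of pinning the calling thread to a NUMA node with libnuma.
 * Assumes Linux with libnuma available; link with -lnuma.
 * The helper name pin_to_node is our own invention. */
#include <numa.h>
#include <stdio.h>

int pin_to_node(int node) {
    if (numa_available() < 0) {
        /* libnuma reports that NUMA is unusable on this system */
        fprintf(stderr, "libnuma: NUMA not available\n");
        return -1;
    }
    /* restrict execution of the current thread to CPUs of the given node;
     * returns 0 on success, -1 on error (see numa(3)) */
    return numa_run_on_node(node);
}
```

After this call, work submitted through the library from this thread is carried out by the \gls{dsa} engine local to the chosen node.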
\par + +\gls{intel:dml} operates on so-called data views, which we must create from the given pointers and size in order to indicate data locations to the library. This is done using \texttt{dml::make\_view(uint8\_t* ptr, size\_t size)}, with which we create views for both source and destination, labelled \texttt{src\_view} and \texttt{dst\_view} in Figure \ref{fig:dml-memcpy}. \cite[Sec. High-level C++ API, Make view]{intel:dmldoc} \par + +For submission, we chose the asynchronous style with a single descriptor in Figure \ref{fig:dml-memcpy}. This uses the function \texttt{dml::submit}, which takes an operation and operation-specific parameters and returns a handler to the submitted task that can later be queried for completion of the operation. When passing the source and destination views together with the operation \texttt{dml::mem\_copy}, one addition stands out: the operation specifier \texttt{.block\_on\_fault()}, which causes the submitted operation to handle page faults by coordinating with the operating system. This only works if the device is configured to accept this flag. \cite[Sec. High-level C++ API, How to Use the Library]{intel:dmldoc} \cite[Sec. High-level C++ API, Page Fault handling]{intel:dmldoc} \par + +After submission, we wait for completion of the task with \texttt{handler.get()} in Figure \ref{fig:dml-memcpy} and check whether the operation completed successfully. 
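The steps described above can be condensed into a hedged C++ sketch; it assumes the Intel DML headers and libnuma are available, the function name dml_memcpy and its exact structure are our own, and only the library calls named in the text are used:

```cpp
// Hedged sketch of the memcpy implementation described above.
// Assumes Intel DML (<dml/dml.hpp>) and libnuma (<numa.h>) are installed;
// the function name dml_memcpy and its structure are our own.
#include <dml/dml.hpp>
#include <numa.h>
#include <cstdint>
#include <cstddef>

template <typename path>
bool dml_memcpy(uint8_t* dst, uint8_t* src, size_t size, int node) {
    // pin the calling thread to the node of the desired DSA engine,
    // since the library only uses the accelerator local to the current node
    numa_run_on_node(node);

    // wrap the raw pointers in DML data views
    auto src_view = dml::make_view(src, size);
    auto dst_view = dml::make_view(dst, size);

    // asynchronously submit a single mem_copy descriptor;
    // block_on_fault lets the device resolve page faults with the OS
    auto handler = dml::submit<path>(
        dml::mem_copy.block_on_fault(), src_view, dst_view);

    // wait for completion and check the resulting status code
    auto result = handler.get();
    return result.status == dml::status_code::ok;
}
```

A call such as dml_memcpy<dml::hardware>(dst, src, size, 0) would then execute the copy on the \gls{dsa} of node 0, while dml::software falls back to CPU instructions.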
\par \section{System Setup and Configuration} \label{sec:state:setup-and-config} diff --git a/thesis/images/structo-dmlmemcpy.nsd b/thesis/images/structo-dmlmemcpy.nsd new file mode 100644 index 0000000..86d61b0 --- /dev/null +++ b/thesis/images/structo-dmlmemcpy.nsd @@ -0,0 +1,10 @@ + + + + + + + + + + \ No newline at end of file diff --git a/thesis/images/structo-dmlmemcpy.png b/thesis/images/structo-dmlmemcpy.png new file mode 100644 index 0000000..dea8188 Binary files /dev/null and b/thesis/images/structo-dmlmemcpy.png differ diff --git a/thesis/own.bib b/thesis/own.bib index ff826c3..bc29b6b 100644 --- a/thesis/own.bib +++ b/thesis/own.bib @@ -169,4 +169,11 @@ number={}, pages={1-4}, doi={10.1109/IMW.2017.7939084} -} \ No newline at end of file +} + +@misc{man-libnuma, + author = {Debian}, + title = {{Debian manpage 3 for libnuma-dev}}, + urldate = {2024-01-21}, + url = {https://manpages.debian.org/bookworm/libnuma-dev/numa.3.en.html} +}