From 711b29e74c6b9e89a9ea77a7539769a441598572 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Constantin=20F=C3=BCrst?= Date: Sun, 21 Jan 2024 16:36:37 +0100 Subject: [PATCH] finish sections on system setup and dml interface for chapter 2 --- thesis/content/20_state.tex | 40 ++++++++++++++++++------------------- thesis/own.bib | 21 +++++++++++++++++++ 2 files changed, 41 insertions(+), 20 deletions(-) diff --git a/thesis/content/20_state.tex b/thesis/content/20_state.tex index 48fb42a..cfc09d8 100644 --- a/thesis/content/20_state.tex +++ b/thesis/content/20_state.tex @@ -32,6 +32,7 @@ \todo{write introductory paragraph} \section{High Bandwidth Memory} +\label{sec:state:hbm} \glsentrylong{hbm} is a novel memory technology promising an increase in peak bandwidth. It is composed of stacked DRAM dies \cite[p. 1]{hbm-arch-paper} and is slowly being integrated into server processors, notably the Intel® Xeon® Max Series \cite{intel:xeonmaxbrief}. \gls{hbm} on these systems may be configured in different memory modes, most notably HBM Flat Mode and HBM Cache Mode \cite{intel:xeonmaxbrief}. The former gives applications direct control and requires code changes, while the latter utilizes the \gls{hbm} as a cache for the system's DDR-based main memory \cite{intel:xeonmaxbrief}. \par @@ -84,7 +85,7 @@ Given the file permissions, it would now be possible for a process to submit wor With some limitations, like lacking support for \gls{dsa:dwq} submission, this library presents a high-level interface that takes care of the creation and submission of descriptors, as well as some error handling and reporting. Thanks to this high-level view, the code may choose a different execution path at runtime, allowing the memory operations to be executed either in hardware or in software: the former on an accelerator, the latter using equivalent instructions provided by the library. This makes code based upon it automatically compatible with systems that do not provide hardware support. \cite[Sec. 
Introduction]{intel:dmldoc} \par -\subsection{Programming Interface} +\section{Programming Interface} As mentioned in Subsection \ref{subsec:state:dsa-software-view}, \gls{intel:dml} provides a high-level interface for interacting with the hardware accelerator, namely Intel \gls{dsa}. We choose to use the C++ interface and will now demonstrate its usage with the example of a simple memcpy implementation for the \gls{dsa}. \par @@ -97,32 +98,31 @@ As mentioned in Subsection \ref{subsec:state:dsa-software-view}, \gls{intel:dml} In the function header of Figure \ref{fig:dml-memcpy} we notice two differences when comparing with a standard memcpy: the first is the template parameter named \texttt{path} and the second is the additional parameter \texttt{int node}, which we will discuss later. With \texttt{path}, the executing device, either CPU or \gls{dsa}, is selected, giving the option between \texttt{dml::software} (CPU), \texttt{dml::hardware} (\gls{dsa}) and \texttt{dml::automatic}, where the latter dynamically selects the device at runtime, preferring \gls{dsa} over CPU execution. \cite[Sec. Quick Start]{intel:dmldoc} \par -Choosing the engine which carries out the copy might be advantageous for performance, as we can see in Subsection \ref{sec:perf:datacopy}. With the engine directly tied to the CPU node, as observed in Subsection \ref{subsection:dsa-hwarch}, the CPU Node ID is equivalent to the ID of the \gls{dsa}. As the library has limited NUMA support and therefore only utilizes the \gls{dsa} device on the node which the current thread is assigned to, we must assign the currently running thread to the node in which the desired \gls{dsa} resides. 
This is the reason for adding the parameter \texttt{int node}, which is used in the first step of Figure \ref{fig:dml-memcpy}, where we manually set the node assignment according to it, using \texttt{numa_run_on_node(node)} for which more information may be obtained in the respective manpage of libnuma \cite{man-libnuma}. \par +Choosing the engine which carries out the copy might be advantageous for performance, as we can see in Subsection \ref{sec:perf:datacopy}. With the engine directly tied to the CPU node, as observed in Subsection \ref{subsection:dsa-hwarch}, the CPU Node ID is equivalent to the ID of the \gls{dsa}. As the library has limited NUMA support and therefore only utilizes the \gls{dsa} device on the node to which the current thread is assigned, we must assign the currently running thread to the node on which the desired \gls{dsa} resides. This is the reason for adding the parameter \texttt{int node}, which is used in the first step of Figure \ref{fig:dml-memcpy}, where we manually set the node assignment accordingly, using \texttt{numa\_run\_on\_node(node)}; more information may be obtained from the respective manpage of libnuma \cite{man-libnuma}. \par -\gls{intel:dml} operates on so-called data views which we must create from the given pointers and size in order to indicate data locations to the library. This is done using \texttt{dml::make_view(uint8_t* ptr, size_t size)} with which we create views for both source and destination, labled \texttt{src_view} and \texttt{dst_view} in Figure \ref{fig:dml-memcpy}. \cite[Sec. High-level C++ API, Make view]{intel:dmldoc} \par +\gls{intel:dml} operates on so-called data views, which we must create from the given pointers and size in order to indicate data locations to the library. This is done using \texttt{dml::make\_view(uint8\_t* ptr, size\_t size)}, with which we create views for both source and destination, labeled \texttt{src\_view} and \texttt{dst\_view} in Figure \ref{fig:dml-memcpy}. 
\cite[Sec. High-level C++ API, Make view]{intel:dmldoc} \par -For submission, we chose to use the asynchronous style of a single descriptor in Figure \ref{fig:dml-memcpy}. This uses the function \texttt{dml::submit}, which takes an operation and operation specific parameters in and returns a handler to the submitted task which can later be queried for completion of the operation. Passing the source and destination views, together with the operation \texttt{dml::mem_copy}, we again notice one thing sticking out of the call. This is the addition to the operation specifier \texttt{.block\_on\_fault()} which submits the operation, so that it will handle a page fault by coordinating with the operating system. This only works if the device is configured to accept this flag. \cite[Sec. High-level C++ API, How to Use the Library]{intel:dmldoc} \cite[Sec. High-level C++ API, Page Fault handling]{intel:dmldoc} \par +For submission, we choose the asynchronous style of a single descriptor in Figure \ref{fig:dml-memcpy}. This uses the function \texttt{dml::submit}, which takes an operation and operation-specific parameters and returns a handler to the submitted task, which can later be queried for completion of the operation. When passing the source and destination views together with the operation \texttt{dml::mem\_copy}, one detail stands out: the addition of \texttt{.block\_on\_fault()} to the operation specifier, which submits the operation so that page faults are handled by coordinating with the operating system. This only works if the device is configured to accept this flag. \cite[Sec. High-level C++ API, How to Use the Library]{intel:dmldoc} \cite[Sec. High-level C++ API, Page Fault handling]{intel:dmldoc} \par After submission, we poll for task completion with \texttt{handler.get()} in Figure \ref{fig:dml-memcpy} and check whether the operation completed successfully. 
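To make the flow described in this section concrete, the following is a minimal C++-style sketch of such a memcpy wrapper, assembled from the calls discussed above. It is illustrative only, not Figure \ref{fig:dml-memcpy} verbatim: it assumes \gls{intel:dml} and libnuma are installed, and the function name \texttt{dml\_memcpy} as well as the error handling are our own additions.

```cpp
#include <dml/dml.hpp>  // Intel DML high-level C++ API
#include <numa.h>       // libnuma, for numa_run_on_node

#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Copy `size` bytes from `src` to `dst`, executed on the device selected
// by the `path` template parameter (dml::software, dml::hardware or
// dml::automatic), using the DSA residing on NUMA node `node`.
template <typename path>
void dml_memcpy(uint8_t* dst, uint8_t* src, size_t size, int node) {
  // The library only utilizes the DSA on the current thread's node,
  // so first pin the calling thread to the node of the desired device.
  numa_run_on_node(node);

  // Wrap the raw pointers in data views understood by the library.
  auto src_view = dml::make_view(src, size);
  auto dst_view = dml::make_view(dst, size);

  // Submit asynchronously; block_on_fault() makes the operation resolve
  // page faults in coordination with the operating system (only if the
  // device is configured to accept this flag).
  auto handler = dml::submit<path>(dml::mem_copy.block_on_fault(),
                                   src_view, dst_view);

  // Poll for completion and check the resulting status code.
  auto result = handler.get();
  if (result.status != dml::status_code::ok) {
    throw std::runtime_error("DML mem_copy failed");
  }
}

// Usage sketch: dml_memcpy<dml::hardware>(dst, src, n, /*node=*/0);
```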
\par \section{System Setup and Configuration} \label{sec:state:setup-and-config} -\todo{write this section} - -Give the reader the tools to replicate the setup. Also explain why the BIOS-configs are required. - -Setup Requirements: -\begin{itemize} - \item VT-d enabled - \item limit CPUPA to 46 Bits disabled - \item IOMMU enabled - \item kernel with iommu and idxd driver support - \item kernel option "intel\_iommu=on,sm\_on" - \item numa nodes for hbm access in bios -\end{itemize} - -\cleardoublepage +In this section, we give a brief step-by-step list of setup instructions to replicate the configuration used for benchmarks and testing in the following chapters. We found Intel's guide on \gls{dsa} usage useful, but consulted Lenovo's articles on setup for ThinkSystem servers for crucial information not present in the former. Instructions for configuring the HBM access mode, as mentioned in Section \ref{sec:state:hbm}, may vary from system to system and can require extra steps not found in the list below. \par + +\begin{enumerate} + \item Set \enquote{Memory Hierarchy} to Flat in BIOS \\ \cite[Sec. Configuring HBM, Configuring Flat Mode]{lenovo:hbm} + \item Set \enquote{VT-d} to Enabled in BIOS \cite[Sec. 2.1]{intel:dsaguide} + \item If available, set the option \enquote{Limit CPU PA to 46 bits} to Disabled in BIOS \\ \cite[p. 5]{lenovo:dsa} + \item Use a kernel with IDXD driver support, available since Linux 5.10 \\ \cite[Sec. Installation]{intel:dmldoc} + \item Append the following to the kernel boot parameters in the GRUB configuration: \\ \texttt{intel\_iommu=on,sm\_on} \cite[p. 5]{lenovo:dsa} + \item Verify correct detection of the \gls{dsa} devices using \texttt{dmesg | grep idxd}, which should list as many devices as there are NUMA nodes on the system \cite[p. 
5]{lenovo:dsa} + \item Install \texttt{accel-config} from its repository \cite{intel:libaccel-config-repo} or via a package manager + \item Inspect the not-yet-configured \gls{dsa} devices using \texttt{accel-config list -i} \\ \cite[p. 6]{lenovo:dsa} + \item Create a \gls{dsa} configuration file, which may be based on the one used for most benchmarks, available under \texttt{benchmarks/configuration-files/8n1d1e1w.conf} in the accompanying repository \cite{thesis-repo} + \item Apply the configuration file from the previous step using \texttt{accel-config load-config -c [filename] -e} \cite[Fig. 3-9]{intel:dsaguide} + \item Inspect the now-configured \gls{dsa} devices using \texttt{accel-config list} \cite[p. 7]{lenovo:dsa}; the output should match the desired configuration set in the file provided to \texttt{accel-config load-config} +\end{enumerate} %%% Local Variables: %%% TeX-master: "diplom" diff --git a/thesis/own.bib b/thesis/own.bib index bc29b6b..34eecbf 100644 --- a/thesis/own.bib +++ b/thesis/own.bib @@ -177,3 +177,24 @@ urldate = {2024-01-21}, url = {https://manpages.debian.org/bookworm/libnuma-dev/numa.3.en.html} } + +@ONLINE{lenovo:dsa, + author = {Adrian Huang}, + title = {{Enabling Intel Data Streaming Accelerator on Lenovo ThinkSystem Servers}}, + urldate = {2022-04-18}, + url = {https://lenovopress.lenovo.com/lp1582.pdf} +} + +@ONLINE{thesis-repo, + author = {Anatol Constantin Fürst}, + title = {{Accompanying Thesis Repository}}, + url = {https://git.constantin-fuerst.com/constantin/bachelor-thesis} +} + +@ONLINE{lenovo:hbm, + author = {Sam Kuo and Jimmy Cheng}, + title = {{Implementing High Bandwidth Memory and Intel Xeon Processors Max Series on Lenovo ThinkSystem Servers}}, + date = {2023-06-26}, + url = {https://lenovopress.lenovo.com/lp1738.pdf}, + urldate = {2024-01-21} +} \ No newline at end of file
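Regarding the configuration file referenced in the setup list above: \texttt{accel-config} consumes JSON in the shape emitted by \texttt{accel-config save-config}. The following fragment is a hedged sketch of that general shape for a single device with one dedicated work queue and one engine in one group; the attribute names and values shown are assumptions that may differ between \texttt{accel-config} versions and from the file shipped in the accompanying repository.

```json
[
  {
    "dev": "dsa0",
    "groups": [
      {
        "dev": "group0.0",
        "grouped_workqueues": [
          {
            "dev": "wq0.0",
            "mode": "dedicated",
            "size": 16,
            "priority": 10,
            "block_on_fault": 1,
            "type": "user",
            "name": "benchmark",
            "driver_name": "user"
          }
        ],
        "grouped_engines": [
          { "dev": "engine0.0", "group_id": 0 }
        ]
      }
    ]
  }
]
```

Note that enabling \texttt{block\_on\_fault} on the work queue is what permits the \texttt{.block\_on\_fault()} submission flag discussed in the Programming Interface section to take effect.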