fix some compilation-blocking syntax mistakes, improve usage of glossary with short descriptions, use glossary for titles and in figure captions with explicit long/short entries
\glsentrylong{hbm} is a novel memory technology that promises an increase in peak bandwidth. It consists of stacked \acrshort{dram} dies \cite[p. 1]{hbm-arch-paper} and is gradually being integrated into server processors, with the Intel® Xeon® Max Series \cite{intel:xeonmaxbrief} being one recent example. \gls{hbm} on these systems can be configured in different memory modes, most notably, HBM Flat Mode and HBM Cache Mode \cite{intel:xeonmaxbrief}. The former gives applications direct control, requiring code changes, while the latter utilizes the \gls{hbm} as a cache for the system's \gls{dram}-based main memory \cite{intel:xeonmaxbrief}. \par
\glsentrylong{hbm} is a novel memory technology that promises an increase in peak bandwidth. It consists of stacked \glsentryshort{dram} dies \cite[p. 1]{hbm-arch-paper} and is gradually being integrated into server processors, with the Intel® Xeon® Max Series \cite{intel:xeonmaxbrief} being one recent example. \gls{hbm} on these systems can be configured in different memory modes, most notably, HBM Flat Mode and HBM Cache Mode \cite{intel:xeonmaxbrief}. The former gives applications direct control, requiring code changes, while the latter utilizes the \gls{hbm} as a cache for the system's \glsentryshort{dram}-based main memory \cite{intel:xeonmaxbrief}. \par
\section{\glsentrylong{qdp}}
\todo{write this section}
\section{Intel Data Streaming Accelerator}
\section{\glsentrylong{dsa}}
\blockquote{Intel \gls{dsa} is a high-performance data copy and transformation accelerator that will be integrated in future Intel® processors, targeted for optimizing streaming data movement and transformation operations common with applications for high-performance storage, networking, persistent memory, and various data processing applications. \cite[Ch. 1]{intel:dsaspec}}
@ -48,10 +48,10 @@ Introduced with the \(4^{th}\) generation of Intel Xeon Scalable Processors, the
\caption{DSA Internal Architecture \cite[Fig. 1 (a)]{intel:analysis}. Shows the components that the chip is made up of, how they are connected and which outside components the DSA communicates with.}
\caption{\glsentrylong{dsa} Internal Architecture \cite[Fig. 1 (a)]{intel:analysis}. Shows the components that the chip is made up of, how they are connected and which outside components the \glsentryshort{dsa} communicates with.}
\label{fig:dsa-internal-block}
\end{figure}
@ -59,11 +59,11 @@ The \gls{dsa} chip is directly integrated into the processor and attaches via th
\subsubsection{Architectural Components}
\textrm{Component \rom{1}\glsentrylong{dsa:wq}:}\gls{dsa:wq}s provide the means to submit tasks to the device and will be described in more detail shortly. They are marked yellow in Figure \ref{fig:dsa-internal-block}. A \gls{dsa:wq} is accessible through so-called portals, light blue in Figure \ref{fig:dsa-internal-block}, which are mapped memory regions. Submission of work is done by writing a descriptor to one of these. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue. There are two possible queue types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes and each group may only have one attached. A \gls{pcie-dmr}, which guarantees implicit synchronization, is generated via \gls{x86:enqcmd} and communicates with the device before writing \cite[Sec. 3.3.1]{intel:dsaspec}. This may result in higher submission cost, compared to the \gls{dsa:dwq} to which a descriptor is submitted via \gls{x86:movdir64b}\cite[Sec. 3.3.2]{intel:dsaspec}. \par
\textsc{Component \rom{1} \glsentrylong{dsa:wq}:} \gls{dsa:wq}s provide the means to submit tasks to the device and will be described in more detail shortly. They are marked yellow in Figure \ref{fig:dsa-internal-block}. A \gls{dsa:wq} is accessible through so-called portals, light blue in Figure \ref{fig:dsa-internal-block}, which are mapped memory regions. Submission of work is done by writing a descriptor to one of these. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue. There are two possible queue types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes and each group may only have one attached. A \gls{pcie-dmr}, which guarantees implicit synchronization, is generated via \gls{x86:enqcmd} and communicates with the device before writing \cite[Sec. 3.3.1]{intel:dsaspec}. This may result in a higher submission cost compared to the \gls{dsa:dwq}, to which a descriptor is submitted via \gls{x86:movdir64b} \cite[Sec. 3.3.2]{intel:dsaspec}. \par
\textrm{Component \rom{2} Engine:} An Engine is the processing-block that connects to memory and performs the described task. To handle the different descriptors, each Engine has two internal execution paths. One for a task and the other for a batch descriptor. Processing a task descriptor is straightforward, as all information required to complete the operation are contained within \todo{cite this}. For a batch, the \gls{dsa} reads the batch descriptor, then fetches all task descriptors from memory and processes them \cite[Sec. 3.8]{intel:dsaspec}. An Engine can coordinate with the operating system in case it encounters a page fault, waiting on its resolution, if configured to do so, while otherwise, an error will be generated in this scenario \cite[Sec. 2.2, Block on Fault]{intel:dsaspec}. \par
\textsc{Component \rom{2} Engine:} An Engine is the processing block that connects to memory and performs the described task. To handle the different descriptors, each Engine has two internal execution paths: one for a task descriptor and the other for a batch descriptor. Processing a task descriptor is straightforward, as all information required to complete the operation is contained within it \todo{cite this}. For a batch, the \gls{dsa} reads the batch descriptor, then fetches all task descriptors from memory and processes them \cite[Sec. 3.8]{intel:dsaspec}. If configured to do so, an Engine can coordinate with the operating system when it encounters a page fault and wait for its resolution; otherwise, an error is generated in this scenario \cite[Sec. 2.2, Block on Fault]{intel:dsaspec}. \par
\textrm{Component \rom{3} Groups:} Groups tie Engines and \glsentrylong{dsa:wq}s together, indicated by the dotted blue line around the components of Group 0 in Figure \ref{fig:dsa-internal-block}. This means, that tasks from one \gls{dsa:wq} may be processed from multiple Engines and vice-versa, depending on the configuration. This flexibility is achieved through the Group Arbiter, represented by the orange block in Figure \ref{fig:dsa-internal-block}, which connects the two components according to the user-defined configuration. \par
\textsc{Component \rom{3} Groups:} Groups tie Engines and \glsentrylong{dsa:wq}s together, indicated by the dotted blue line around the components of Group 0 in Figure \ref{fig:dsa-internal-block}. This means that tasks from one \gls{dsa:wq} may be processed by multiple Engines and vice versa, depending on the configuration. This flexibility is achieved through the Group Arbiter, represented by the orange block in Figure \ref{fig:dsa-internal-block}, which connects the two components according to the user-defined configuration. \par
\subsubsection{Virtual Address Resolution}
@ -81,10 +81,10 @@ Ordering of operations is only guaranteed for a configuration with one \gls{dsa:
\caption{DSA Software View \cite[Fig. 1 (b)]{intel:analysis}. Illustrating the software stack and internal interactions from user applications, through the driver to the portal for work submission.}
\caption{\glsentrylong{dsa} Software View \cite[Fig. 1 (b)]{intel:analysis}. Illustrating the software stack and internal interactions from user applications, through the driver to the portal for work submission.}
\label{fig:dsa-software-arch}
\end{figure}
@ -94,15 +94,15 @@ With the appropriate file permissions, a process could submit work to the \gls{d
With some limitations, such as lacking support for \gls{dsa:dwq} submission, this library presents an interface that takes care of the creation and submission of descriptors as well as error handling and reporting. Thanks to this high-level view, the code may choose a different execution path at runtime, which allows the memory operations to be executed either in hardware, on an accelerator, or in software, using equivalent instructions provided by the library. This makes code using this library automatically compatible with systems that do not provide hardware support \cite[Sec. Introduction]{intel:dmldoc}. \par
\section{Programming Interface}
\section{Programming Interface for \glsentrylong{dsa}}
\label{sec:state:dml}
As mentioned in Subsection \ref{subsec:state:dsa-software-view}, \gls{intel:dml} offers a high-level interface for interacting with the hardware accelerator, specifically Intel \gls{dsa}. Opting for the C++ interface, we will now demonstrate its usage with the example of a simple memcopy implementation for the \gls{dsa}. \par
\caption{DML Memcpy Implementation Pseudocode. Performs copy operation of a block of memory from source to destination. The DSA executing this copy can be selected with the parameter \texttt{node}, and the template parameter \texttt{path} elects whether to run on hardware (DSA) or software (CPU).}
\caption{\glsentrylong{intel:dml} Memcpy Implementation Pseudocode. Performs copy operation of a block of memory from source to destination. The \glsentryshort{dsa} executing this copy can be selected with the parameter \texttt{node}, and the template parameter \texttt{path} selects whether to run on hardware (Intel \glsentryshort{dsa}) or software (CPU).}
\caption{Xeon Max NUMA-Node Layout \cite[Fig. 14]{intel:maxtuning} for a 2-Socket System when configured with HBM-Flat. Showing separate Node IDs for manual HBM access and for Cores and DRAM.}
\caption{Xeon Max \glsentrylong{numa:node} Layout \cite[Fig. 14]{intel:maxtuning} for a 2-Socket System when configured with HBM-Flat. Showing separate \glsentryshort{numa:node} IDs for manual \glsentryshort{hbm} access and for Cores and \glsentryshort{dram}.}
\label{fig:perf-xeonmaxnuma}
\end{figure}
The benchmarks were conducted on a dual-socket server equipped with two Intel Xeon Max 9468 CPUs, each with 4 nodes that have access to 16 GiB of \gls{hbm} and 12 cores. This results in a total of 96 cores and 128 GiB of \gls{hbm}. The layout of the system is visualized in Figure \ref{fig:perf-xeonmaxnuma}. For configuring it, we follow Section \ref{sec:state:setup-and-config}. \par
As \gls{intel:dml} does not have support for \acrshort{dsa:dwq}s, we run benchmarks exclusively with access through \acrshort{dsa:swq}s. The application written for the benchmarks can be obtained in source form under the directory \texttt{benchmarks} in the thesis repository \cite{thesis-repo}. \par
As \gls{intel:dml} does not have support for \glsentryshort{dsa:dwq}s, we run benchmarks exclusively with access through \glsentryshort{dsa:swq}s. The application written for the benchmarks can be obtained in source form under the directory \texttt{benchmarks} in the thesis repository \cite{thesis-repo}. \par
\caption{Benchmark Procedure Pseudocode. Timing marked with yellow background. Showing data allocation and the benchmarking loop for a single thread.}
@ -29,7 +29,7 @@ As \gls{intel:dml} does not have support for \acrshort{dsa:dwq}s, we run benchma
The benchmark performs node setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the nodes passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts. To get accurate results, the benchmark is repeated at least 100 times. Each iteration is timed from beginning to end, marked in yellow in Figure \ref{fig:benchmark-function}. The launch is synchronized by use of a barrier for each iteration. The behaviour then differs depending on the selected submission method, which can be a single submission or a batch of a given size. This can be seen in Figure \ref{fig:benchmark-function} at the switch statement for \enquote{mode}. Single submission follows the example given in Section \ref{sec:state:dml}, and we therefore do not go into detail explaining it here. Batch submission works differently. A sequence of the specified size is created, to which tasks are then added. This sequence is submitted to the engine in a manner similar to the submission of a single descriptor. \par
\section{Benchmarks}
\section{Benchmarks}
\label{sec:perf:bench}
In this section, we will present three benchmarks, each accompanied by setup information and a preview. We will then provide plots displaying the results, followed by a detailed analysis. We will formulate expectations and compare them with the observations from our measurements. \par
@ -40,10 +40,10 @@ With each submission, descriptors must be prepared and sent to the underlying ha
We anticipate that single submissions will consistently yield poorer performance, particularly with a pronounced effect on smaller transfer sizes. This expectation arises from the fact that the overhead of a single submission with the \gls{dsa:swq} is incurred for every iteration, whereas the batch experiences this overhead only once for multiple copies. \par
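To make this expectation concrete, consider a simplified cost model. The symbols are assumptions introduced here for illustration and are not measured quantities: \(t_s\) denotes the fixed cost of one submission to the \gls{dsa:swq}, \(n\) the transfer size of one copy, \(B\) the sustained copy bandwidth of the \gls{dsa}, and \(k\) the batch size. Assuming that submission and execution do not overlap, and neglecting the cost of fetching the batched descriptors from memory, the throughput of both methods is approximately
\[
    T_{\mathrm{single}} = \frac{n}{t_s + \frac{n}{B}}, \qquad
    T_{\mathrm{batch}} = \frac{k\,n}{t_s + \frac{k\,n}{B}} = \frac{n}{\frac{t_s}{k} + \frac{n}{B}} .
\]
Under these assumptions, the ratio
\[
    \frac{T_{\mathrm{batch}}}{T_{\mathrm{single}}} = \frac{t_s + \frac{n}{B}}{\frac{t_s}{k} + \frac{n}{B}}
\]
approaches \(k\) for \(n \to 0\) and \(1\) for \(n \to \infty\). The model therefore predicts the largest advantage of batching for small transfer sizes and a vanishing difference for large ones. \par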
\caption{Throughput for different Submission Methods and Sizes. Performing a copy with source and destination being node 0, executed by the DSA on node 0. Observable is the submission cost which affects small transfer sizes differently, as there the completion time is lower.}
\caption{Throughput for different Submission Methods and Sizes. Performing a copy with source and destination being node 0, executed by the \glsentryshort{dsa} on node 0. The submission cost is observable and affects small transfer sizes more strongly, as their completion time is lower.}
\label{fig:perf-submitmethod}
\end{figure}
@ -56,7 +56,7 @@ Another limitation may be observed in this result, namely the inherent throughpu
As we might encounter access to one \gls{dsa} from multiple threads through the associated \glsentrylong{dsa:swq}, understanding the impact of this type of access is crucial. We benchmark multithreaded submission for one, two, and twelve threads, the latter representing the core count of one processing sub-node on the test system. Each configuration is assigned the same 120 copy tasks split across the available threads, all submitting to one \gls{dsa}. We perform this benchmark with sizes of 1 MiB and 1 GiB to examine whether the behaviour changes with submission size. For smaller sizes, completion may take less time than submission, so threading may have a different effect: multiple threads work to fill the queue, preventing task starvation. We may also experience lower-than-peak throughput with rising thread count, caused by the synchronization inherent in the \gls{dsa:swq}. \par
\caption{Throughput for different Thread Counts and Sizes. Multiple threads submit to the same Shared Work Queue. Performing a copy with source and destination being node 0, executed by the DSA on node 0.}
@ -65,7 +65,7 @@ As we might encounter access to one \gls{dsa} from multiple threads through the
In Figure \ref{fig:perf-mtsubmit}, we note that threading has no discernible negative impact. The synchronization appears to affect single-threaded access in the same manner as it does for multiple threads. Interestingly, for the smaller size of 1 MiB, our assumption proved accurate, and performance actually increased with the addition of threads, which we attribute to enhanced queue usage. However, we were unable to identify an explanation for the speed difference between sizes. This finding contradicts the rationale that larger transfer sizes would reduce the relative impact of submission time and, consequently, result in higher throughput. \par
\subsection{Data Movement from \gls{DRAM} to \gls{HBM}}
\subsection{Data Movement from \glsentryshort{dram} to \glsentryshort{hbm}}
\label{subsec:perf:datacopy}
Moving data from \gls{dram} to \gls{hbm} is most relevant to the rest of this work, as it is the target application. As we discovered in Section \ref{subsec:perf:submitmethod}, one \gls{dsa} has a peak bandwidth limit of 30 GiB/s. We write to \gls{hbm} with its theoretical peak of 256 GB/s \cite[Table I]{hbm-arch-paper}. Our top speed is therefore limited by the slower main memory. For each node, the test system is configured with two DIMMs of DDR5-4800. We calculate the theoretical throughput as follows: \(2\ \mathrm{DIMMs} \times 4800\ \frac{\mathrm{MT}}{\mathrm{s} \cdot \mathrm{DIMM}} \times \frac{64\ \mathrm{bit/transfer}}{8\ \mathrm{bit/byte}} = 76800\ \mathrm{MB/s} = 76.8\ \mathrm{GB/s} \approx 71.5\ \mathrm{GiB/s}\)\todo{how to verify this calculation with a citation? is it even correct because we see close to 100 GiB/s throughput?}. We conclude that to achieve closer-to-peak speeds, a copy task has to be split across multiple \gls{dsa}s. \par
@ -74,17 +74,17 @@ Two methods of splitting will be evaluated. The first employs a brute force appr
For this benchmark, we transfer 1 Gibibyte of data from node 0 to the destination node, employing the submission method previously described. For each utilized node, we spawn one pinned thread responsible for submission. We present data for nodes 8, 11, 12, and 15. To understand the selection, refer to Figure \ref{fig:perf-xeonmaxnuma}, which illustrates the node IDs of the configured systems and the corresponding storage technology. Node 8 accesses the \gls{hbm} on node 0, making it the physically closest possible destination. Node 11 is located diagonally on the chip, representing the furthest intra-socket operation benchmarked. Nodes 12 and 15 lie diagonally on the second socket's CPU, making them representative of inter-socket transfer operations. \par
\caption{Throughput for brute force copy from DRAM to HBM. Using all available DSA. Copying 1 GiB from Node 0 to the Destination Node specified on the x-axis. Shows peak achievable with DSA.}
\caption{Throughput for brute force copy from \glsentryshort{dram} to \glsentryshort{hbm}. Using all available \glsentryshort{dsa}. Copying 1 GiB from \glsentryshort{numa:node} 0 to the destination \glsentryshort{numa:node} specified on the x-axis. Shows peak throughput achievable with \glsentryshort{dsa}.}
\caption{Throughput for smart copy from DRAM to HBM. Using four on-socket DSA for intra-socket operation and the DSA on source and destination node for inter-socket. Copying 1 GiB from Node 0 to the Destination Node specified on the x-axis. Shows conservative performance.}
\caption{Throughput for smart copy from \glsentryshort{dram} to \glsentryshort{hbm}. Using four on-socket \glsentryshort{dsa} for intra-socket operation and the \glsentryshort{dsa} on source and destination node for inter-socket. Copying 1 GiB from \glsentryshort{numa:node} 0 to the destination \glsentryshort{numa:node} specified on the x-axis. Shows conservative performance.}
\label{fig:perf-peak-smart}
\end{figure}
@ -96,19 +96,19 @@ While consuming significantly more resources, the brute force copy depicted in F
To evaluate CPU copy performance, two approaches were selected. One requires no code changes, showing the performance of running code developed for \gls{dsa} on systems where this hardware is not present. For the other approach, we modified the code to achieve peak performance. \par
\caption{Throughput from DRAM to HBM on CPU. Using the exact same code as smart copy, therefore spawning one thread on Node 0, 1, 2 and 3. Copying 1 GiB from Node 0 to the Destination Node specified on the x-axis. This shows the low performance of software path, when not adapting the code.}
\caption{Throughput from \glsentryshort{dram} to \glsentryshort{hbm} on CPU. Using the exact same code as smart copy, therefore spawning one thread on \glsentryshort{numa:node} 0, 1, 2 and 3. Copying 1 GiB from \glsentryshort{numa:node} 0 to the destination \glsentryshort{numa:node} specified on the x-axis. This shows the low performance of the software path when the code is not adapted.}
\label{fig:perf-cpu-peak-smart}
\end{figure}
For Figure \ref{fig:perf-cpu-peak-smart}, we used the smart copy procedure as in Section \ref{subsec:perf:datacopy}, selecting the software path with no other changes to the benchmarking code. \par
\caption{Throughput from DRAM to HBM on CPU. Using 12 Threads spawned on Node 0 for the task. Copying 1 GiB from Node 0 to the Destination Node specified on the x-axis.}
\caption{Throughput from \glsentryshort{dram} to \glsentryshort{hbm} on CPU. Using 12 Threads spawned on \glsentryshort{numa:node} 0 for the task. Copying 1 GiB from \glsentryshort{numa:node} 0 to the destination \glsentryshort{numa:node} specified on the x-axis.}
@ -36,10 +36,10 @@ The task of prefetching is somewhat aligned with that of a cache. As a cache is
The interface of \texttt{Cache} must provide three basic functions: (1) requesting a memory block to be cached, (2) accessing a cached memory block and (3) synchronizing the cache with the source memory. The latter operation comes into play when the data that is cached may also be modified, necessitating an update either from the source or vice versa. Due to the various setups and use cases for this cache, the user should also be responsible for choosing cache placement and the copy method. As re-caching is resource intensive, data should remain in the cache for as long as possible. We only flush entries when a lack of free cache memory requires it. \par
\caption{Public Interface of CacheData and Cache Classes. Colour coding for thread safety. Grey denotes impossibility for threaded access. Green indicates full safety guarantees only relying on atomics to achieve this. Yellow may use locking but is still safe for use. Red must be called from a single threaded context.}
\caption{Public Interface of \texttt{CacheData} and \texttt{Cache} Classes. Colour coding for thread safety. Grey denotes impossibility for threaded access. Green indicates full safety guarantees only relying on atomics to achieve this. Yellow may use locking but is still safe for use. Red must be called from a single threaded context.}
@ -46,7 +46,7 @@ Therefore, the decision was made to implement atomic reference counting for \tex
Additionally, the invalid state of \texttt{CacheData} is avoided. To achieve this, the waiting algorithm requires the handlers to be contained in an atomic pointer, and the pointer to the cache memory must be atomic too. This enables the use of the atomic wait operation, which is typically more efficient than simply spinning on Compare-And-Swap \cite{cppreference:atomic-wait}. Some standard implementations achieve this by yielding after a short spin cycle \cite{atomic-wait-details}. \par
\caption{Sequence for Blocking Scenario. Observable in first draft implementation. Scenario where \(T_1\) performed first access to a datum followed by \(T_2\) and \(T_3\). Then \(T_1\) holds the handlers exclusively, leading to the other threads having to wait for \(T_1\) to perform the work submission and waiting before they can access the datum through the cache.}
@ -55,7 +55,7 @@ Additionally, the invalid state of \texttt{CacheData} is avoided. To achieve thi
Designing the wait to work from any thread was complicated. In the first implementation, a thread would check if the handlers are available and, if not, atomically wait \cite{cppreference:atomic-wait} on a value change from nullptr. As the handlers are only available after submission, a situation could arise where only one copy of \texttt{CacheData} is capable of actually waiting on them. Let us assume that three threads \(T_1\), \(T_2\) and \(T_3\) wish to access the same resource. \(T_1\) is the first to call \texttt{CacheData::Access} and therefore adds it to the cache state and will perform the work submission. Before \(T_1\) can submit the work, it is interrupted, and \(T_2\) and \(T_3\) obtain access to the incomplete \texttt{CacheData} on which they wait. They see a nullptr for the handlers but an invalid cache pointer, leading to an atomic wait on the cache pointer (marked blue lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). Now \(T_1\) submits the work and sets the handlers (marked red lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) and \(T_3\) continue to wait. At this point, only \(T_1\) can trigger the waiting and is therefore capable of keeping \(T_2\) and \(T_3\) from progressing. This is undesirable, as it can lead to deadlocking if for some reason \(T_1\) does not wait, and at the very least may lead to unnecessary delay for \(T_2\) and \(T_3\) if \(T_1\) does not wait immediately. \par