In this chapter, we measure the performance of the \gls{dsa}, with the goal of determining an effective utilization strategy for applying the \gls{dsa} to \gls{qdp}. In Section \ref{sec:perf:method} we lay out our benchmarking methodology, then perform benchmarks in Section \ref{sec:perf:bench} and finally summarize our findings in Section \ref{sec:perf:analysis}. \par
The performance of the \gls{dsa} has been evaluated in great detail by Reese Kuper et al. in \cite{intel:analysis}. Therefore, we perform only a limited number of benchmarks, with the purpose of verifying the figures from \cite{intel:analysis} and analysing best practices and restrictions for applying the \gls{dsa} to \gls{qdp}. \par
\caption{Xeon Max NUMA-Node Layout \cite[Fig. 14]{intel:maxtuning} for a 2-Socket System when configured with HBM-Flat. Showing separate Node IDs for manual HBM access and for Cores and DRAM.}
\label{fig:perf-xeonmaxnuma}
\end{figure}
The benchmarks were conducted on a dual-socket server equipped with two Intel Xeon Max 9468 CPUs, each with 4 nodes that have access to 16 GiB of \gls{hbm} and 12 cores. This results in a total of 96 cores and 128 GiB of \gls{hbm}. The layout of the system is visualized in Figure \ref{fig:perf-xeonmaxnuma}. For configuring it, we follow Section \ref{sec:state:setup-and-config}. \par
As \gls{intel:dml} does not have support for \acrshort{dsa:dwq}s, we run benchmarks exclusively with access through \acrshort{dsa:swq}s. The application written for the benchmarks can be obtained in source form under the directory \texttt{benchmarks} in the thesis repository \cite{thesis-repo}. \par
\begin{figure}[h]
\centering
@ -12,18 +26,19 @@ The performance of \gls{dsa} has been evaluated in great detail by Reese Kuper e
\label{fig:benchmark-function}
\end{figure}
Benchmarks were conducted on an Intel Xeon Max CPU, system configuration following Section \ref{sec:state:setup-and-config} with exclusive access to the system. As \gls{intel:dml} does not have support for \gls{dsa:dwq}, we ran benchmarks exclusively with access through \gls{dsa:swq}. The application written for the benchmarks can be obtained in source form under \texttt{benchmarks} in the thesis repository \cite{thesis-repo}. With the full source code available we only briefly describe a section of pseudocode, as seen in Figure \ref{fig:benchmark-function}, in the following paragraph. \par
The benchmark performs node setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the nodes passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts. These timings are marked with yellow background in Figure \ref{fig:benchmark-function}. To get accurate results, the benchmark is repeated multiple times. At the beginning of each repetition, all threads running will synchronize by use of a barrier. The behaviour then differs depending on the submission method selected which can be a single submission or a batch of given size. This can be seen in Figure \ref{fig:benchmark-function} at the switch statement for \enquote{mode}. Single submission follows the example given in Section \ref{sec:state:dml}, and we therefore do not go into detail explaining it here. Batch submission works unlike the former. A sequence with specified size is created which tasks are then added to. This sequence is then submitted to the engine similar to the submission of a single descriptor. Further steps then follow the example from Section \ref{sec:state:dml} again. \par
The benchmark performs node setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the nodes passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts. To get accurate results, the benchmark is repeated at least 100 times. Each iteration is timed from beginning to end, marked in yellow in Figure \ref{fig:benchmark-function}. The launch of each iteration is synchronized across threads by a barrier. The behaviour then differs depending on the selected submission method, which can be a single submission or a batch of a given size. This can be seen in Figure \ref{fig:benchmark-function} at the switch statement for \enquote{mode}. Single submission follows the example given in Section \ref{sec:state:dml}, and we therefore do not explain it in detail here. Batch submission works differently: a sequence of the specified size is created, to which tasks are then added. This sequence is then submitted to the engine in the same way as a single descriptor. \par
\section{Benchmarks}
\label{sec:perf:bench}
In this section we will describe three benchmarks. Each complete with setup information and preview, followed by plots showing the results and detailed analysis. We formulate expectations and contrast them with the observations from our measurements. Where displayed, the slim grey bars represent the standard deviation across iterations. \par
In this section, we present three benchmarks, each accompanied by setup information and a preview. We then provide plots displaying the results, followed by a detailed analysis. We formulate expectations and compare them with the observations from our measurements. \par
\subsection{Submission Method}
\label{subsec:perf:submitmethod}
With each submission, descriptors must be prepared and sent off to the underlying hardware. This is expected to come with a cost, affecting different transfer sizes and submission methods differently. By submitting different sizes and comparing batching and single submission, we will evaluate which submission method performs best for each tested data size. We expect single submission to perform worse consistently, with a pronounced effect on smaller transfer sizes. This is assumed, as the overhead of a single submission with the \gls{dsa:swq} is incurred for every iteration, while the batch only sees this overhead once for multiple copies. \par
With each submission, descriptors must be prepared and sent to the underlying hardware. This process is expected to incur a cost that affects different transfer sizes and submission methods to different degrees. We submit transfers of different sizes and compare batching with single submission to determine which combination of submission method and size is most effective. \par
We anticipate that single submissions will consistently yield poorer performance, particularly with a pronounced effect on smaller transfer sizes. This expectation arises from the fact that the overhead of a single submission with the \gls{dsa:swq} is incurred for every iteration, whereas the batch experiences this overhead only once for multiple copies. \par
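This expectation can be illustrated with a simplified cost model. It is a sketch under assumptions, not a measured result: the symbols \(s\) (transfer size), \(B\) (sustained copy bandwidth of one \gls{dsa}), \(t_{\mathrm{sub}}\) (fixed overhead per submission) and \(n\) (number of copies per batch) are introduced here for illustration only, and the model ignores any per-descriptor processing cost inside a batch.
\[
T_{\mathrm{single}}(s) = \frac{s}{t_{\mathrm{sub}} + \frac{s}{B}}, \qquad T_{\mathrm{batch}}(s, n) = \frac{n \cdot s}{t_{\mathrm{sub}} + \frac{n \cdot s}{B}}
\]
Both expressions approach \(B\) for large \(s\), while for small \(s\) the batch amortizes \(t_{\mathrm{sub}}\) over \(n\) copies, which is why the difference should be most pronounced for small transfers. \par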
\begin{figure}[h]
\centering
@ -32,14 +47,14 @@ With each submission, descriptors must be prepared and sent off to the underlyin
\label{fig:perf-submitmethod}
\end{figure}
In Figure \ref{fig:perf-submitmethod} we conclude that with transfers of 1 MiB and upwards, the submission method makes no noticeable difference. For smaller transfers the performance varies greatly, with batch operations leading in throughput. This is aligned with the finding that \enquote{SWQ observes lower throughput between 1-8 KB [transfer size]}\cite[p. 6 and 7]{intel:analysis} for normal submission method. \par
From Figure \ref{fig:perf-submitmethod} we conclude that for transfers of 1 MiB and upwards, the submission method makes no noticeable difference. For smaller transfers the performance varies greatly, with batch operations leading in throughput. This finding is in line with the observation that \enquote{SWQ observes lower throughput between 1-8 KB [transfer size]} \cite[p. 6 and 7]{intel:analysis} for the normal submission method. \par
Another limitation may be observed in this result, namely the inherent throughput limit of close to 30 GiB/s per \gls{dsa} chip. This is apparently caused by I/O fabric limitations \cite[p. 5]{intel:analysis}. We therefore conclude that multiple \gls{dsa}s are required to fully utilize the available bandwidth of \gls{hbm}, which theoretically lies at 256 GB/s \cite[Table I]{hbm-arch-paper}. \par
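As a rough estimate of how many engines this requires, both figures can be converted to the same unit. This is a back-of-the-envelope sketch which assumes that the per-\gls{dsa} limit of 30 GiB/s adds up linearly across engines, something the measurement above does not establish:
\[
30\ \mathrm{GiB/s} = 30 \cdot 1.0737\ \mathrm{GB/s} \approx 32.2\ \mathrm{GB/s}, \qquad \left\lceil \frac{256\ \mathrm{GB/s}}{32.2\ \mathrm{GB/s}} \right\rceil = \lceil 7.95 \rceil = 8
\]
Under this assumption, about eight \gls{dsa} chips would be needed to approach the theoretical \gls{hbm} peak. \par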
\subsection{Multithreaded Submission}
\label{subsec:perf:mtsubmit}
As we might encounter access to one \gls{dsa} from multiple threads through the \glsentrylong{dsa:swq} associated with it, determining the effect this type of access has is important. We benchmark multithreaded submission for one, two and twelve threads, the number twelve representing the core count of one node on the test system. Each configuration gets the same 120 copy tasks split across the available threads, all submitting to one \gls{dsa}. We perform this with sizes of 1 MiB and 1 GiB to see whether the behaviour changes with submission size. For smaller sizes, the completion time may be shorter than the submission time. Therefore, smaller task sizes may see different effects of threading, because multiple threads can work to fill the queue and ensure there is never a stall due to it being empty. On the other hand, we might experience lower-than-peak throughput with rising thread count, caused by the synchronization inherent in the \gls{dsa:swq}. \par
As we might encounter access to one \gls{dsa} from multiple threads through the associated \glsentrylong{dsa:swq}, understanding the impact of this type of access is crucial. We benchmark multithreaded submission for one, two, and twelve threads, the latter representing the core count of one processing sub-node on the test system. Each configuration is assigned the same 120 copy tasks split across the available threads, all submitting to one \gls{dsa}. We perform this benchmark with sizes of 1 MiB and 1 GiB to examine whether the behaviour changes with submission size. For smaller sizes, completion may take less time than submission, so threading may have a different effect: multiple threads work to fill the queue, preventing it from running empty and stalling the \gls{dsa}. We may also experience lower-than-peak throughput with rising thread count, caused by the synchronization inherent in the \gls{dsa:swq}. \par
\begin{figure}[h]
\centering
@ -48,75 +63,73 @@ As we might encounter access to one \gls{dsa} from multiple threads through the
\label{fig:perf-mtsubmit}
\end{figure}
In Figure \ref{fig:perf-mtsubmit} we see that threading has no negative impact. The synchronization therefore affects single threaded access in the same way it does for multiple. For the smaller size of 1 MiB our assumption came true and performance actually increased with threads which we attribute to queue usage. We did not find an explanation for the speed difference between sizes, which goes against the rationale that higher transfer size would lead to less impact of submission time and therefore higher throughput. \par
In Figure \ref{fig:perf-mtsubmit}, we see that threading has no discernible negative impact. The synchronization appears to affect single-threaded access in the same manner as it does access from multiple threads. For the smaller size of 1 MiB, our assumption proved accurate: performance increased with the number of threads, which we attribute to better queue utilization. However, we were unable to identify an explanation for the speed difference between sizes. This finding contradicts the rationale that a larger transfer size would reduce the impact of submission time and, consequently, yield higher throughput. \par
\subsection{Data Movement from DDR to HBM}
\subsection{Data Movement from \gls{dram} to \gls{hbm}}
\label{subsec:perf:datacopy}
Moving data from DDR to \gls{hbm} is most relevant to the rest of this work, as it is the target application. As we discovered in Section \ref{subsec:perf:submitmethod}, one \gls{dsa} has a peak bandwidth limit of 30 GiB/s. We write to \gls{hbm} with its theoretical peak of 256 GB/s \cite[Table I]{hbm-arch-paper}. Our top speed is therefore limited by the slower main memory. For each node, the test system is configured with two DIMMs of DDR5-4800. We calculate the theoretical throughput as follows: \(2\ \mathrm{DIMMs} \cdot 4800\ \frac{\mathrm{MT}}{\mathrm{s} \cdot \mathrm{DIMM}} \cdot \frac{64\ \mathrm{bit/T}}{8\ \mathrm{bit/B}} = 76800\ \mathrm{MB/s} = 76.8\ \mathrm{GB/s} \approx 71.5\ \mathrm{GiB/s}\)\todo{how to verify this calculation with a citation? is it even correct because we see close to 100 GiB/s throughput?}. We conclude that to achieve closer-to-peak speeds, a copy task has to be split across multiple \gls{dsa}. \par
Moving data from \gls{dram} to \gls{hbm} is most relevant to the rest of this work, as it is the target application. As we discovered in Section \ref{subsec:perf:submitmethod}, one \gls{dsa} has a peak bandwidth limit of 30 GiB/s. We write to \gls{hbm} with its theoretical peak of 256 GB/s \cite[Table I]{hbm-arch-paper}. Our top speed is therefore limited by the slower main memory. For each node, the test system is configured with two DIMMs of DDR5-4800. We calculate the theoretical throughput as follows: \(2\ \mathrm{DIMMs} \cdot 4800\ \frac{\mathrm{MT}}{\mathrm{s} \cdot \mathrm{DIMM}} \cdot \frac{64\ \mathrm{bit/T}}{8\ \mathrm{bit/B}} = 76800\ \mathrm{MB/s} = 76.8\ \mathrm{GB/s} \approx 71.5\ \mathrm{GiB/s}\)\todo{how to verify this calculation with a citation? is it even correct because we see close to 100 GiB/s throughput?}. We conclude that to achieve closer-to-peak speeds, a copy task has to be split across multiple \gls{dsa}s. \par
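The minimum number of engines follows from a simple estimate. As a sketch, assuming that the throughput of multiple \gls{dsa} chips adds up linearly and using the decimal value of \(76.8\ \mathrm{GB/s}\) (that is, \(76800\ \mathrm{MB/s}\)) for the two DIMMs:
\[
\frac{76.8\ \mathrm{GB/s}}{30\ \mathrm{GiB/s}} \approx \frac{76.8\ \mathrm{GB/s}}{32.2\ \mathrm{GB/s}} \approx 2.4 \quad \Rightarrow \quad \mbox{at least } 3 \mbox{ engines}
\]
This estimate only covers the calculated memory limit and does not account for the higher throughput observed in the measurements. \par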
Two methods of splitting will be evaluated. The first is a brute force approach, utilizing all available \gls{dsa} for any transfer direction. The second's behaviour depends on the data source and destination locations. As our system has multiple sockets, communication crossing the socket could experience a latency and bandwidth disadvantage \todo{cite this}. We argue that for intra-socket transfers, use of the \gls{dsa} from the second socket will have only marginal effect. For transfers crossing the socket, every \gls{dsa} chip will perform badly, and we choose to only use the ones with the fastest possible access, namely the ones on the destination and source node. This might result in lower performance but also uses one fourth of the engines of the brute force approach for inter-socket and half for intra-socket. This gives other threads the chance to utilize these now-free chips. \par
Two methods of splitting will be evaluated. The first employs a brute force approach, utilizing all available \gls{dsa}s for any transfer direction. The second method's behaviour depends on the data source and destination locations. Given that our system consists of multiple sockets, communication crossing between sockets could introduce latency and bandwidth disadvantages\todo{cite this}. We posit that for intra-socket transfers, utilizing the \gls{dsa}s of the second socket will have only a marginal effect. For transfers crossing sockets, we assume that every \gls{dsa} performs equally poorly, prompting us to use only the ones on the source and destination nodes, as they are physically closest to the two memory regions. While this choice may result in lower performance, it uses only one-fourth of the engines of the brute force approach for inter-socket transfers and half for intra-socket transfers. This also frees up the remaining chips for other threads to utilize. \par
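The engine counts behind these fractions can be written out explicitly. This is a sketch assuming one \gls{dsa} per node, four nodes per socket and two sockets, which gives eight engines in total; the symbol \(N\) is introduced here for illustration only:
\[
N_{\mathrm{brute}} = 2 \cdot 4 = 8, \qquad N_{\mathrm{smart,intra}} = 4 = \frac{1}{2} N_{\mathrm{brute}}, \qquad N_{\mathrm{smart,inter}} = 2 = \frac{1}{4} N_{\mathrm{brute}}
\]
\par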
\caption{Xeon Max NUMA-Node Layout \cite[Fig. 14]{intel:maxtuning} for a 2-Socket System when configured with HBM-Flat. Showing separate Node IDs for manual HBM access and for Cores and DDR memory.}
\label{fig:perf-xeonmaxnuma}
\end{figure}
For this benchmark, we copy 1 Gibibyte of data from node 0 to the destination node, using the previously described submission method. For each node utilized, we spawn one thread pinned to it, in charge of submission. We display data for nodes 8, 11, 12 and 15. To understand the selection, we refer to Figure \ref{fig:perf-xeonmaxnuma} which shows the configured systems' node ids and the storage technology they represent. Node 8 accesses the \gls{hbm} on node 0 and is therefore the physically closest destination possible. Node 11 is located diagonally on the chip and therefore the furthest intra-socket operation benchmarked. Node 12 and 15 lie diagonally on the second sockets CPU and should therefore be representative of inter-socket transfer operations. \par
For this benchmark, we transfer 1 GiB of data from node 0 to the destination node, employing the submission method previously described. For each utilized node, we spawn one pinned thread responsible for submission. We present data for nodes 8, 11, 12, and 15. To understand the selection, refer to Figure \ref{fig:perf-xeonmaxnuma}, which illustrates the node IDs of the configured system and the corresponding storage technology. Node 8 accesses the \gls{hbm} on node 0, making it the physically closest possible destination. Node 11 is located diagonally on the chip, representing the furthest intra-socket operation benchmarked. Nodes 12 and 15 lie diagonally on the second socket's CPU, making them representative of inter-socket transfer operations. \par
\caption{Throughput for brute force copy from DDR to HBM. Using all available DSA. Copying 1 GiB from Node 0 to the Destination Node specified on the x-axis. Shows peak achievable with DSA.}
\caption{Throughput for brute force copy from DRAM to HBM. Using all available DSA. Copying 1 GiB from Node 0 to the Destination Node specified on the x-axis. Shows peak achievable with DSA.}
\caption{Throughput for smart copy from DDR to HBM. Using four on-socket DSA for intra-socket operation and the DSA on source and destination node for inter-socket. Copying 1 GiB from Node 0 to the Destination Node specified on the x-axis. Shows conservative performance.}
\caption{Throughput for smart copy from DRAM to HBM. Using four on-socket DSA for intra-socket operation and the DSA on source and destination node for inter-socket. Copying 1 GiB from Node 0 to the Destination Node specified on the x-axis. Shows conservative performance.}
\label{fig:perf-peak-smart}
\end{figure}
From the brute force approach in Figure \ref{fig:perf-peak-brute}, we observe peak speeds of 96 GiB/s when copying across the socket from Node 0 to Node 15. This invalidates our assumption, that peak bandwidth would be achieved in the intra-socket scenario\todo{find out why maybe?} and goes against the calculated peak throughput of the memory on our system\todo{calculation wrong? other factors?}. The results however, align with findings in \cite[Fig. 10]{intel:analysis}. \par
From the results of the brute force approach illustrated in Figure \ref{fig:perf-peak-brute}, we observe peak speeds of 96 GiB/s when copying across the socket from Node 0 to Node 15. This contradicts our initial assumption that peak bandwidth would be achieved in the intra-socket scenario\todo{find out why, perhaps due to certain factors?} and goes against the calculated peak throughput of the memory on our system\todo{potential calculation error or other influencing factors?}. Nevertheless, these results align with findings presented in \cite[Fig. 10]{intel:analysis}. \par
While using significantly more resources, the brute force copy shown in Figure \ref{fig:perf-peak-brute} outperforms the smart approach from Figure \ref{fig:perf-peak-smart}. We observe an increase in transfer speed by utilizing all available \gls{dsa} of 2 GiB/s for copy to Node 8, 18 GiB/s for Node 11 and 12 and 30 GiB/s for Node 15. The smart approach could accommodate another intra-socket copy on the second socket, we assume without seeing negative impact. From this we conclude that the smart copy assignment is worth using, as it provides better scalability. \par
While consuming significantly more resources, the brute force copy depicted in Figure \ref{fig:perf-peak-brute} surpasses the performance of the smart approach shown in Figure \ref{fig:perf-peak-smart}. By utilizing all available \gls{dsa}s, we observe an increase in transfer speed of 2 GiB/s for copying to Node 8, 18 GiB/s for Nodes 11 and 12, and 30 GiB/s for Node 15. We assume that the smart approach could accommodate another intra-socket copy on the second socket without negative impact. From this, we conclude that the smart copy assignment is worth using, as it provides better scalability. \par
\subsection{Data Movement using CPU}
For evaluating CPU copy performance two approaches were selected. For Figure \ref{fig:perf-cpu-peak-smart} we used the smart copy procedure as in \ref{subsec:perf:datacopy}, just selecting the software path, with no other changes to the test. In Figure \ref{fig:perf-cpu-peak-brute} the test was adapted to spawn 12 threads on Node 0. \par
For evaluating CPU copy performance two approaches were selected. One requires no code changes, showing the performance of running code developed for \gls{dsa} on systems where this hardware is not present. For the other approach we modified the code to result in peak performance. \par
\caption{Throughput from DDR to HBM on CPU. Using 12 Threads spawned on Node 0 for the task. Copying 1 GiB from Node 0 to the Destination Node specified on the x-axis.}
\caption{Throughput from DRAM to HBM on CPU. Using the exact same code as smart copy, therefore spawning one thread on Node 0, 1, 2 and 3. Copying 1 GiB from Node 0 to the Destination Node specified on the x-axis. This shows the low performance of software path, when not adapting the code.}
\label{fig:perf-cpu-peak-smart}
\end{figure}
The benchmark resulting in Figure \ref{fig:perf-cpu-peak-brute} fully utilizes node 0 of the test system through spawning 12 threads. We attribute the low performance of the inter-socket copy operation to overhead of the interconnect. The slight difference between node 12 and 15 may be explained using Figure \ref{fig:perf-xeonmaxnuma}, where we observe that physically node 12 is closer to node 0 than node 15. \par
For Figure \ref{fig:perf-cpu-peak-smart} we used the smart copy procedure as in \ref{subsec:perf:datacopy}, selecting the software path with no other changes to the benchmarking code. \par
\caption{Throughput from DDR to HBM on CPU. Using the exact same code as smart copy, therefore spawning one thread on Node 0, 1, 2 and 3. Copying 1 GiB from Node 0 to the Destination Node specified on the x-axis. This shows the low performance of software path, when not adapting the code.}
\caption{Throughput from DRAM to HBM on CPU. Using 12 Threads spawned on Node 0 for the task. Copying 1 GiB from Node 0 to the Destination Node specified on the x-axis.}
\label{fig:perf-cpu-peak-brute}
\end{figure}
Comparing the results in Figure \ref{fig:perf-cpu-peak-smart} with the data from the adapted test displayed in Figure \ref{fig:perf-cpu-peak-brute} we conclude that the software path just serves compatibility. The throughput is comparatively low, requiring code changes to increase performance. This leads us not to recommend using it. \par
The benchmark resulting in Figure \ref{fig:perf-cpu-peak-brute} fully utilizes node 0 of the test system by spawning 12 threads on it. This level of utilization required code modifications beyond selecting a different execution path and would therefore not be achieved without additional programming efforts. \par
We attribute the low performance of the inter-socket copy operation to the overhead of the interconnect between the two sockets. We attribute the slight difference between nodes 12 and 15 to their on-chip location relative to node 0. As Figure \ref{fig:perf-xeonmaxnuma} shows, node 12 is slightly closer to node 0. \par
Comparing the results in Figure \ref{fig:perf-cpu-peak-smart} with the data from the adapted test displayed in Figure \ref{fig:perf-cpu-peak-brute}, we conclude that the software path primarily serves compatibility. Its throughput is comparatively low, and code changes are required to increase performance. This leads us to advise against its use. \par
\section{Analysis}
\label{sec:perf:analysis}
In this section we summarize the conclusions drawn from the three benchmarks performed in the sections above and outline a utilization guideline. We also compare CPU and \gls{dsa} for the task of copying data from DDR to \gls{hbm}. \par
In this section we summarize the conclusions drawn from the three benchmarks performed in the sections above and outline a utilization guideline. We also compare CPU and \gls{dsa} for the task of copying data from \gls{dram} to \gls{hbm}. \par
\begin{enumerate}
\item From \ref{subsec:perf:submitmethod} we conclude that small copies under 1 MiB in size require batching and still do not reach peak performance. Datum size should therefore be at or above 1 MiB if possible.
\begin{itemize}
\item From \ref{subsec:perf:submitmethod} we conclude that small copies under 1 MiB in size require batching and still do not reach peak performance. Task size should therefore be at or above 1 MiB if possible.
\item Subsection \ref{subsec:perf:mtsubmit} assures that access from multiple threads does not negatively affect the performance when using \glsentrylong{dsa:swq} for work submission.
\item In \ref{subsec:perf:datacopy}, we chose to use the presented smart copy methodology to split copy tasks across multiple \gls{dsa} chips to achieve low utilization with acceptable performance.
\end{enumerate}
\end{itemize}
We again refer to Figures \ref{fig:perf-peak-brute} and \ref{fig:perf-cpu-peak-brute} which both represent the maximum throughput achieved with utilization of either \gls{dsa} for the former and CPU for the latter. Noticeably, the \gls{dsa} does not seem to suffer from the inter-socket overhead like the CPU. To the contrary, we do see the highest throughput when copying across sockets\todo{explanation?}. In any case, \gls{dsa} outperforms the CPU for transfer sizes over 1 MiB, demonstrating its potential to increase throughput while at the same time freeing CPU cycles. \par
Once again, we refer to Figures \ref{fig:perf-peak-brute} and \ref{fig:perf-cpu-peak-brute}, which represent the maximum throughput achieved with the \gls{dsa} and the CPU, respectively. Noticeably, the \gls{dsa} does not seem to suffer from inter-socket overhead like the CPU. On the contrary, we observe the highest throughput when copying across sockets\todo{explanation needed?}. In any case, the \gls{dsa} outperforms the CPU for transfer sizes over 1 MiB, demonstrating its potential to increase throughput while simultaneously freeing up CPU cycles. \par