

								\chapter{Performance Microbenchmarks}

								\label{chap:perf}


In this chapter, we measure the performance of the \gls{dsa} with the goal of determining an effective strategy for applying the \gls{dsa} to \gls{qdp}. In Section \ref{sec:perf:method} we lay out our benchmarking methodology, then perform benchmarks in Section \ref{sec:perf:bench}, and finally summarize our findings in Section \ref{sec:perf:analysis}. As the performance of the \gls{dsa} has been evaluated in great detail by Reese Kuper et al. \cite{intel:analysis}, we perform only a limited number of benchmarks, with the purpose of determining the behaviour in a multi-socket system, the penalties from using \gls{intel:dml}, and the throughput for transfers from \glsentryshort{dram} to \gls{hbm}. \par


								\section{Benchmarking Methodology}

								\label{sec:perf:method}


								\begin{figure}[!b]

								    \centering

								    \includegraphics[width=0.9\textwidth]{images/xeonmax-numa-layout.png}

								    \caption{Xeon Max \glsentrylong{node} Layout \cite[Fig. 14]{intel:maxtuning} for a 2-Socket System when configured with HBM-Flat. Showing separate \glsentryshort{numa:node} IDs for manual \glsentryshort{hbm} access and for Cores and \glsentryshort{dram}.}

								    \label{fig:perf-xeonmaxnuma}

								\end{figure}


The benchmarks were conducted on a dual-socket server equipped with two Intel Xeon Max 9468 CPUs \cite{intel:xeonmax-ark}, each with four \glsentryshort{numa:node}s that each have 12 physical cores and access to 16 GiB of \gls{hbm}. This results in a total of 96 cores and 128 GiB of \gls{hbm}. The layout of the system is visualized in Figure \ref{fig:perf-xeonmaxnuma}. To configure it, we follow Section \ref{sec:state:setup-and-config}. \par


As \gls{intel:dml} does not support \glsentryshort{dsa:dwq}s (see Section \ref{sec:state:dml}), we run benchmarks exclusively with access through \glsentryshort{dsa:swq}s. The source code of the benchmark application is available in the directory \texttt{benchmarks} of the thesis repository \cite{thesis-repo}. \par


								\pagebreak


								\begin{figure}[t]

								    \centering

								    \includegraphics[width=0.35\textwidth]{images/nsd-benchmark.pdf}

								    \caption{Outer Benchmark Procedure Pseudocode. Timing marked with yellow background. Showing preparation of memory locations, clearing of cache entries, timing points and synchronized benchmark launch.}

								    \label{fig:benchmark-function:outer}

								\end{figure}


								The benchmark performs \gls{numa:node}-setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the \gls{numa:node}s passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts, as demonstrated by the calls to \texttt{memset} in Figure \ref{fig:benchmark-function:outer}. We therefore also do not use \enquote{.block\_on\_fault()}, as we did for the memcpy-example in Section \ref{sec:state:dml}. \par


Timing in the outer loop may report a lower throughput than was actually achieved. This is the case if one of the \gls{dsa}s participating in a given task finishes earlier than the others. We decided to measure the maximum time, and therefore the minimum throughput, in these cases, as we want the benchmarks to represent the peak achievable when distributing one task over multiple engines, not when executing multiple independent tasks. As a task can only be considered complete once all of its subtasks have completed, the minimum throughput represents this scenario. This may give an advantage to the peak CPU throughput benchmark that we reference later on, as it is not subject to this restriction. \par
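
The timing rule can be sketched as a formula. The symbols are introduced here purely for illustration and do not appear in the benchmark code: let \(S\) be the total size of a copy task distributed over \(n\) \gls{dsa}s and \(t_i\) the time measured for the \(i\)-th participating \gls{dsa}. The throughput reported for the task is then
\begin{equation*}
    T = \frac{S}{\max_{1 \le i \le n} t_i} ,
\end{equation*}
so the slowest participant determines the result, even if all others finished earlier. \par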


								\begin{figure}[t]

								    \centering

								    \includegraphics[width=1.0\textwidth]{images/nsd-benchmark-inner.pdf}

								    \caption{Inner Benchmark Procedure Pseudocode. Showing work submission for single and batch submission.}

								    \label{fig:benchmark-function:inner}

								\end{figure}


To get accurate results, the benchmark is repeated \(10\) times. Each iteration is timed from beginning to end, marked in yellow in Figure \ref{fig:benchmark-function:outer}. For small task sizes, an iteration completes in under \(100\ ms\). Such a short execution window can adversely affect the timings. Therefore, we repeat the code of the inner loop a configurable number of times, effectively extending the duration of a single iteration in these cases. Figure \ref{fig:benchmark-function:inner} depicts this behaviour as the for-loop. The chosen internal repetition count is \(10000\) for transfers in the range of \(1\)--\(8\ KiB\), \(1000\) for \(1\ MiB\), and one for larger transfers. \par
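
As a rough illustration of what the internal repetition count achieves, the data volume moved within one timed iteration is the product of repetition count and transfer size. For the upper end of the smallest size range and for \(1\ \mathrm{MiB}\) transfers, this works out to
\begin{align*}
    10000 \times 8\ \mathrm{KiB} &= 80000\ \mathrm{KiB} = 78.125\ \mathrm{MiB} \\
    1000 \times 1\ \mathrm{MiB} &= 1000\ \mathrm{MiB} \approx 0.98\ \mathrm{GiB}
\end{align*}
so that a single timed iteration covers considerably more work than one small transfer would on its own. \par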


For each \gls{dsa} used in the benchmark, a submission thread executing the inner benchmark routine is spawned. The launch is synchronized with a barrier in each iteration, as shown by the accesses to \enquote{LAUNCH\_BARRIER} in Figures \ref{fig:benchmark-function:inner} and \ref{fig:benchmark-function:outer}. The behaviour of the inner function then depends on the selected submission method, which is either a single submission or a batch of a given size. This selection is displayed in Figure \ref{fig:benchmark-function:inner} at the switch statement on \enquote{mode}. Single submission follows the example given in Section \ref{sec:state:dml}, so we do not explain it in detail here. Batch submission works differently: a sequence of the specified size is created, to which tasks are then added. The prepared sequence is submitted to the engine in the same way as a single descriptor. \par


								\section{Benchmarks}

								\label{sec:perf:bench}


								In this section, we will introduce three benchmarks, providing setup information and a brief overview of each. Subsequently, we will present plots illustrating the results, followed by a comprehensive analysis. Our approach involves formulating expectations and juxtaposing them with the observations derived from our measurements. \par


								\subsection{Submission Method}

								\label{subsec:perf:submitmethod}


With each submission, descriptors must be prepared and sent to the underlying hardware. This process is expected to incur a cost whose impact on throughput depends on the transfer size and the submission method. We submit transfers of different sizes and compare batching with single submission to determine which combination of submission method and size is most effective. \par


We anticipate that single submissions will consistently yield poorer performance, with a particularly pronounced effect on smaller transfer sizes. This expectation arises from the fact that the overhead of a single submission to the \gls{dsa:swq} is incurred for every copy, whereas a batch incurs it only once for multiple copies. \par
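
This expectation can be made concrete with a deliberately simplified cost model. The model and its symbols are our assumption for illustration, not a measured property of the hardware: let \(t_s\) denote a fixed overhead per submission, \(B\) the peak copy bandwidth, \(S\) the transfer size, and \(n\) the number of copies. The total time for each submission method is then
\begin{align*}
    t_{\mathrm{single}}(n) &= n \left( t_s + \frac{S}{B} \right) \\
    t_{\mathrm{batch}}(n) &= t_s + n\,\frac{S}{B}
\end{align*}
Under this model, the effective throughput of single submission is \(S / (t_s + S/B)\), which approaches \(B\) only once \(S/B\) dominates \(t_s\), whereas batching amortizes \(t_s\) over all \(n\) copies. The model ignores the preparation cost of the individual descriptors in a batch and any overlap between submission and execution. \par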


								\begin{figure}[t]

								    \centering

								    \includegraphics[width=0.5\textwidth]{images/plot-submitmethod.pdf}

								    \caption{Throughput for different Submission Methods and Sizes. Performing a copy with source and destination being \glsentryshort{numa:node} 0, executed by the \glsentryshort{dsa} on \glsentryshort{numa:node} 0. Observable is the submission cost which affects small transfer sizes differently.}

								    \label{fig:perf-submitmethod}

								\end{figure}


From Figure \ref{fig:perf-submitmethod} we conclude that for transfers of \(1\ MiB\) and upwards, the cost of single submission drops. Even for the largest transfer in this benchmark, we still observe a throughput difference between batching and single submission, pointing to substantial costs for task preparation and submission to the \gls{dsa:swq}. For smaller transfers the performance varies greatly, with batch operations leading in throughput. Reese Kuper et al. noted that \enquote{SWQ observes lower throughput between \(1-8\ KB\) [transfer size]} \cite[p. 6]{intel:analysis}. We, however, measured a much higher point of equalization, pointing to additional delays introduced by programming the \gls{dsa} through \gls{intel:dml}. Another limitation visible in our results is the inherent throughput limit of close to \(30\ GiB/s\) per \gls{dsa}, which is caused by I/O fabric limitations \cite[p. 5]{intel:analysis}. \par


								\subsection{Multithreaded Submission}

								\label{subsec:perf:mtsubmit}


As one \gls{dsa} may be accessed from multiple threads through the associated \glsentrylong{dsa:swq}, understanding the impact of this type of access is crucial. We benchmark multithreaded submission with one, two, and twelve threads, the latter representing the core count of one processing sub-node on the test system. All spawned threads submit to the same \gls{dsa}. Furthermore, we perform this benchmark with sizes of \(1\ MiB\) and \(1\ GiB\) to examine whether the behaviour changes with submission size. For smaller sizes, a transfer may complete faster than the next one can be submitted, so multiple threads working to fill the queue can prevent task starvation, and threading may therefore have a different effect. We may also observe lower-than-peak throughput with rising thread count, caused by the synchronization inherent to the \gls{dsa:swq}. \par


								\begin{figure}[t]

								    \centering

								    \includegraphics[width=0.5\textwidth]{images/plot-mtsubmit.pdf}

								    \caption{Throughput for different Thread Counts and Sizes. Multiple threads submit to the same Shared Work Queue. Performing a copy with source and destination being \glsentryshort{numa:node} 0, executed by the DSA on \glsentryshort{numa:node} 0.}

								    \label{fig:perf-mtsubmit}

								\end{figure}


In Figure \ref{fig:perf-mtsubmit}, we note that threading has no discernible negative impact. The synchronization appears to affect single-threaded access in the same manner as it does access from multiple threads. Interestingly, for the smaller size of \(1\ MiB\), our assumption proved accurate, and performance increased with the addition of threads, which we attribute to better queue utilization. We attribute the higher throughput observed with \(1\ GiB\) to the submission delay, which is incurred more frequently with smaller transfer sizes. \par


								\subsection{Data Movement from \glsentryshort{dram} to \glsentryshort{hbm}}

								\label{subsec:perf:datacopy}


\begin{align}
    2\ DIMM \times \frac{4400\ MT}{s\ \times\ DIMM} &= 8800\ MT/s \label{eq:perf:transfers} \\
    \frac{64b}{8b/B}\ /\ T &= 8\ B/T \label{eq:perf:bytes} \\
    8800\ MT/s \times 8B/T = 70400 \times 10^6 B/s &= 65.56\ GiB/s \label{eq:perf:bandwidth}
\end{align}

Moving data from \glsentryshort{dram} to \gls{hbm} is most relevant to the rest of this work, as it is the target application. With \gls{hbm} offering higher bandwidth than the \glsentryshort{dram} of our system, we will be restricted by the available bandwidth of the source. To determine the upper limit achievable, we must calculate the available peak bandwidth. For each \gls{numa:node}, the test system is configured with two DIMMs of DDR5-4800. The naming scheme contains the data rate in Megatransfers (MT) per second; however, the processor specification notes that for dual channel operation, the maximum supported speed drops to \(4400\ MT/s\) \cite{intel:xeonmax-ark}. We calculate the transfers performed per second for one \gls{numa:node} in Equation \eqref{eq:perf:transfers}, followed by the bytes per transfer \cite{kingston:ddr5-spec-overview} in Equation \eqref{eq:perf:bytes}, and finally combine the two into the theoretical peak bandwidth per \gls{numa:node} on the system in Equation \eqref{eq:perf:bandwidth}. \par


\begin{figure}[t]
    \centering
    \begin{subfigure}[t]{0.225\textwidth}
        \centering
        \includegraphics[width=\textwidth]{images/plot-1dsa-throughput.pdf}
        \caption{One \gls{dsa}: \gls{dsa} on source \glsentryshort{numa:node}.}
        \label{fig:perf-dsa:1}
    \end{subfigure}
    \hspace{1mm}
    \begin{subfigure}[t]{0.225\textwidth}
        \centering
        \includegraphics[width=\textwidth]{images/plot-2dsa-throughput.pdf}
        \caption{Two \gls{dsa}s, or \enquote{Push-Pull}: using the \glsentryshort{dsa}s on the source and destination \glsentryshort{numa:node}; for intra-node copies, using the on-node and one off-node \glsentryshort{dsa}.}
        \label{fig:perf-dsa:2}
    \end{subfigure}
    \hspace{1mm}
    \begin{subfigure}[t]{0.225\textwidth}
        \centering
        \includegraphics[width=\textwidth]{images/plot-4dsa-throughput.pdf}
        \caption{Four \gls{dsa}s: using four on-socket \glsentryshort{dsa}s for intra-socket copies and two on each socket for inter-socket copies.}
        \label{fig:perf-dsa:4}
    \end{subfigure}
    \hspace{1mm}
    \begin{subfigure}[t]{0.225\textwidth}
        \centering
        \includegraphics[width=\textwidth]{images/plot-8dsa-throughput.pdf}
        \caption{Eight \gls{dsa}s, or \enquote{Brute-Force}: using all available \glsentryshort{dsa}s, irrespective of source and destination locations.}
        \label{fig:perf-dsa:8}
    \end{subfigure}
    \caption{Copy from \glsentryshort{numa:node} 0 to the destination \glsentryshort{numa:node} specified on the x-axis. Shows the peak throughput achievable with the \glsentryshort{dsa} for different load balancing techniques.}
    \label{fig:perf-dsa}
\end{figure}


From the observed bandwidth limitation of a single \gls{dsa} of about \(30\ GiB/s\) (see Section \ref{subsec:perf:submitmethod}) and the available memory bandwidth of \(65.56\ GiB/s\), we conclude that a copy task has to be split across multiple \gls{dsa}s to achieve peak throughput. Different methods of splitting will be evaluated. Given that our system consists of multiple processors, data movement between sockets must also be evaluated, as additional bandwidth limitations may be encountered \cite{bench:heterogeneous-communication}. Beyond two \gls{dsa}s, only marginal gains are to be expected, due to the throughput limitation of the available source memory. \par
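
The expectation of marginal gains beyond two \gls{dsa}s follows from a short calculation, taking the approximate per-\gls{dsa} limit of \(30\ \mathrm{GiB/s}\) at face value:
\begin{align*}
    2 \times 30\ \mathrm{GiB/s} &= 60\ \mathrm{GiB/s} \\
    \frac{60\ \mathrm{GiB/s}}{65.56\ \mathrm{GiB/s}} &\approx 0.915
\end{align*}
Two \gls{dsa}s could thus already cover about \(91.5\,\%\) of the theoretical source bandwidth, leaving at most about \(5.56\ \mathrm{GiB/s}\) to be gained from additional accelerators. \par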


To determine the optimal number of \gls{dsa}s, we measure throughput for one, two, four, and eight \gls{dsa}s participating in the copy operation. We name the utilization of two \gls{dsa}s \enquote{Push-Pull}, as we use the accelerators found on the source and destination \gls{numa:node}. As eight \gls{dsa}s is the maximum available on our system, this configuration is referred to as \enquote{Brute-Force}. \par


For this benchmark, we transfer \(1\ GiB\) of data from \gls{numa:node} 0 to the destination \gls{numa:node}. We present data for \gls{numa:node}s 8, 11, 12, and 15. To understand the selection, see Figure \ref{fig:perf-xeonmaxnuma}, which illustrates the \gls{numa:node} IDs of the configured system and the corresponding storage technology. \gls{numa:node} 8 accesses the \gls{hbm} on \gls{numa:node} 0, making it the physically closest possible destination. \gls{numa:node} 11 is located diagonally on the chip, representing the farthest intra-socket operation benchmarked. \gls{numa:node}s 12 and 15 lie diagonally on the second socket's CPU, making them representative of inter-socket transfer operations. \par


We begin by examining the common behaviour of the load balancing techniques depicted in Figure \ref{fig:perf-dsa}. The real-world peak throughput of \(64\ GiB/s\) approaches the calculated available bandwidth. In Figure \ref{fig:perf-dsa:1}, a notable hard bandwidth limit is observed just below the \(30\ GiB/s\) mark, reinforcing what was encountered in Section \ref{subsec:perf:submitmethod}: a single \gls{dsa} is constrained by I/O fabric limitations. \par
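
To quantify how closely the measured peak approaches the calculated bandwidth, the two values can be put in relation:
\begin{equation*}
    \frac{64\ \mathrm{GiB/s}}{65.56\ \mathrm{GiB/s}} \approx 0.976
\end{equation*}
The best measured configuration therefore reaches roughly \(97.6\,\%\) of the theoretical peak bandwidth of one \gls{numa:node}. \par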


								Unexpected throughput differences are evident for all configurations, except the bandwidth-bound single \gls{dsa}. Notably, \gls{numa:node} 8 performs worse than copying to \gls{numa:node} 11. As \gls{numa:node} 8 serves as the \gls{hbm} accessor for the data source \gls{numa:node}, the data path is the shortest tested. The lower throughput suggests that the \gls{dsa} may suffer from sharing parts of the path for reading and writing in this situation. Another interesting observation is that, contrary to our assumption, the physically more distant \gls{numa:node} 15 achieves higher throughput than the closer \gls{numa:node} 12. We lack an explanation for this anomaly and will further examine this behaviour in the analysis of the CPU throughput results in Section \ref{subsec:perf:cpu-datacopy}. \par


\begin{figure}[t]
    \centering
    \begin{subfigure}[t]{0.35\textwidth}
        \centering
        \includegraphics[width=\textwidth]{images/plot-average-throughput.pdf}
        \caption{Average throughput for different numbers of participating \gls{dsa}s.}
        \label{fig:perf-dsa-analysis:average}
    \end{subfigure}
    \hspace{5mm}
    \begin{subfigure}[t]{0.35\textwidth}
        \centering
        \includegraphics[width=\textwidth]{images/plot-dsa-throughput-scaling.pdf}
        \caption{Scaling behaviour for different numbers of participating \gls{dsa}s. Displays the throughput as a factor of that achieved with one \gls{dsa}.}
        \label{fig:perf-dsa-analysis:scaling}
    \end{subfigure}
    \caption{Scalability analysis for different numbers of participating \gls{dsa}s. Displays the average throughput and the derived scaling factor. Shows that, although throughput does increase when adding more accelerators, the gain drops significantly beyond two. Calculated over the results from Figure \ref{fig:perf-dsa} and therefore applies to copies from \glsentryshort{dram} to \glsentryshort{hbm}.}
    \label{fig:perf-dsa-analysis}
\end{figure}


								For the results of the Brute-Force approach illustrated in Figure \ref{fig:perf-dsa:8}, we observe peak speeds, close to the calculated theoretical limit, when copying across sockets from \gls{numa:node} 0 to \gls{numa:node} 15. This contradicts our assumption that peak bandwidth would be limited by the interconnect. However, for intra-node copies, there is an observable penalty for using the off-socket \gls{dsa}s. We will analyse this behaviour by comparing the different benchmarked configurations and summarize our findings on scalability. \par


When comparing the Brute-Force approach with Push-Pull in Figure \ref{fig:perf-dsa-analysis:scaling}, average performance decreases even though four times as many resources are utilized over a longer duration. Comparing Figures \ref{fig:perf-dsa:2} and \ref{fig:perf-dsa:8} shows that Brute-Force still leads to a slight increase in throughput for inter-socket operations, although far from scaling linearly. Therefore, we conclude that, although data movement across the interconnect incurs additional cost, no hard bandwidth limit is observable, as inter-socket copies reach the same peak speed that is observed for intra-socket copies with four \gls{dsa}s. This might point to an architectural advantage, as we will encounter the expected speed reduction for copies crossing the socket boundary when executed on the CPU in Section \ref{subsec:perf:cpu-datacopy}. \par


								\pagebreak


From the average throughput and scaling factors in Figure \ref{fig:perf-dsa-analysis}, it becomes evident that splitting tasks over more than two \gls{dsa}s yields only marginal gains. This could be due to increased congestion of the processors' interconnect; however, as no hard limit is encountered, this explanation is not definitive. Utilizing off-socket \gls{dsa}s proves disadvantageous in the average case, reinforcing the claim of communication overhead for crossing the socket boundary. \par
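
One way to write down the scaling factor, using notation introduced here for illustration and assuming it matches how Figure \ref{fig:perf-dsa-analysis:scaling} was derived, is as the average throughput \(\bar{T}(n)\) with \(n\) \gls{dsa}s relative to that of a single \gls{dsa}:
\begin{equation*}
    s(n) = \frac{\bar{T}(n)}{\bar{T}(1)}
\end{equation*}
With a single \gls{dsa} limited to about \(30\ \mathrm{GiB/s}\) and the source memory to \(65.56\ \mathrm{GiB/s}\), a rough upper bound is
\begin{equation*}
    s_{\max} \approx \frac{65.56\ \mathrm{GiB/s}}{30\ \mathrm{GiB/s}} \approx 2.19 ,
\end{equation*}
independent of \(n\). Under these assumptions, no configuration can be expected to scale much beyond a factor of two, which is consistent with the marginal gains observed for four and eight \gls{dsa}s. \par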


								The choice of a load balancing method is not trivial. Consulting Figure \ref{fig:perf-dsa-analysis:average}, the highest throughput is achieved by using four \gls{dsa}s. At the same time, this causes high system utilization, making it unsuitable for situations where resources are to be distributed among multiple control flows. For this case, Push-Pull achieves performance close to the real-world peak while also not wasting resources due to poor scaling (see Figure \ref{fig:perf-dsa-analysis:scaling}). \par


								\pagebreak


\begin{figure}[t]
    \centering
    \begin{subfigure}[t]{0.35\textwidth}
        \centering
        \includegraphics[width=\textwidth]{images/plot-8cpu-throughput.pdf}
        \caption{DML code for allnodes running on the software path.}
        \label{fig:perf-cpu:swpath}
    \end{subfigure}
    \hspace{5mm}
    \begin{subfigure}[t]{0.35\textwidth}
        \centering
        \includegraphics[width=\textwidth]{images/plot-andrepeak-throughput.pdf}
        \caption{Colleague's CPU peak throughput benchmark \cite{xeonmax-peakthroughput} results.}
        \label{fig:perf-cpu:andrepeak}
    \end{subfigure}
    \caption{Throughput from \glsentryshort{dram} to \glsentryshort{hbm} on CPU. Copying from \glsentryshort{numa:node} 0 to the destination \glsentryshort{numa:node} specified on the x-axis.}
    \label{fig:perf-cpu}
\end{figure}


								\subsection{Data Movement using CPU}

								\label{subsec:perf:cpu-datacopy}


To evaluate CPU copy performance, we use the benchmark code from the previous section (Section \ref{subsec:perf:datacopy}), selecting the software instead of the hardware execution path (see Section \ref{subsec:state:dsa-software-view}). Colleagues performed extensive benchmarking of the peak CPU throughput on the test system \cite{xeonmax-peakthroughput}, from which we present results as well. We compare the expectations and results from the previous section with these measurements. \par


As evident from Figure \ref{fig:perf-cpu:swpath}, the observed throughput of the software path is less than half of the theoretical bandwidth. Therefore, the software path should be treated as a compatibility measure and not as a means of providing high-performance data copy operations. In Figure \ref{fig:perf-cpu:andrepeak}, peak throughput is achieved for intra-node operation, validating the assumption that there is a cost for communicating across sockets, which was not as directly observable with the \gls{dsa}. The same disadvantage for \gls{numa:node} 12 that was observed in Section \ref{subsec:perf:datacopy} can be seen in Figure \ref{fig:perf-cpu}. This points to an architectural anomaly which we could not explain with our knowledge or benchmarks. Further benchmarks conducted to investigate it did not yield conclusive results, as the source and copy-thread location did not seem to affect the observed speed delta. \par


								\pagebreak


								\section{Analysis}

								\label{sec:perf:analysis}


								In this section we summarize the conclusions from the performed benchmarks, outlining a utilization guideline. \par


								\begin{itemize}

    \item From Section \ref{subsec:perf:submitmethod} we conclude that small copies under \(1\ MiB\) in size require batching and still do not reach peak performance. Task size should therefore be at or above \(1\ MiB\). Otherwise, offloading might prove more expensive than performing the copy on the CPU.

    \item Section \ref{subsec:perf:mtsubmit} shows that access from multiple threads does not negatively affect performance when using a \glsentrylong{dsa:swq} for work submission. Due to the lack of \glsentrylong{dsa:dwq} support, we have no data to determine the cost of submission to the \gls{dsa:swq}.

    \item In Section \ref{subsec:perf:datacopy}, we found that using more than two \gls{dsa}s results in only marginal gains. The load balancer of choice is therefore the Push-Pull configuration, as it achieves fair throughput with low utilization.

    \item Combining the results from Sections \ref{subsec:perf:datacopy} and \ref{subsec:perf:submitmethod}, we posit that for situations with smaller transfer sizes and a high number of tasks, splitting a copy might prove disadvantageous. A split incurs additional submission delay, while overall throughput remains high without it due to queue filling (see Section \ref{subsec:perf:mtsubmit}), so the split might reduce overall effectiveness. To utilize the available resources effectively, distributing tasks across the available \gls{dsa}s is still desirable. This finding led us to implement round-robin balancing in Section \ref{sec:impl:application}.

								\end{itemize}


Once again, we refer to Figures \ref{fig:perf-dsa} and \ref{fig:perf-cpu}, which represent the maximum throughput achieved with the \gls{dsa} and the CPU, respectively. Noticeably, the \gls{dsa} does not seem to suffer from inter-socket overhead like the CPU. The \gls{dsa} performs similarly to the CPU for intra-node data movement, while outperforming it in inter-node scenarios. The latter, as mentioned in Section \ref{subsec:perf:datacopy}, might point to an architectural advantage of the \gls{dsa}. The performance observed in the above benchmarks demonstrates the potential for rapid data movement while simultaneously relieving the CPU of this task, thereby freeing capacity for computational tasks. \par


								We encountered an anomaly on \gls{numa:node} 12 for which we were unable to find an explanation. Since this behaviour is also observed on the CPU, pointing to a problem in the system architecture, identifying the root cause falls beyond the scope of this work. Despite being unable to account for all measurements, this chapter still offers valuable insights into the performance of the \gls{dsa}, highlighting both its strengths and weaknesses. It provides data-driven guidance for a complex architecture, aiding in the determination of an optimal utilization strategy for the \gls{dsa}. \par


								%%% Local Variables:

								%%% TeX-master: "diplom"

								%%% End: