@ -16,7 +16,12 @@ Benchmarks were conducted on an Intel Xeon Max CPU, system configuration followi
The benchmark performs node setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the nodes passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts. The timed regions are marked with a yellow background in Figure \ref{fig:benchmark-function}, which shows that we measure three regions during our benchmark: the total time, the time for submission and the time for completion. To get accurate results, the benchmark is repeated multiple times. For small durations we only evaluate the total time, as the individual measurements seemed to contain too many irregularities. At the beginning of each repetition, all running threads synchronize by use of a barrier. The behaviour then differs depending on the selected submission method, which can be a single submission or a batch of given size. This can be seen in Figure \ref{fig:benchmark-function} at the switch statement for \enquote{mode}. Single submission follows the example given in Section \ref{sec:state:dml}, and we therefore do not go into detail explaining it here. Batch submission works differently. A sequence of the specified size is created, to which tasks are then added. This sequence is then submitted to the engine, similar to the submission of a single descriptor. Further steps then follow the example from Section \ref{sec:state:dml} again. \par
\section{Submission Method}
\section{Benchmarks}
In this section we describe three benchmarks, each complete with setup information and preview, followed by plots showing the results and a detailed analysis. We formulate expectations and contrast them with the observations from our measurements. \par
\subsection{Submission Method}
\label{subsec:perf:submitmethod}
With each submission, descriptors must be prepared and sent off to the underlying hardware. This is expected to come with a cost that affects transfer sizes and submission methods differently. By submitting different sizes and comparing batching with single submission, we evaluate which submission method makes sense at which data size. We expect single submission to perform consistently worse, with a more pronounced effect on smaller transfer sizes. We assume this because the overhead of a single submission with the \gls{dsa:swq} is incurred for every iteration, while the batch only sees this overhead once for multiple copies. \par
@ -31,44 +36,69 @@ In Figure \ref{fig:perf-submitmethod} we conclude that with transfers of 1 MiB a
Another limitation may be observed in this result, namely the inherent throughput limit per \gls{dsa} chip of close to 30 GiB/s. This is apparently caused by I/O fabric limitations \cite[p. 5]{intel:analysis}. We therefore conclude that the use of multiple \gls{dsa} is required to fully utilize the available bandwidth of \gls{hbm}, which theoretically lies at 256 GB/s \cite[Table I]{hbm-arch-paper}. \par
\section{Multithreaded Submission}
\subsection{Multithreaded Submission}
\label{subsec:perf:mtsubmit}
As we might encounter access to one \gls{dsa} from multiple threads through the \glsentrylong{dsa:swq} associated with it, determining the effect of this type of access is important. We benchmark multithreaded submission for one, two and twelve threads. Each configuration gets the same 120 copy tasks split across the available threads, all submitting to one \gls{dsa}. We perform this with sizes of 1 MiB and 1 GiB to see whether the behaviour changes with submission size. For smaller sizes, completion may take less time than submission. Smaller task sizes may therefore see different effects of threading, as multiple threads can work to fill the queue and ensure there is never a stall due to it being empty. On the other hand, we might experience lower-than-peak throughput, caused by the synchronization inherent to the \gls{dsa:swq}. \par
\caption{Throughput for different Thread Counts and Sizes}
\label{fig:perf-mtsubmit}
\end{figure}
\todo{write this section}
In Figure \ref{fig:perf-mtsubmit} we see that threading has no negative impact. The synchronization therefore affects single-threaded access in the same way it affects multithreaded access. For the smaller size of 1 MiB our assumption held and performance actually increased with the number of threads, which we attribute to queue usage. We did not find an explanation for the speed difference between sizes, which goes against the rationale that a larger transfer size would lessen the impact of submission time and therefore lead to higher throughput. \par
\begin{itemize}
\item effect of mt-submit, low because \gls{dsa:swq} implicitly synchronized, bandwidth is shared
\item show results for all available core counts
\item only display the 1engine tests
\item show combined total throughput
\item conclude that due to the implicit synchronization the sync-cost also affects 1t and therefore it makes no difference, bandwidth is shared, no guarantees on fairness
\end{itemize}
\subsection{Data Movement from DDR to HBM}
\label{subsec:perf:datacopy}
\todo{write this section}
Moving data from DDR to \gls{hbm} is most relevant to the rest of this work, as it is the target application. As we discovered in Section \ref{subsec:perf:submitmethod}, one \gls{dsa} has a peak bandwidth limit of 30 GiB/s. We write to \gls{hbm} with its theoretical peak of 256 GB/s \cite[Table I]{hbm-arch-paper}. Our top speed is therefore limited by the slower main memory. For each node, the test system is configured with two DIMMs of DDR5-4800. We calculate the theoretical throughput as follows: \(2\ \mathrm{DIMMs} \cdot 4800\ \frac{\mathrm{MT}}{\mathrm{s} \cdot \mathrm{DIMM}} \cdot \frac{64\ \mathrm{bit/transfer}}{8\ \mathrm{bit/byte}} = 76800\ \mathrm{MB/s} \approx 71.5\ \mathrm{GiB/s}\)\todo{how to verify this calculation with a citation? is it even correct because we see close to 100 GiB/s throughput?}. We conclude that to achieve closer-to-peak speeds, a copy task has to be split across multiple \gls{dsa}. \par
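As a rough sketch of this conclusion, let \(B_{\mathrm{DSA}}\) denote the per-\gls{dsa} limit of 30 GiB/s observed in Section \ref{subsec:perf:submitmethod} and \(B_{\mathrm{DDR}}\) the theoretical DDR throughput estimated above. Both symbols are introduced here for illustration only. Assuming that throughput scales linearly with the number of engines, which our measurements do not guarantee, the minimum number \(n\) of \gls{dsa} needed to saturate the source memory is
\[
    n = \left\lceil \frac{B_{\mathrm{DDR}}}{B_{\mathrm{DSA}}} \right\rceil .
\]
For any \(B_{\mathrm{DDR}}\) above 60 and up to 90 GiB/s, this evaluates to \(n = 3\). \par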
\section{Data Movement from DDR to HBM}\label{sec:perf:datacopy}
Two methods of splitting will be evaluated. The first is a brute force approach, utilizing all available \gls{dsa} for any transfer direction. The behaviour of the second depends on the locations of data source and destination. As our system has multiple sockets, communication crossing the socket boundary could experience a latency and bandwidth disadvantage. We argue that for intra-socket transfers, use of the \gls{dsa} from the second socket will have only marginal effect. For transfers crossing the socket, every \gls{dsa} chip will perform badly, and we choose to only use the ones with the fastest possible access, namely the ones on the destination and source node. This might result in lower performance but also uses one fourth of the engines of the brute force approach for inter-socket and half for intra-socket. This gives other threads the chance to utilize these now-free chips. \par
\caption{Xeon Max NUMA-Node Layout \cite[Fig. 14]{intel:maxtuning}}
\label{fig:perf-xeonmaxnuma}
\end{figure}
\begin{itemize}
\item present two copy methods: smart and brute force
\item show graph for ddr->hbm intranode, ddr->hbm intrasocket, ddr->hbm intersocket
\item conclude which option makes more sense (smart)
\item because 4x or 2x utilization for only 1.5x or 1.25x speedup respectively
\item maybe benchmark smart-copy intersocket in parallel with two smart-copies intrasocket VS. the same task with brute force
\end{itemize}
For this benchmark, we copy 1 GiB of data from node 0 to the destination node, using the previously described submission method. For each node utilized, we spawn one thread pinned to it, in charge of submission. We display data for nodes 8, 11, 12 and 15. To understand the selection, we refer to Figure \ref{fig:perf-xeonmaxnuma}, which shows the configured system's node IDs and the storage technology they represent. Node 8 accesses the \gls{hbm} on node 0 and is therefore the physically closest destination possible. Node 11 is located diagonally on the chip and is therefore the furthest intra-socket operation benchmarked. Nodes 12 and 15 lie diagonally on the second socket's CPU and should therefore be representative of inter-socket transfer operations. \par
\caption{Throughput for smart copy from DDR to HBM}
\label{fig:perf-peak-smart}
\end{figure}
From the brute force approach in Figure \ref{fig:perf-peak-brute}, we observe peak speeds of 96 GiB/s when copying across the socket from Node 0 to Node 15. This invalidates our assumption that peak bandwidth would be achieved in the intra-socket scenario \todo{find out why maybe?} and goes against the calculated peak throughput of the memory on our system \todo{calculation wrong? other factors?}. While using significantly more resources, the brute force copy shown in Figure \ref{fig:perf-peak-brute} outperforms the smart approach from Figure \ref{fig:perf-peak-smart}. By utilizing all available \gls{dsa}, we observe an increase in transfer speed of 2 GiB/s for the copy to Node 8, 18 GiB/s for Nodes 11 and 12, and 30 GiB/s for Node 15. From this we conclude that the smart copy assignment is worth using. Even though it lowers the peak performance of one copy, the possibility of completing a second intra-socket copy at the same speed in parallel makes up for this loss in our eyes. \par
\subsection{Data Movement using CPU}
\todo{perform test and write this subsection}
\section{Analysis}
\todo{write this section}
In this section we summarize the conclusions drawn from the three benchmarks performed in the sections above and outline a utilization guideline. We also compare CPU and \gls{dsa} for the task of copying data from DDR to \gls{hbm}. \par
\begin{enumerate}
\item From \ref{subsec:perf:submitmethod} we conclude that small copies under 1 MiB in size require batching and still do not reach peak performance. Datum size should therefore be at or above 1 MiB if possible.
\item Subsection \ref{subsec:perf:mtsubmit} shows that access from multiple threads does not negatively affect performance when using the \glsentrylong{dsa:swq} for work submission.
\item In \ref{subsec:perf:datacopy}, we chose to use the presented smart copy methodology to split copy tasks across multiple \gls{dsa} chips to achieve low utilization with acceptable performance.
\end{enumerate}
\todo{is this enumeration acceptable?}
\begin{itemize}
\item summarize the conclusions and define the point at which dsa makes sense
\item minimum transfer size for batch/nonbatch operation
\item effect of mtsubmit -> no fairness guarantees
\item usage of multiple engines -> no effect
\item smart copy method as the middle-ground between peak throughput and utilization
\item lower utilization of dsa is good when it will be shared between threads/processes