@ -95,48 +95,75 @@ Using the results from the previous calculations, we are now able to calculate t
\[9600\ \mathrm{MT/s} \cdot 8\ \mathrm{B/T} = 75\ \mathrm{GiB/s}\]
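As a note on units, and assuming that MT denotes $10^{6}$ transfers, the same product can be written out with decimal and binary prefixes:
\[9600 \cdot 10^{6}\ \mathrm{T/s} \cdot 8\ \mathrm{B/T} = 76.8 \cdot 10^{9}\ \mathrm{B/s} = 76.8\ \mathrm{GB/s} \approx 71.5\ \mathrm{GiB/s}\]
The value of 75 is obtained when $76800$ is divided by $1024$. \par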
We conclude that to achieve peak speeds, a copy task has to be split across multiple \gls{dsa}s. Two methods of splitting will be evaluated. The first employs a brute force approach, utilizing all available resources for any transfer direction. The second method's behaviour depends on the data source and destination locations. Given that our system consists of multiple sockets, communication crossing between sockets could introduce latency and bandwidth disadvantages \cite{bench:heterogeneous-communication}. We posit that for intra-socket transfers, utilizing the \gls{dsa} from the second socket will have only a marginal effect. For transfers crossing sockets, we assume every \gls{dsa} performs equally worse, prompting us to use only the ones on the destination and source nodes for them being the physically closest to both memory regions. While this choice may result in lower performance, it uses only one-fourth of the engines of the brute force approach for inter-socket transfers and half for intra-socket transfers. This approach also frees up additional chips for other threads to utilize.\par
We conclude that to achieve peak speeds, a copy task has to be split across multiple \gls{dsa}s. Different methods of splitting will be evaluated. Given that our system consists of multiple sockets, communication crossing between sockets could introduce latency and bandwidth disadvantages \cite{bench:heterogeneous-communication}. \par
\todo{update, add push-pull description}
To determine the optimal number of \gls{dsa}s, we will measure throughput for one, two, four, and eight \gls{dsa}s participating in the copy operations. We name the configuration with two \gls{dsa}s \enquote{Push-Pull}, as it utilizes the accelerators found on the data source and destination nodes. As eight \gls{dsa}s is the maximum available on our system, this configuration will be referred to as \enquote{Brute-Force}. \par
For this benchmark, we transfer 1 Gibibyte of data from node 0 to the destination node. We present data for nodes 8, 11, 12, and 15. To understand the selection, refer to Figure \ref{fig:perf-xeonmaxnuma}, which illustrates the node IDs of the configured systems and the corresponding storage technology. Node 8 accesses the \gls{hbm} on node 0, making it the physically closest possible destination. Node 11 is located diagonally on the chip, representing the farthest intra-socket operation benchmarked. Nodes 12 and 15 lie diagonally on the second socket's CPU, making them representative of inter-socket transfer operations. \par
For this benchmark, we transfer 1 Gibibyte of data from node 0 to the destination node, employing the submission method previously described. For each utilized node, we spawn one pinned thread responsible for submission. We present data for nodes 8, 11, 12, and 15. To understand the selection, refer to Figure \ref{fig:perf-xeonmaxnuma}, which illustrates the node IDs of the configured systems and the corresponding storage technology. Node 8 accesses the \gls{hbm} on node 0, making it the physically closest possible destination. Node 11 is located diagonally on the chip, representing the farthest intra-socket operation benchmarked. Nodes 12 and 15 lie diagonally on the second socket's CPU, making them representative of inter-socket transfer operations. \par
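As a rough illustration, assume that the 1 GiB transfer is divided evenly among the $n$ participating \gls{dsa}s. This even split is an assumption made for the example only and not a statement about the implementation. Each accelerator then copies
\[\frac{2^{30}\ \mathrm{B}}{n} = 1\ \mathrm{GiB},\ 512\ \mathrm{MiB},\ 256\ \mathrm{MiB},\ 128\ \mathrm{MiB} \quad \mathrm{for}\ n = 1, 2, 4, 8.\]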
\caption{Smart Assignment: using four on-socket \glsentryshort{dsa} for intra-socket and the \glsentryshort{dsa} on source and destination \glsentryshort{numa:node} for inter-socket.}
\caption{Two \gls{dsa}s, or \enquote{Push-Pull}: using the \glsentryshort{dsa} on the source and destination \glsentryshort{numa:node}, except for intra-node copies, which use the on-node and one off-node \glsentryshort{dsa}.}
\caption{Push-Pull: using the \glsentryshort{dsa} on source and destination \glsentryshort{numa:node} except intra node using two off-node but on-socket \gls{dsa}.}
\caption{Eight \gls{dsa}s or \enquote{Brute-Force}: using all available \glsentryshort{dsa}, irrespective of source and destination locations.}
\label{fig:perf-dsa:8}
\end{subfigure}
\caption{Copy from \glsentryshort{numa:node} 0 to the destination \glsentryshort{numa:node} specified on the x-axis. Shows peak throughput achievable with \glsentryshort{dsa} for different load balancing techniques.}
\label{fig:perf-dsa}
\end{figure}
We first look at behaviour common for the three load balancing techniques in Figure \ref{fig:perf-dsa}. The real world peak throughput reaches close to 64 GiB/s, which, as we will see in Section \ref{subsec:perf:cpu-datacopy}, aligns with what can be achieved in select scenarios with the CPU. Additionally, \gls{numa:node} 8 performs worse than copying to \gls{numa:node} 11, leading us to believe that the \gls{dsa} here encounters some shared data paths, as \gls{numa:node} 8 accesses the \gls{hbm} on \gls{numa:node} 0, from which the data originates. Another interesting observation is that, contrary to our assumption, the physically more distant (from data origin \gls{numa:node} 0) \gls{numa:node} 15 reaches higher throughput than the closer \gls{numa:node} 12. We lack an explanation for this anomaly and will further examine the behaviour in the analysis of the CPU throughput results in Section \ref{subsec:perf:cpu-datacopy}. \par
We begin by examining the behaviour common to the load balancing techniques depicted in Figure \ref{fig:perf-dsa}. The real-world peak throughput approaches 64 GiB/s, aligning with the maximum achievable with the CPU, as demonstrated in Section \ref{subsec:perf:cpu-datacopy}. In Figure \ref{fig:perf-dsa:1}, a hard bandwidth limit just below 30 GiB/s is observed, reinforcing what was encountered in Section \ref{subsec:perf:submitmethod}: a single \gls{dsa} is constrained by I/O-Fabric limitations. \par
From the results of the brute force approach illustrated in Figure \ref{fig:perf-dsa:allnodes}, we observe peak speeds of close to 64 GiB/s when copying across sockets from \gls{numa:node} 0 to \gls{numa:node} 15. This contradicts our assumption that peak bandwidth would be limited by the interconnect. However, for intra node copies, there exists an observable penalty for using the off-socket \gls{dsa}s. When comparing with push-pull in Figure \ref{fig:perf-dsa:pushpull}, performance actually decreases by utilizing four times more resources over a longer duration. As the brute force approach is still slightly faster than push-pull and smart-assignment, utilizing more resources still yields some gains, although far from linear. Therefore, we conclude that, although data movement across the interconnect incurs additional cost, no hard bandwidth limit is observable. \par
Unexpected throughput differences are evident for all configurations except the bandwidth-bound single \gls{dsa}. Notably, copying to \gls{numa:node} 8 performs worse than copying to \gls{numa:node} 11. As \gls{numa:node} 8 serves as the \gls{hbm} accessor for the data source \gls{numa:node}, it should have the shortest data path. This suggests that the \gls{dsa} may suffer from sharing parts of the data path for reading and writing. Another interesting observation is that, contrary to our assumption, the physically more distant \gls{numa:node} 15 achieves higher throughput than the closer \gls{numa:node} 12. We lack an explanation for this anomaly and will examine this behaviour further in the analysis of the CPU throughput results in Section \ref{subsec:perf:cpu-datacopy}. \par
Smart \gls{numa:node} assignment results in peak performance observable for intra socket data movement, see Figure \ref{fig:perf-dsa:smart}. Operations crossing the socket boundary, namely to \gls{numa:node}s 12 and 15, are slightly slower than the peak observed for brute force assignment, visible in Figure \ref{fig:perf-dsa:allnodes}. At the same time, these use one fourth of the available resources, again showing less-than-linear scaling of throughput with the amount of participating \gls{dsa}s. \par
For the results of the Brute-Force approach illustrated in Figure \ref{fig:perf-dsa:8}, we observe peak speeds when copying across sockets from \gls{numa:node} 0 to \gls{numa:node} 15. This contradicts our assumption that peak bandwidth would be limited by the interconnect. However, for intra-node copies, there is an observable penalty for using the off-socket \gls{dsa}s. We will analyse this behaviour by comparing the different benchmarked configurations and summarize our findings on scalability. \par
\todo{evaluate pushpull here when benchmark is done}
\caption{Scaling factor for different numbers of participating \gls{dsa}s. Determined by the formula \\\(\frac{Throughput}{Baseline\ Throughput}\cdot\frac{1}{Utilization\ Factor}\)\\ with the baseline being the throughput for one \gls{dsa} and the utilization factor being the number of \gls{dsa}s used relative to the baseline.}
\label{fig:perf-dsa-analysis:scaling}
\end{subfigure}
\caption{Scalability Analysis for different amounts of participating \gls{dsa}s. Displays the average throughput and the derived scaling factor. Shows that, although the throughput does increase with adding more accelerators, beyond two, the gained speed drops significantly. Calculated over the results from Figure \ref{fig:perf-dsa} and therefore applies to copies from \glsentryshort{dram} to \glsentryshort{hbm}.}
\label{fig:perf-dsa-analysis}
\end{figure}
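To make the scaling factor from Figure \ref{fig:perf-dsa-analysis:scaling} concrete, it can be written with $T(n)$ denoting the throughput achieved with $n$ \gls{dsa}s, a notation introduced here only for illustration:
\[S(n) = \frac{T(n)}{T(1)} \cdot \frac{1}{n}\]
A value of $S(n) = 1$ corresponds to linear scaling. As a rough example, we use the rounded peak values quoted in the text (just below 30 GiB/s for one \gls{dsa} and close to 64 GiB/s for eight) instead of the averaged data underlying the figure:
\[S(8) \approx \frac{64\ \mathrm{GiB/s}}{30\ \mathrm{GiB/s}} \cdot \frac{1}{8} \approx 0.27\]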
When comparing the Brute-Force approach with Push-Pull in Figure \ref{fig:perf-dsa:2}, performance for intra-node copies decreases, even though Brute-Force utilizes four times more resources and occupies them over a longer duration. As shown in Figure \ref{fig:perf-dsa-analysis:scaling}, using Brute-Force still leads to a slight increase in overall throughput, although far from scaling linearly. Therefore, we conclude that, although data movement across the interconnect incurs additional cost, no hard bandwidth limit is observable. \par
While consuming significantly more resources, the brute force copy depicted in Figure \ref{fig:perf-dsa:allnodes} surpasses the performance of the smart approach shown in Figure \ref{fig:perf-dsa:smart}. We observe an increase in transfer speed by utilizing all available \gls{dsa}, achieving 2 GiB/s for copying to \gls{numa:node} 8, 18 GiB/s for \gls{numa:node}s 11 and 12, and 30 GiB/s for \gls{numa:node} 15. The smart approach could accommodate another intra-socket copy on the second socket, we assume, without observing negative impacts. From this, we conclude that the smart copy assignment is worth using, as it provides better scalability. \par
From the average throughput and scaling factors in Figure \ref{fig:perf-dsa-analysis}, it becomes evident that splitting tasks over more than two \gls{dsa}s yields only marginal gains. This could be due to increased congestion of the overall interconnect. However, as no hard limit is encountered, this explanation remains tentative. \par
\todo{update when pushpull is completed}
The choice of a load balancing method is not trivial. If the peak throughput of a single task is what matters, Brute-Force for inter-socket and four \gls{dsa}s for intra-socket operation result in the fastest transfers. At the same time, these cause high system utilization, making them unsuitable for situations where multiple tasks may be submitted. For these cases, Push-Pull achieves performance close to the real-world peak while not wasting resources due to poor scaling, as shown in Figure \ref{fig:perf-dsa-analysis:scaling}. \par
\subsection{Data Movement using CPU}
\label{subsec:perf:cpu-datacopy}
@ -147,14 +174,14 @@ For evaluating CPU copy performance we use the benchmark code from the subsequen
\caption{Colleague's CPU peak throughput benchmark \cite{xeonmax-peakthroughput} results.}
\label{fig:perf-cpu:andrepeak}
\end{subfigure}
@ -164,7 +191,7 @@ For evaluating CPU copy performance we use the benchmark code from the subsequen
As evident from Figure \ref{fig:perf-cpu:swpath}, the observed throughput of the software path is less than half of the theoretical bandwidth. Therefore, the software path is to be treated as a compatibility measure and not as a means of providing high-performance data copy operations. However, the software path is the only benchmark that performs as expected for transfers to \gls{numa:node}s 12 and 15, with the latter performing worse. Based on the layout in Figure \ref{fig:perf-xeonmaxnuma}, we assumed in the previous section that \gls{numa:node} 12 would outperform \gls{numa:node} 15 due to the lower physical distance. This assumption was invalidated by the \gls{dsa} results, making the result for the CPU unexpected in this case. \par
In Figure \ref{fig:perf-cpu:andrepeak}, peak throughput is achieved for intranode operation. This validates the assumption that there is a cost for communicating across sockets, which was not as directly observable with the \gls{dsa}. The same disadvantage for \gls{numa:node} 12, as seen in Section \ref{subsec:perf:datacopy} can be observed in Figure \ref{fig:perf-cpu:andrepeak}. As the results from software path do not exhibit this, the anomaly seems to only occur in bandwidth-saturating scenarios. \par
In Figure \ref{fig:perf-cpu:andrepeak}, peak throughput is achieved for intra-node operation. This validates the assumption that there is a cost for communicating across sockets, which was not as directly observable with the \gls{dsa}. The same disadvantage for \gls{numa:node} 12 as seen in Section \ref{subsec:perf:datacopy} can be observed in Figure \ref{fig:perf-cpu:andrepeak}. As the results from the software path do not exhibit this, the anomaly seems to occur only in bandwidth-saturating scenarios. \par
\section{Analysis}
\label{sec:perf:analysis}
@ -174,13 +201,15 @@ In this section we summarize the conclusions drawn from the three benchmarks per
\begin{itemize}
\item From Section \ref{subsec:perf:submitmethod} we conclude that small copies under 1 MiB in size require batching and still do not reach peak performance. Tasks should therefore be at or above 1 MiB in size; smaller copies should be handled by the CPU.
\item Section \ref{subsec:perf:mtsubmit} shows that access from multiple threads does not negatively affect performance when using the \glsentrylong{dsa:swq} for work submission. Due to the lack of \glsentrylong{dsa:dwq} support, we have no data to determine the cost of submission to the \gls{dsa:swq}.
\item In \ref{subsec:perf:datacopy}, we chose to use the presented smart copy methodology to split copy tasks across multiple \gls{dsa} chips to achieve low utilization with acceptable performance. \todo{evaluate after pushpull results}
\item In Section \ref{subsec:perf:datacopy}, we found that using more than two \gls{dsa}s results in only marginal gains. We therefore choose the Push-Pull configuration as load balancer, as it achieves fair throughput with low utilization.
\end{itemize}
Once again, we refer to Figures \ref{fig:perf-dsa} and \ref{fig:perf-cpu}, which show the maximum throughput achieved with the \gls{dsa} and the CPU, respectively. Noticeably, the \gls{dsa} does not seem to suffer from inter-socket overhead like the CPU. In any case, the \gls{dsa} performs similarly to the CPU, demonstrating potential for faster data movement while simultaneously freeing up CPU cycles for other tasks. \par
We discovered an anomaly for \gls{numa:node} 12 for which we did not find an explanation. As the behaviour is also exhibited by the CPU, discovering the root issue falls outside the scope of this work. \par
Scaling was found to be less than linear, and no conclusive explanation was found for this either. We assume that it is caused by increased congestion of the interconnect. \par
Even though we could not find an explanation for all measurements, this chapter still gives insight into the performance of the \gls{dsa}, its strengths and its weaknesses. It provides data-driven guidance on a complex architecture, helping to find the optimum for applying the \gls{dsa} to our expected and possibly different workloads. \par