\caption{Xeon Max NUMA-node layout \cite[Fig. 14]{intel:maxtuning} for a 2-socket system when configured with HBM-Flat. Separate node IDs are exposed for manual HBM access and for the cores and DDR memory.}
\label{fig:perf-xeonmaxnuma}
\end{figure}
For this benchmark, we copy 1 Gibibyte of data from node 0 to the destination node. \par
\caption{Throughput for brute force copy from DDR to HBM, using all available DSA and copying 1 GiB from Node 0 to the destination node specified on the x-axis. This represents the peak throughput achievable with DSA.}
\label{fig:perf-peak-brute}
\end{figure}
\caption{Throughput for smart copy from DDR to HBM, using the four on-socket DSA for intra-socket operation and only the DSA on the source and destination node for inter-socket operation, copying 1 GiB from Node 0 to the destination node specified on the x-axis. This shows the more conservative performance of the resource-efficient assignment.}
\label{fig:perf-peak-smart}
\end{figure}
From the brute force approach in Figure \ref{fig:perf-peak-brute}, we observe peak speeds of 96 GiB/s when copying across sockets from Node 0 to Node 15. This invalidates our assumption that peak bandwidth would be achieved in the intra-socket scenario \todo{find out why maybe?} and contradicts the calculated peak throughput of the memory on our system \todo{calculation wrong? other factors?}. The results do, however, align with the findings in \cite[Fig. 10]{intel:analysis}. \par
While using significantly more resources, the brute force copy shown in Figure \ref{fig:perf-peak-brute} outperforms the smart approach from Figure \ref{fig:perf-peak-smart}. By utilizing all available \gls{dsa}, we observe an increase in transfer speed of 2 GiB/s for the copy to Node 8, 18 GiB/s for Nodes 11 and 12, and 30 GiB/s for Node 15. The smart approach, however, could accommodate a second intra-socket copy on the other socket, which we assume would complete at the same speed without negative impact. From this we conclude that the smart copy assignment is worth using, as it provides better scalability. \par
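To make the assignment logic of the smart copy more tangible, the following sketch outlines how such a split copy could be issued. It is a minimal illustration only, assuming that the \gls{dsa} is driven through the Intel DML C++ library and that its hardware path dispatches work to an engine on the NUMA node of the submitting thread; the helper \texttt{copy\_chunk\_on\_node}, the even chunking and the thread handling are simplifications and not the exact benchmark code.
\begin{lstlisting}[language=C++]
// Illustrative sketch of the smart copy assignment (not the exact benchmark code).
// Assumes Intel DML (libdml) and libnuma; the hardware path is assumed to dispatch
// to a DSA engine on the NUMA node of the submitting thread.
#include <dml/dml.hpp>
#include <numa.h>
#include <thread>
#include <vector>
#include <cstdint>
#include <cstddef>

// Submit one chunk to a DSA engine by binding the submitting thread to the
// compute node hosting the desired engine (the node must have CPU cores).
static void copy_chunk_on_node(int dsa_node, const uint8_t* src, uint8_t* dst, std::size_t size) {
    numa_run_on_node(dsa_node);                 // run on the node hosting the target DSA
    auto result = dml::execute<dml::hardware>(  // offload the chunk to that DSA engine
        dml::mem_copy,
        dml::make_view(src, size),
        dml::make_view(dst, size));
    (void)result;  // a real benchmark would check result.status == dml::status_code::ok
}

// Split a copy of 'size' bytes evenly across the DSA engines hosted on 'dsa_nodes'.
void smart_copy(const uint8_t* src, uint8_t* dst, std::size_t size, const std::vector<int>& dsa_nodes) {
    const std::size_t chunk = size / dsa_nodes.size();  // remainder handling omitted
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < dsa_nodes.size(); ++i) {
        workers.emplace_back(copy_chunk_on_node, dsa_nodes[i],
                             src + i * chunk, dst + i * chunk, chunk);
    }
    for (auto& w : workers) {
        w.join();
    }
}
\end{lstlisting}
In this sketch, \texttt{dsa\_nodes} would contain the four on-socket nodes for an intra-socket copy and, for the inter-socket case, only the nodes hosting the engines associated with the source and the destination. \par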
\subsection{Data Movement using CPU}
For evaluating CPU copy performance, we selected two approaches. For Figure \ref{fig:perf-cpu-peak-smart}, we used the smart copy procedure from \ref{subsec:perf:datacopy}, only selecting the software path and leaving the test otherwise unchanged. For Figure \ref{fig:perf-cpu-peak-brute}, the test was adapted to spawn 12 threads on Node 0. \par
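The adapted test roughly corresponds to the sketch below. It is an illustration under the assumption that libnuma is used for memory placement and thread binding; timing, error handling and the helper name \texttt{cpu\_brute\_copy} are ours and not taken from the benchmark code.
\begin{lstlisting}[language=C++]
// Illustrative sketch of the adapted CPU copy test (not the exact benchmark code).
// 12 worker threads are bound to Node 0 and each copies its slice via memcpy.
#include <numa.h>
#include <thread>
#include <vector>
#include <cstring>
#include <cstdint>
#include <cstddef>

// Copy 'size' bytes from DDR on Node 0 to the HBM of 'dst_node' using
// 'thread_count' CPU threads that are all bound to Node 0.
void cpu_brute_copy(int dst_node, std::size_t size, int thread_count = 12) {
    auto* src = static_cast<uint8_t*>(numa_alloc_onnode(size, 0));        // source in DDR of Node 0
    auto* dst = static_cast<uint8_t*>(numa_alloc_onnode(size, dst_node)); // destination in HBM node
    std::memset(src, 0xAB, size);  // touch the source pages so they are placed on Node 0

    const std::size_t slice = size / thread_count;  // remainder handling omitted
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([=] {
            numa_run_on_node(0);                                  // keep the worker on Node 0
            std::memcpy(dst + t * slice, src + t * slice, slice); // copy this thread's slice
        });
    }
    for (auto& w : workers) {
        w.join();
    }

    numa_free(src, size);
    numa_free(dst, size);
}
\end{lstlisting}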
\caption{Throughput from DDR to HBM on the CPU, using 12 threads spawned on Node 0 and copying 1 GiB from Node 0 to the destination node specified on the x-axis.}
\label{fig:perf-cpu-peak-brute}
\end{figure}
The benchmark resulting in Figure \ref{fig:perf-cpu-peak-brute} fully utilizes Node 0 of the test system by spawning 12 threads on it. We attribute the low performance of the inter-socket copy operations to the overhead of the cross-socket interconnect. The slight difference between Nodes 12 and 15 may be explained with Figure \ref{fig:perf-xeonmaxnuma}, which shows that Node 12 is physically closer to Node 0 than Node 15. \par
\caption{Throughput from DDR to HBM on the CPU, using the exact same code as the smart copy and therefore spawning one thread each on Nodes 0, 1, 2 and 3, copying 1 GiB from Node 0 to the destination node specified on the x-axis. This illustrates the low performance of the software path when the code is not adapted to it.}
\label{fig:perf-cpu-peak-smart}
\end{figure}
Comparing the results in Figure \ref{fig:perf-cpu-peak-smart} with the data from the adapted test displayed in Figure \ref{fig:perf-cpu-peak-brute}, we conclude that the software path mainly serves compatibility. Its throughput is comparatively low and increasing it requires code changes, which leads us to not recommend its use. \par
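To illustrate why switching paths requires no structural code changes in the first place, the following snippet shows how the execution path is selected, assuming the Intel DML C++ library is used; the wrapper \texttt{copy\_with\_path} is ours.
\begin{lstlisting}[language=C++]
// Illustration of execution-path selection in the Intel DML C++ API (libdml).
// The operation is identical; only the path template parameter changes.
#include <dml/dml.hpp>
#include <cstdint>
#include <cstddef>

template <typename execution_path>
dml::status_code copy_with_path(const uint8_t* src, uint8_t* dst, std::size_t size) {
    auto result = dml::execute<execution_path>(dml::mem_copy,
                                               dml::make_view(src, size),
                                               dml::make_view(dst, size));
    return result.status;
}

// copy_with_path<dml::hardware>(...) offloads the transfer to a DSA engine,
// copy_with_path<dml::software>(...) performs it on the submitting CPU core.
\end{lstlisting}
Since the software path executes the operation on the submitting core, its throughput remains low unless the surrounding code is adapted to spawn more threads, which is consistent with Figure \ref{fig:perf-cpu-peak-smart}. \par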
\section{Analysis}
In this section, we summarize the conclusions drawn from the three benchmarks performed.
\item In \ref{subsec:perf:datacopy}, we chose to use the presented smart copy methodology to split copy tasks across multiple \gls{dsa} chips to achieve low utilization with acceptable performance.
\end{enumerate}
We again refer to Figures \ref{fig:perf-peak-brute} and \ref{fig:perf-cpu-peak-brute}, which represent the maximum throughput achieved by utilizing the \gls{dsa} and the CPU, respectively. Noticeably, the \gls{dsa} does not seem to suffer from the inter-socket overhead in the way the CPU does. On the contrary, we see the highest throughput when copying across sockets \todo{explanation?}. In any case, the \gls{dsa} outperforms the CPU for transfer sizes over 1 MiB, demonstrating its potential to increase throughput while at the same time freeing up CPU cycles. \par