adapt chapter on performance to the corrected measurements

11 months ago · 18ec95e201
3 changed files with 115 additions and 60 deletions
--- a/thesis/bachelor.pdf
+++ b/thesis/bachelor.pdf
--- a/thesis/content/30_performance.tex
+++ b/thesis/content/30_performance.tex
@ -8,7 +8,7 @@ The performance of \gls{dsa} has been evaluated in great detail by Reese Kuper e
 \section{Benchmarking Methodology}
 \label{sec:perf:method}

-\begin{figure}[ht]
+\begin{figure}[h!tb]
    \centering
    \includegraphics[width=0.9\textwidth]{images/xeonmax-numa-layout.png}
    \caption{Xeon Max \glsentrylong{node} Layout \cite[Fig. 14]{intel:maxtuning} for a 2-Socket System when configured with HBM-Flat. Showing separate \glsentryshort{numa:node} IDs for manual \glsentryshort{hbm} access and for Cores and \glsentryshort{dram}.}
@ -19,29 +19,27 @@ The benchmarks were conducted on a dual-socket server equipped with two Intel Xe

 As \gls{intel:dml} does not have support for \glsentryshort{dsa:dwq}s, we run benchmarks exclusively with access through \glsentryshort{dsa:swq}s. The application written for the benchmarks can be obtained in source form under the directory \texttt{benchmarks} in the thesis repository \cite{thesis-repo}. \par

-\begin{figure}[ht]
+The benchmark performs node setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the nodes passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts. We therefore also do not use \enquote{.block\_on\_fault()}, as we did for the memcpy-example in Section \ref{sec:state:dml}. \par
+
+Timing in the outer loop may report a lower throughput than was actually achieved. This happens when one of the \gls{dsa}s participating in a given task finishes earlier than the others. For these cases, we measure the maximum time and therefore the minimum throughput, as the benchmarks are meant to represent the peak achievable when distributing one task over multiple engines, not when executing multiple disjoint tasks. A task can only be considered complete when all of its subtasks are completed, which the minimum throughput reflects. This may give an advantage to the peak CPU throughput benchmark we reference later on, as it is not subject to this restriction. \par
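The effect of measuring the maximum time can be illustrated with a minimal sketch. It is not taken from the benchmark sources; the function name and the per-engine durations are hypothetical and only serve to show how the slowest engine determines the reported throughput.

```python
def task_throughput_gib_s(total_bytes, engine_seconds):
    """Throughput of one task that was split across several engines.

    The task only counts as complete once the slowest engine has
    finished, so the longest duration determines the throughput.
    """
    if not engine_seconds:
        raise ValueError("at least one engine duration is required")
    slowest = max(engine_seconds)
    return total_bytes / slowest / 2**30


# Hypothetical durations (in seconds) of four engines that together copy 1 GiB.
durations = [0.016, 0.018, 0.020, 0.017]
print(f"{task_throughput_gib_s(2**30, durations):.1f} GiB/s")  # prints "50.0 GiB/s"
```

Even though three of the four engines finish earlier, the reported value is governed by the 0.020 s of the slowest one.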
+
+\begin{figure}[h!tb]
    \centering
-    \begin{subfigure}{0.3\textwidth}
-        \includegraphics[width=\textwidth]{images/nsd-benchmark.pdf}
-        \caption{Outer benchmark function.}
-        \label{fig:bench:outer}
-    \end{subfigure}
-    \hfill
-    \begin{subfigure}{0.7\textwidth}
-        \includegraphics[width=\textwidth]{images/nsd-benchmark-inner.pdf}
-        \caption{Inner benchmark function.}
-        \label{fig:bench:inner}
-    \end{subfigure}
-    \hfill  
-    \caption{Benchmark Procedure Pseudocode. Timing marked with yellow background. Showing location preparation and launch in (a) and work submission in (b).}
-    \label{fig:benchmark-function}
+    \includegraphics[width=0.35\textwidth]{images/nsd-benchmark.pdf}
+    \caption{Outer Benchmark Procedure Pseudocode. Timing marked with yellow background. Showing preparation of memory locations, clearing of cache entries, timing points and synchronized benchmark launch.}
+    \label{fig:benchmark-function:outer}
 \end{figure}

-The benchmark performs node setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the nodes passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts. We therefore also do not use \enquote{.block\_on\_fault()}, as we did for the memcpy-example in Section \ref{sec:state:dml}. \par
+To get accurate results, the benchmark is repeated 10 times. Each iteration is timed from beginning to end, marked in yellow in Figure \ref{fig:benchmark-function:outer}. For small task sizes, the iterations complete in a very short amount of time, which can have adverse effects on the results \cite{time-longer-sections}. Therefore, we repeat the code of the inner loop a configurable number of times, artificially extending the duration of a single iteration for these cases. \par

-To get accurate results, the benchmark is repeated 10 times. Each iteration is timed from beginning to end, marked by yellow in Figure \ref{fig:bench:outer}. For small task sizes, the iterations complete in a very short amount of time, which can have adverse effects on the results \cite{time-longer-sections}. Therefore, we repeat the code of the inner loop for a configurable amount, virtually extending the duration of a single iteration for these cases. \par
+\begin{figure}[h!tb]
+    \centering
+    \includegraphics[width=1.0\textwidth]{images/nsd-benchmark-inner.pdf}
+    \caption{Inner Benchmark Procedure Pseudocode. Showing work submission for single and batch submission.}
+    \label{fig:benchmark-function:inner}
+\end{figure}

-For all \gls{dsa}s used in the benchmark, a submission thread executing the inner benchmark routine is spawned. The launch is synchronized by use of a barrier for each iteration. The behaviour in the inner function then differs depending on the submission method selected which can be a single submission or a batch of given size. This can be seen in Figure \ref{fig:bench:inner} at the switch statement for \enquote{mode}. Single submission follows the example given in Section \ref{sec:state:dml}, and we therefore do not go into detail explaining it here. Batch submission works unlike the former. A sequence with specified size is created which tasks are then added to. This sequence is submitted to the engine similar to the submission of a single descriptor. \par
+For each \gls{dsa} used in the benchmark, a submission thread executing the inner benchmark routine is spawned. The launch is synchronized by a barrier for each iteration. The behaviour of the inner function then depends on the selected submission method, which can be a single submission or a batch of a given size. This can be seen in Figure \ref{fig:benchmark-function:inner} at the switch statement for \enquote{mode}. Single submission follows the example given in Section \ref{sec:state:dml}, and we therefore do not explain it in detail here. Batch submission works differently: a sequence of the specified size is created, to which tasks are then added. This sequence is submitted to the engine similarly to the submission of a single descriptor. \par
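The barrier-synchronized launch and the repeated inner loop can be sketched as follows. This is an illustration of the structure only, not the benchmark code from the thesis repository: the function names are hypothetical, and a plain in-memory buffer copy stands in for the submission to a DSA.

```python
import threading
import time


def run_iteration(num_engines, inner_repetitions, work):
    """Run one timed iteration with one submission thread per engine."""
    barrier = threading.Barrier(num_engines)
    durations = [0.0] * num_engines

    def submitter(index):
        barrier.wait()                      # synchronized launch of all threads
        start = time.perf_counter()
        for _ in range(inner_repetitions):  # lengthen short timed sections
            work(index)
        durations[index] = time.perf_counter() - start

    threads = [threading.Thread(target=submitter, args=(i,))
               for i in range(num_engines)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return durations


# Stand-in workload: copy a 1 MiB buffer instead of submitting to a DSA.
source = bytes(1 << 20)
results = [run_iteration(4, 8, lambda _: bytearray(source)) for _ in range(10)]
```

Each entry of `results` holds the per-thread durations of one of the 10 iterations; the longest duration per iteration would then be used to compute the throughput.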

 \section{Benchmarks}
 \label{sec:perf:bench}
@ -55,9 +53,9 @@ With each submission, descriptors must be prepared and sent to the underlying ha

 We anticipate that single submissions will consistently yield poorer performance, particularly with a pronounced effect on smaller transfer sizes. This expectation arises from the fact that the overhead of a single submission with the \gls{dsa:swq} is incurred for every iteration, whereas the batch experiences this overhead only once for multiple copies. \par

-\begin{figure}[ht]
+\begin{figure}[h!tb]
    \centering
-    \includegraphics[width=0.7\textwidth]{images/plot-submitmethod.pdf}
+    \includegraphics[width=0.5\textwidth]{images/plot-submitmethod.pdf}
    \caption{Throughput for different Submission Methods and Sizes. Performing a copy with source and destination being node 0, executed by the \glsentryshort{dsa} on node 0. Observable is the submission cost which affects small transfer sizes differently, as there the completion time is lower.}
    \label{fig:perf-submitmethod}
 \end{figure}
@ -69,11 +67,11 @@ Another limitation may be observed in this result, namely the inherent throughpu
 \subsection{Multithreaded Submission}
 \label{subsec:perf:mtsubmit}

-As we might encounter access to one \gls{dsa} from multiple threads through the associated \glsentrylong{dsa:swq}, understanding the impact of this type of access is crucial. We benchmark multithreaded submission for one, two, and twelve threads - the latter representing the core count of one processing sub-node on the test system. Each configuration is assigned the same 120 copy tasks split across the available threads, all submitting to one \gls{dsa}. We perform this benchmark with sizes of 1 MiB and 1 GiB to examine, if the behaviour changes with submission size. For smaller sizes, the completion time may be faster than submission time, leading to potentially different effects of threading due to the fact that multiple threads work to fill the queue, preventing task starvation. We may also experience lower-than-peak throughput with rising thread count, caused by the synchronization inherent with \gls{dsa:swq}. \par
+As we might encounter access to one \gls{dsa} from multiple threads through the associated \glsentrylong{dsa:swq}, understanding the impact of this type of access is crucial. We benchmark multithreaded submission for one, two, and twelve threads, with the latter representing the core count of one processing sub-node on the test system. We spawn multiple threads, all submitting to one \gls{dsa}. Furthermore, we perform this benchmark with sizes of 1 MiB and 1 GiB to examine whether the behaviour changes with submission size. For smaller sizes, completion may take less time than submission, so threading may have a different effect: multiple threads work to fill the queue, preventing task starvation. We may also experience lower-than-peak throughput with rising thread count, caused by the synchronization inherent to the \gls{dsa:swq}. \par

-\begin{figure}[ht]
+\begin{figure}[h!tb]
    \centering
-    \includegraphics[width=0.7\textwidth]{images/plot-mtsubmit.pdf}
+    \includegraphics[width=0.5\textwidth]{images/plot-mtsubmit.pdf}
    \caption{Throughput for different Thread Counts and Sizes. Multiple threads submit to the same Shared Work Queue. Performing a copy with source and destination being node 0, executed by the DSA on node 0.}
    \label{fig:perf-mtsubmit}
 \end{figure}
@ -85,67 +83,88 @@ In Figure \ref{fig:perf-mtsubmit}, we note that threading has no discernible neg

 Moving data from \gls{dram} to \gls{hbm} is most relevant to the rest of this work, as it is the target application. As we discovered in Section \ref{subsec:perf:submitmethod}, one \gls{dsa} has a peak bandwidth limit of 30 GiB/s. For each node, the test system is configured with two DIMMs of DDR5-4800. \par

-The naming scheme contains the data rate in megatransfers per second \cite{kingston:ddr5-spec-overview}. We calculate the transfers performed per second as follows:
+The naming scheme contains the data rate in megatransfers per second \cite{kingston:ddr5-spec-overview}. We calculate the transfers performed per second as follows: \par
+
+\[2\ DIMM * \frac{4800\ MT}{s\ *\ DIMM} = 9600\ MT/s\]

-\[2\ DIMMs * \frac{4800\ MT}{s\ and\ DIMM} = 76800\ T/s\]
+The data width of DDR5 is 64 bit \cite{kingston:ddr5-spec-overview}. From this, we calculate the number of bytes per transfer: \par

-The data width of DDR5 is 64 bit \cite{kingston:ddr5-spec-overview}. The theoretical peak transfer speed in GiB/s therefore is:
+\[\frac{64b}{8b/B}\ /\ Transfer = 8B\ /\ Transfer\]

-\[76800\ T/s * \frac{64b}{8b/B} = 75\ GiB/s\]
+Using the results of the previous calculations, we can now calculate the theoretical peak throughput per node on our test system: \par

-\par
+\[9600\ MT/s * 8B/T = 76.8\ GB/s \approx 71.5\ GiB/s\]
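The arithmetic can be checked with a short script. The constants are the ones stated in the text (two DIMMs of DDR5-4800 per node, 64 bit data width); the variable names are chosen for illustration only. Note that the result depends on the prefix used: decimal gigabytes and binary gibibytes differ by about 7 percent.

```python
DIMMS_PER_NODE = 2
TRANSFERS_PER_SECOND_PER_DIMM = 4800 * 10**6   # DDR5-4800: 4800 MT/s
BYTES_PER_TRANSFER = 64 // 8                   # 64 bit data width

peak_bytes_per_second = (DIMMS_PER_NODE
                         * TRANSFERS_PER_SECOND_PER_DIMM
                         * BYTES_PER_TRANSFER)

peak_gb_s = peak_bytes_per_second / 10**9      # decimal gigabytes per second
peak_gib_s = peak_bytes_per_second / 2**30     # binary gibibytes per second

print(f"{peak_gb_s:.1f} GB/s, {peak_gib_s:.1f} GiB/s")  # prints "76.8 GB/s, 71.5 GiB/s"
```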

-We conclude that to achieve peak speeds, a copy task has to be split across multiple \gls{dsa}s. Two methods of splitting will be evaluated. The first employs a brute force approach, utilizing all available resources for any transfer direction. The second method's behaviour depends on the data source and destination locations. Given that our system consists of multiple sockets, communication crossing between sockets could introduce latency and bandwidth disadvantages \cite{bench:heterogeneous-communication}. We posit that for intra-socket transfers, utilizing the \gls{dsa} from the second socket will have only a marginal effect. For transfers crossing sockets, we assume every \gls{dsa} performs equally worse, prompting us to use only the ones on the destination and source nodes for them being the physically closest to both memory regions. While this choice may result in lower performance, it uses only one-fourth of the engines in the brute force approach for inter-socket transfers and half for intra-socket transfers. This approach also frees up additional chips for other threads to utilize.\par
+We conclude that to achieve peak speeds, a copy task has to be split across multiple \gls{dsa}s. Two methods of splitting will be evaluated. The first employs a brute force approach, utilizing all available resources for any transfer direction. The second method's behaviour depends on the data source and destination locations. Given that our system consists of multiple sockets, communication crossing between sockets could introduce latency and bandwidth disadvantages \cite{bench:heterogeneous-communication}. We posit that for intra-socket transfers, utilizing the \gls{dsa} from the second socket will have only a marginal effect. For transfers crossing sockets, we assume every \gls{dsa} performs equally poorly, prompting us to use only the ones on the destination and source nodes, as they are physically closest to both memory regions. While this choice may result in lower performance, it uses only one-fourth of the engines of the brute force approach for inter-socket transfers and half for intra-socket transfers. This approach also frees up additional chips for other threads to utilize. \par

-For this benchmark, we transfer 1 Gibibyte of data from node 0 to the destination node, employing the submission method previously described. For each utilized node, we spawn one pinned thread responsible for submission. We present data for nodes 8, 11, 12, and 15. To understand the selection, refer to Figure \ref{fig:perf-xeonmaxnuma}, which illustrates the node IDs of the configured systems and the corresponding storage technology. Node 8 accesses the \gls{hbm} on node 0, making it the physically closest possible destination. Node 11 is located diagonally on the chip, representing the furthest intra-socket operation benchmarked. Nodes 12 and 15 lie diagonally on the second socket's CPU, making them representative of inter-socket transfer operations. \par
+\todo{update, add push-pull description}

-\begin{figure}[ht]
+For this benchmark, we transfer 1 Gibibyte of data from node 0 to the destination node, employing the submission method previously described. For each utilized node, we spawn one pinned thread responsible for submission. We present data for nodes 8, 11, 12, and 15. To understand the selection, refer to Figure \ref{fig:perf-xeonmaxnuma}, which illustrates the node IDs of the configured systems and the corresponding storage technology. Node 8 accesses the \gls{hbm} on node 0, making it the physically closest possible destination. Node 11 is located diagonally on the chip, representing the farthest intra-socket operation benchmarked. Nodes 12 and 15 lie diagonally on the second socket's CPU, making them representative of inter-socket transfer operations. \par
+\begin{figure}[h!tb]
    \centering
-    \begin{subfigure}{0.4\textwidth}
-        \includegraphics[width=\textwidth]{images/plot-perf-smart-throughput-selectbarplot.pdf}
+    \begin{subfigure}[t]{0.3\textwidth}
+        \centering
+        \includegraphics[width=\textwidth]{images/plot-allnodes-throughput.pdf}
        \caption{Brute Force Assignment: using all available \glsentryshort{dsa}, irrespective of source and destination locations.}
+        \label{fig:perf-dsa:allnodes}
+    \end{subfigure}
+    \hspace{5mm}
+    \begin{subfigure}[t]{0.3\textwidth}
+        \centering
+        \includegraphics[width=\textwidth]{images/plot-smart-throughput.pdf}
+        \caption{Smart Assignment: using four on-socket \glsentryshort{dsa} for intra-socket and the \glsentryshort{dsa} on source and destination \glsentryshort{numa:node} for inter-socket.}
        \label{fig:perf-dsa:smart}
    \end{subfigure}
-    \hfill
-    \begin{subfigure}{0.4\textwidth}
-        \includegraphics[width=\textwidth]{images/plot-perf-allnodes-throughput-selectbarplot.pdf}
-        \caption{Smart Assignment: using four on-socket \glsentryshort{dsa} for intra-socket operation and the \glsentryshort{dsa} on source and destination node for inter-socket copies.}
-        \label{fig:perf-dsa:allnodes}
+    \hspace{5mm}
+    \begin{subfigure}[t]{0.3\textwidth}
+        \centering
+        \includegraphics[width=\textwidth]{images/plot-smart-throughput.pdf}
+        \caption{Push-Pull: using the \glsentryshort{dsa} on source and destination \glsentryshort{numa:node} except intra node using two off-node but on-socket \gls{dsa}.}
+        \label{fig:perf-dsa:pushpull}
    \end{subfigure}
-    \hfill  
-    \caption{Copying 1 GiB from \glsentryshort{numa:node} 0 to the destination \glsentryshort{numa:node} specified on the x-axis. Shows peak throughput achievable with \glsentryshort{dsa} for different load balancing techniques.}
+    \caption{Copy from \glsentryshort{numa:node} 0 to the destination \glsentryshort{numa:node} specified on the x-axis. Shows peak throughput achievable with \glsentryshort{dsa} for different load balancing techniques.}
    \label{fig:perf-dsa}
 \end{figure}

-\todo{update}
+We first look at behaviour common for the three load balancing techniques in Figure \ref{fig:perf-dsa}. The real world peak throughput reaches close to 64 GiB/s, which, as we will see in Section \ref{subsec:perf:cpu-datacopy}, aligns with what can be achieved in select scenarios with the CPU. Additionally, \gls{numa:node} 8 performs worse than copying to \gls{numa:node} 11, leading us to believe that the \gls{dsa} here encounters some shared data paths, as \gls{numa:node} 8 accesses the \gls{hbm} on \gls{numa:node} 0, from which the data originates. Another interesting observation is that, contrary to our assumption, the physically more distant (from data origin \gls{numa:node} 0) \gls{numa:node} 15 reaches higher throughput than the closer \gls{numa:node} 12. We lack an explanation for this anomaly and will further examine the behaviour in the analysis of the CPU throughput results in Section \ref{subsec:perf:cpu-datacopy}. \par
+
+From the results of the brute force approach illustrated in Figure \ref{fig:perf-dsa:allnodes}, we observe peak speeds of close to 64 GiB/s when copying across sockets from \gls{numa:node} 0 to \gls{numa:node} 15. This contradicts our assumption that peak bandwidth would be limited by the interconnect. However, for intra node copies, there exists an observable penalty for using the off-socket \gls{dsa}s. When comparing with push-pull in Figure \ref{fig:perf-dsa:pushpull}, performance actually decreases by utilizing four times more resources over a longer duration. As the brute force approach is still slightly faster than push-pull and smart-assignment, utilizing more resources still yields some gains, although far from linear. Therefore, we conclude that, although data movement across the interconnect incurs additional cost, no hard bandwidth limit is observable. \par

-From the results of the brute force approach illustrated in Figure \ref{fig:perf-peak-brute}, we observe peak speeds of close to 64 GiB/s when copying across the socket from Node 0 to Node 15. This contradicts our assumption that peak bandwidth would be limited by the interconnect.  \par
+Smart \gls{numa:node} assignment results in peak performance observable for intra socket data movement, see Figure \ref{fig:perf-dsa:smart}. Operations crossing the socket boundary, namely to \gls{numa:node}s 12 and 15, are slightly slower than the peak observed for brute force assignment, visible in Figure \ref{fig:perf-dsa:allnodes}. At the same time, these use one fourth of the available resources, again showing less-than-linear scaling of throughput with the amount of participating \gls{dsa}s. \par

-While consuming significantly more resources, the brute force copy depicted in Figure \ref{fig:perf-peak-brute} surpasses the performance of the smart approach shown in Figure \ref{fig:perf-peak-smart}. We observe an increase in transfer speed by utilizing all available \gls{dsa}, achieving 2 GiB/s for copying to Node 8, 18 GiB/s for Nodes 11 and 12, and 30 GiB/s for Node 15. The smart approach could accommodate another intra-socket copy on the second socket, we assume, without observing negative impacts. From this, we conclude that the smart copy assignment is worth using, as it provides better scalability. \par
+\todo{evaluate pushpull here when benchmark is done}
+
+While consuming significantly more resources, the brute force copy depicted in Figure \ref{fig:perf-dsa:allnodes} surpasses the performance of the smart approach shown in Figure \ref{fig:perf-dsa:smart}. We observe an increase in transfer speed by utilizing all available \gls{dsa}, achieving 2 GiB/s for copying to \gls{numa:node} 8, 18 GiB/s for \gls{numa:node}s 11 and 12, and 30 GiB/s for \gls{numa:node} 15. The smart approach could accommodate another intra-socket copy on the second socket, we assume, without observing negative impacts. From this, we conclude that the smart copy assignment is worth using, as it provides better scalability. \par
+
+\todo{update when pushpull is completed}

 \subsection{Data Movement using CPU}
+\label{subsec:perf:cpu-datacopy}

-For evaluating CPU copy performance we use the benchmark code from Section \ref{subsec:perf:datacopy}, selecting the software instead of hardware execution path (see Section \ref{subsec:state:dsa-software-view}). Colleagues performed extensive benchmarking of the peak throughput on CPU for the test system \cite{xeonmax-peakthroughput}, from which we will present results as well. \par
+For evaluating CPU copy performance, we use the benchmark code from the preceding Section \ref{subsec:perf:datacopy}, selecting the software instead of the hardware execution path (see Section \ref{subsec:state:dsa-software-view}). Colleagues performed extensive benchmarking of the peak throughput on CPU for the test system \cite{xeonmax-peakthroughput}, from which we will present results as well. We compare the expectations and results from the previous Section with these measurements. \par

-\begin{figure}[ht]
+\begin{figure}[h!tb]
    \centering
-    \begin{subfigure}{0.4\textwidth}
+    \begin{subfigure}[t]{0.35\textwidth}
+        \centering
        \includegraphics[width=\textwidth]{images/plot-allnodes-cpu-throughput.pdf}
-        \caption{Using allnodes code with software path.}
+        \caption{DML code for allnodes running on software path.}
        \label{fig:perf-cpu:swpath}
    \end{subfigure}
-    \hfill
-    \begin{subfigure}{0.4\textwidth}
+    \hspace{8mm}
+    \begin{subfigure}[t]{0.35\textwidth}
+        \centering
        \includegraphics[width=\textwidth]{images/plot-andrepeak-cpu-throughput.pdf}
-        \caption{Colleague's benchmark \cite{xeonmax-peakthroughput}.}
+        \caption{Colleague's CPU peak throughput benchmark \cite{xeonmax-peakthroughput} results.}
        \label{fig:perf-cpu:andrepeak}
    \end{subfigure}
-    \hfill  
    \caption{Throughput from \glsentryshort{dram} to \glsentryshort{hbm} on CPU. Copying from \glsentryshort{numa:node} 0 to the destination \glsentryshort{numa:node} specified on the x-axis.}
    \label{fig:perf-cpu}
 \end{figure}

-The 
+As evident from Figure \ref{fig:perf-cpu:swpath}, the observed throughput of the software path is less than half of the theoretical bandwidth. Therefore, the software path is to be treated as a compatibility measure and not as a means of providing high-performance data copy operations. It is, however, the only benchmark that performs as originally expected for transfers to \gls{numa:node}s 12 and 15, with the latter performing worse. Based on the layout in Figure \ref{fig:perf-xeonmaxnuma}, we assumed in the previous Section that \gls{numa:node} 12 would outperform \gls{numa:node} 15 due to its lower physical distance. As this assumption was invalidated there, the result for the CPU is unexpected in this case. \par
+
+In Figure \ref{fig:perf-cpu:andrepeak}, peak throughput is achieved for intra-node operation. This validates the assumption that there is a cost for communicating across sockets, which was not as directly observable with the \gls{dsa}. The same disadvantage for \gls{numa:node} 12 as seen in Section \ref{subsec:perf:datacopy} can be observed in Figure \ref{fig:perf-cpu:andrepeak}. As the results from the software path do not exhibit this, the anomaly seems to occur only in bandwidth-saturating scenarios. \par

 \section{Analysis}
 \label{sec:perf:analysis}
@ -153,14 +172,16 @@ The
 In this section we summarize the conclusions drawn from the three benchmarks performed in the sections above and outline a utilization guideline. We also compare CPU and \gls{dsa} for the task of copying data from \gls{dram} to \gls{hbm}. \par

 \begin{itemize}
-    \item From \ref{subsec:perf:submitmethod} we conclude that small copies under 1 MiB in size require batching and still do not reach peak performance. Task size should therefore be at or above 1 MiB if possible.
-    \item Subsection \ref{subsec:perf:mtsubmit} assures that access from multiple threads does not negatively affect the performance when using \glsentrylong{dsa:swq} for work submission.
-    \item In \ref{subsec:perf:datacopy}, we chose to use the presented smart copy methodology to split copy tasks across multiple \gls{dsa} chips to achieve low utilization with acceptable performance.
+    \item From \ref{subsec:perf:submitmethod} we conclude that small copies under 1 MiB in size require batching and still do not reach peak performance. Tasks should therefore be at least 1 MiB in size; smaller copies should be performed by the CPU.
+    \item Section \ref{subsec:perf:mtsubmit} assures that access from multiple threads does not negatively affect the performance when using \glsentrylong{dsa:swq} for work submission. Due to the lack of \glsentrylong{dsa:dwq} support, we have no data to determine the cost of submission to the \gls{dsa:swq}.
+    \item In \ref{subsec:perf:datacopy}, we chose to use the presented smart copy methodology to split copy tasks across multiple \gls{dsa} chips to achieve low utilization with acceptable performance. \todo{evaluate after pushpull results}
 \end{itemize}

-\todo{update}
+Once again, we refer to Figures \ref{fig:perf-dsa} and \ref{fig:perf-cpu}, which show the maximum throughput achieved with the \gls{dsa} and the CPU, respectively. Noticeably, the \gls{dsa} does not seem to suffer from inter-socket overhead like the CPU. In any case, the \gls{dsa} performs similarly to the CPU, demonstrating potential for faster data movement while simultaneously freeing up cycles for other tasks. \par
+
+We discovered an anomaly for \gls{numa:node} 12 for which we did not find an explanation. As the behaviour is also exhibited by the CPU, discovering the root issue falls outside the scope of this work. \par

-Once again, we refer to Figures \ref{fig:perf-cpu} and \ref{fig:perf-cpu-peak-brute}, both representing the maximum throughput achieved with the utilization of either \gls{dsa} for the former and CPU for the latter. Noticeably, the \gls{dsa} does not seem to suffer from inter-socket overhead like the CPU. In any case, \gls{dsa} performs at or above CPUs peak throughput, demonstrating potential for faster data movement while simultaneously freeing up CPU cycles. \par
+Even though we could not find an explanation for all measurements, this chapter still gives insight into the performance of the \gls{dsa}, its strengths and its weaknesses. It provides data-driven guidance on a complex architecture, helping to find the optimum for applying the \gls{dsa} to our expected and possibly different workloads. \par

 %%% Local Variables:
 %%% TeX-master: "diplom"
--- a/thesis/own.bib
+++ b/thesis/own.bib
@ -213,11 +213,11 @@
    urldate = {2024-01-21}
 }

-@unpublished{dimes-prefetching,
+@misc{dimes-prefetching,
    author = {André Berthold and Anna Bartuschka and Dirk Habich and Wolfgang Lehner and Horst Schirmeier},
    title = {{Towards Query-Driven Prefetching to Optimize Data Pipelines in Heterogeneous Memory Systems}},
    date = {2023},
-    publisher = {ACM}
+    howpublished = "unpublished"
 }

@ONLINE{microsoft:numa-malloc,
@ -226,4 +226,38 @@
    date = {2021-07-01},
    url = {https://learn.microsoft.com/en-us/windows/win32/memory/allocating-memory-from-a-numa-node},
    urldate = {2024-01-28}
+}
+
+@ONLINE{kingston:ddr5-spec-overview,
+    author = {Kingston},
+    title = {{DDR5 memory standard: An introduction to the next generation of DRAM module technology}},
+    date = {2024-01},
+    url = {https://www.kingston.com/en/blog/pc-performance/ddr5-overview},
+    urldate = {2024-02-04}
+}
+
+@ARTICLE{bench:heterogeneous-communication,
+    author={Thune, Andreas and Reinemo, Sven-Arne and Skeie, Tor and Cai, Xing},
+    journal={IEEE Transactions on Parallel and Distributed Systems}, 
+    title={Detailed Modeling of Heterogeneous and Contention-Constrained Point-to-Point MPI Communication}, 
+    year={2023},
+    volume={34},
+    number={5},
+    pages={1580-1593},
+    keywords={Bandwidth;Sockets;Benchmark testing;Size measurement;Protocols;Multicore processing;Computational modeling;Intra-node communication;performance modeling;point-to-point MPI communication},
+    doi={10.1109/TPDS.2023.3253881}
+}
+
+@misc{xeonmax-peakthroughput,
+  author = "André Berthold and Anna Bartuschka",
+  title={{Throughput Benchmarks for CPU}},
+  date = "2023",
+  howpublished = "personal communication"
+}
+
+@misc{time-longer-sections,
+  author = "Horst Schirmeier",
+  title={{Length of timed Sections}},
+  date = "2023",
+  howpublished = "personal communication"
 }