\chapter{Performance Microbenchmarks}
\label{chap:perf}
The performance of \gls{dsa} has been evaluated in great detail by Reese Kuper et al. in \cite{intel:analysis}. Therefore, we perform only a limited number of benchmarks, with the purpose of verifying the figures from \cite{intel:analysis} and analysing best practices and restrictions for applying \gls{dsa} to \gls{qdp}. \par
\section{Benchmarking Methodology}
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth]{images/structo-benchmark-compact.png}
\caption{Benchmark Procedure Pseudocode}
\label{fig:benchmark-function}
\end{figure}
Benchmarks were conducted on an Intel Xeon Max CPU, with the system configuration following Section \ref{sec:state:setup-and-config} and exclusive access to the system. As Intel's \gls{intel:dml} does not support \gls{dsa:dwq}, we ran all benchmarks with access through \gls{dsa:swq}. The application written for the benchmarks is available in source form under \texttt{benchmarks} in the thesis repository \cite{thesis-repo}. With the full source code available, we only briefly describe the pseudocode shown in Figure \ref{fig:benchmark-function} in the following paragraph. \par
The benchmark performs node setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the nodes passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts. The timed regions are marked with a yellow background in Figure \ref{fig:benchmark-function}: we measure three durations, one being the total time and the other two being the submission time and the completion time. To obtain accurate results, the benchmark is repeated multiple times, and for short-running configurations we evaluate only the total time, as the individual measurements of submission and completion contained too many irregularities. At the beginning of each repetition, all running threads synchronize at a barrier. The behaviour then differs depending on the selected submission method, which is either a single submission or a batch of a given size, as seen in Figure \ref{fig:benchmark-function} at the switch statement for \enquote{mode}. Single submission follows the example given in Section \ref{sec:state:dml}, and we therefore do not explain it in detail here. Batch submission differs from the former: a sequence of the specified size is created, to which the copy tasks are then added. This sequence is submitted to the engine similarly to a single descriptor. Further steps then again follow the example from Section \ref{sec:state:dml}. \par
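To make this description concrete, the following is a minimal sketch of the timed region using the C++ API of \gls{intel:dml}. The buffer size, the batch length and the use of plain \texttt{std::vector} instead of NUMA-aware allocation are illustrative assumptions and do not reproduce the actual benchmark code. \par
\begin{verbatim}
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>
#include <dml/dml.hpp>

int main() {
    constexpr std::size_t size = 1u << 20; // 1 MiB per copy
    constexpr std::size_t batch_size = 4;  // tasks per batch

    // writing the buffers beforehand prefaults the pages
    std::vector<std::uint8_t> src(size, 0xAB), dst(size, 0x00);

    // single submission: one descriptor per copy
    auto t0 = std::chrono::steady_clock::now();
    auto handler = dml::submit<dml::hardware>(
        dml::mem_move, dml::make_view(src), dml::make_view(dst));
    auto t1 = std::chrono::steady_clock::now(); // after submission
    auto result = handler.get();                // wait for completion
    auto t2 = std::chrono::steady_clock::now(); // after completion
    assert(result.status == dml::status_code::ok);

    std::chrono::duration<double> time_submit = t1 - t0;
    std::chrono::duration<double> time_total  = t2 - t0;
    std::printf("submit %fs, total %fs\n",
                time_submit.count(), time_total.count());

    // batch submission: one descriptor for multiple copies
    auto sequence = dml::sequence(batch_size,
                                  std::allocator<dml::byte_t>());
    for (std::size_t i = 0; i < batch_size; ++i)
        sequence.add(dml::mem_move,
                     dml::make_view(src), dml::make_view(dst));
    auto batch_result = dml::execute<dml::hardware>(
        dml::batch, sequence);
    assert(batch_result.status == dml::status_code::ok);
}
\end{verbatim}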
\section{Submission Method}
With each submission, descriptors must be prepared and sent to the underlying hardware. This is expected to come at a cost that affects the submission methods and transfer sizes differently. By submitting a range of transfer sizes and comparing batched with single submission, we evaluate at which data sizes each submission method makes sense. We expect single submission to perform consistently worse, with a pronounced effect at smaller transfer sizes. This is because the overhead of a submission through the \gls{dsa:swq} is incurred on every iteration, while a batch incurs it only once for multiple copies. \par
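This expectation can be phrased as a simplified cost model. Assuming serialized submissions with a fixed per-descriptor submission overhead $t_s$ and a size-dependent copy time $t_c$, moving $n$ blocks costs approximately
\[
T_{\mathrm{single}} \approx n \, (t_s + t_c), \qquad T_{\mathrm{batch}} \approx t_s + n \, t_c,
\]
so batching amortizes $t_s$ over $n$ copies, and the difference between the two methods shrinks as $t_c$ grows with the transfer size. \par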
\begin{figure}[h]
\centering
\includegraphics[width=0.7\textwidth]{images/plot-opt-submitmethod.png}
\caption{Throughput for different Submission Methods and Sizes}
\label{fig:perf-submitmethod}
\end{figure}
From Figure \ref{fig:perf-submitmethod} we conclude that, for transfers of 1 MiB and upwards, the submission method makes no noticeable difference. For smaller transfers, performance varies greatly, with batch operations leading in throughput. This aligns with the finding that \enquote{SWQ observes lower throughput between 1-8 KB [transfer size]} \cite[p. 6 and 7]{intel:analysis} for the normal submission method. \par
Another limitation can be observed in this result: the inherent throughput limit of close to 30 GiB/s per \gls{dsa} chip, apparently caused by I/O fabric restrictions \cite[p. 5]{intel:analysis}. We therefore conclude that multiple \gls{dsa} must be used to fully utilize the available bandwidth of \gls{hbm}, which theoretically lies at 256 GB/s \cite[Table I]{hbm-arch-paper}. \par
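To put these figures in proportion: $256~\mathrm{GB/s} \approx 238~\mathrm{GiB/s}$, so with close to $30~\mathrm{GiB/s}$ per chip, at least
\[
\left\lceil \frac{238~\mathrm{GiB/s}}{30~\mathrm{GiB/s}} \right\rceil = 8
\]
\gls{dsa} would be needed to saturate \gls{hbm}, assuming the ideal case of perfect scaling across chips. \par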
\section{Multithreaded Submission}
\todo{write this section}
\begin{itemize}
\item effect of mt-submit, low because \gls{dsa:swq} implicitly synchronized, bandwidth is shared (see the sketch after this list)
\item show results for all available core counts
\item only display the 1engine tests
\item show combined total throughput
\item conclude that due to the implicit synchronization the sync-cost also affects 1t and therefore it makes no difference, bandwidth is shared, no guarantees on fairness
\end{itemize}
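The following hypothetical sketch outlines how a multithreaded benchmark could drive copies through the shared \gls{dsa:swq}: all threads synchronize at a barrier and then each submits its own descriptor, with the implicit synchronization happening in hardware at enqueue. Thread count and buffer sizes are illustrative assumptions and not the actual benchmark parameters. \par
\begin{verbatim}
#include <cassert>
#include <barrier>   // requires C++20
#include <cstdint>
#include <thread>
#include <vector>
#include <dml/dml.hpp>

constexpr std::size_t thread_count = 4;
constexpr std::size_t size = 1u << 20;

void worker(std::barrier<>& sync,
            std::vector<std::uint8_t>& src,
            std::vector<std::uint8_t>& dst) {
    sync.arrive_and_wait(); // all threads submit together
    auto result = dml::execute<dml::hardware>(
        dml::mem_move, dml::make_view(src), dml::make_view(dst));
    assert(result.status == dml::status_code::ok);
}

int main() {
    std::barrier<> sync(thread_count);
    std::vector<std::vector<std::uint8_t>>
        src(thread_count, std::vector<std::uint8_t>(size, 0xAB)),
        dst(thread_count, std::vector<std::uint8_t>(size, 0x00));

    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < thread_count; ++i)
        threads.emplace_back(worker, std::ref(sync),
                             std::ref(src[i]), std::ref(dst[i]));
    for (auto& t : threads)
        t.join();
}
\end{verbatim}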
\section{Data Movement from DDR to HBM} \label{sec:perf:datacopy}
\todo{write this section}
\begin{itemize}
\item present two copy methods: smart and brute force (a chunking sketch follows this list)
\item show graph for ddr->hbm intranode, ddr->hbm intrasocket, ddr->hbm intersocket
\item conclude which option makes more sense (smart)
\item because 4x or 2x utilization for only 1.5x or 1.25x speedup respectively
\item maybe benchmark smart-copy intersocket in parallel with two smart-copies intrasocket VS. the same task with brute force
\end{itemize}
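To illustrate what splitting one copy across multiple chips could look like, the following hypothetical sketch divides a region into equally sized chunks and submits each chunk as its own descriptor, awaiting all handlers afterwards. Directing each chunk to a specific \gls{dsa} is assumed to happen through the placement of the submitting thread and is omitted here; \texttt{chunked\_copy} and its parameters are illustrative names, not part of the benchmark application. \par
\begin{verbatim}
#include <cassert>
#include <cstdint>
#include <vector>
#include <dml/dml.hpp>

// split one copy into `chunks` parts, one descriptor each;
// assumes size is divisible by chunks
void chunked_copy(std::uint8_t* src, std::uint8_t* dst,
                  std::size_t size, std::size_t chunks) {
    const std::size_t step = size / chunks;

    auto submit_chunk = [&](std::size_t i) {
        return dml::submit<dml::hardware>(
            dml::mem_move,
            dml::make_view(src + i * step, step),
            dml::make_view(dst + i * step, step));
    };

    std::vector<decltype(submit_chunk(0))> handlers;
    for (std::size_t i = 0; i < chunks; ++i)
        handlers.push_back(submit_chunk(i));

    // await all partial copies
    for (auto& handler : handlers) {
        auto result = handler.get();
        assert(result.status == dml::status_code::ok);
    }
}
\end{verbatim}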
\section{Analysis}
\todo{write this section}
\begin{itemize}
\item summarize the conclusions and define the point at which dsa makes sense
\item minimum transfer size for batch/nonbatch operation
\item effect of mtsubmit -> no fairness guarantees
\item usage of multiple engines -> no effect
\item smart copy method as the middle-ground between peak throughput and utilization
\item lower utilization of dsa is good when it will be shared between threads/processes
\end{itemize}
%%% Local Variables:
%%% TeX-master: "diplom"
%%% End: