\chapter{Performance Microbenchmarks} \label{chap:perf}
\todo{write introductory paragraph, mention the article by Reese Cooper here}
\section{Benchmarking Methodology}
\begin{figure}[h] \centering \includegraphics[width=1.0\textwidth]{images/structo-benchmark.png} \caption{Structure of the Benchmark Procedure} \label{fig:perf-bench-structure} \end{figure}
\todo{split graphic into multiple parts for the three submission types}
\todo{write this section}
\section{Submission Method}
With each submission, descriptors must be prepared and sent off to the underlying hardware. This is expected to come at a cost which affects different transfer sizes and submission methods differently. By submitting varying sizes and comparing batching, single submission and multi submission utilizing the \gls{dsa}'s queue, we evaluate at which data size which submission method makes sense. \par
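To illustrate the three methods, the listing below sketches them with Intel's DML C++ interface. This is a minimal sketch rather than the benchmark code itself: it assumes the \texttt{dml::mem\_move} operation on the hardware path, a block count that evenly divides the transfer size, and elides error handling. \par
\begin{lstlisting}[language=C++]
#include <dml/dml.hpp>
#include <cstddef>
#include <cstdint>
#include <memory>

// Single submission: one descriptor, one submission per block.
void copy_single(uint8_t* src, uint8_t* dst, size_t size) {
    auto handler = dml::submit<dml::hardware>(
        dml::mem_move, dml::make_view(src, size), dml::make_view(dst, size));
    auto result = handler.get();  // wait for the DSA to signal completion
    if (result.status != dml::status_code::ok) { /* handle error */ }
}

// Multi submission: several descriptors are enqueued before waiting on
// any of them, keeping the device queue filled.
void copy_multi(uint8_t* src, uint8_t* dst, size_t size) {
    size_t const half = size / 2;
    auto h1 = dml::submit<dml::hardware>(dml::mem_move,
        dml::make_view(src, half), dml::make_view(dst, half));
    auto h2 = dml::submit<dml::hardware>(dml::mem_move,
        dml::make_view(src + half, size - half),
        dml::make_view(dst + half, size - half));
    if (h1.get().status != dml::status_code::ok) { /* handle error */ }
    if (h2.get().status != dml::status_code::ok) { /* handle error */ }
}

// Batch submission: many descriptors but only one submission for the
// whole sequence, amortizing the submission cost.
void copy_batch(uint8_t* src, uint8_t* dst, size_t size, uint32_t n) {
    auto sequence = dml::sequence(n, std::allocator<dml::byte_t>());
    size_t const chunk = size / n;
    for (uint32_t i = 0; i < n; i++) {
        sequence.add(dml::mem_move,
                     dml::make_view(src + i * chunk, chunk),
                     dml::make_view(dst + i * chunk, chunk));
    }
    auto result = dml::execute<dml::hardware>(dml::batch, sequence);
    if (result.status != dml::status_code::ok) { /* handle error */ }
}
\end{lstlisting}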
\begin{figure}[h] \centering \includegraphics[width=0.7\textwidth]{images/plot-opt-submitmethod.png} \caption{Throughput for different Submission Methods and Sizes} \label{fig:perf-submitmethod} \end{figure}
\todo{maybe remeasure with 8 KiB, 16 KiB as sizes}
From Figure \ref{fig:perf-submitmethod} we conclude that for transfers of 1 MiB and upwards, the submission method makes no noticeable difference. For smaller transfers the performance varies greatly, with batch operations leading in throughput. We assume that the high submission cost of the \gls{dsa:swq} causes all methods but the batch, which performs only one submission for its many descriptors, to suffer. This aligns with the finding that \enquote{SWQ observes lower throughput between 1-8 KB [transfer size]} \cite[p. 6 and 7]{intel:analysis}. \par
Another limitation can be observed in this result, namely the inherent throughput limit of close to 30 GiB/s per \gls{dsa} chip, which is apparently caused by I/O fabric limitations \cite[p. 5]{intel:analysis}. \par
\section{Multithreaded Submission}
\todo{write this section}
\begin{itemize}
    \item effect of multithreaded submission is low because the \gls{dsa:swq} is implicitly synchronized and bandwidth is shared (see the sketch following this list)
    \item show results for all available core counts
    \item only display the 1-engine tests
    \item show combined total throughput
    \item conclude that, due to the implicit synchronization, the synchronization cost also affects the single-threaded case and multithreading therefore makes no difference; bandwidth is shared with no guarantees on fairness
\end{itemize}
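The submission pattern behind these measurements can be sketched as below; assumed here are Intel DML and one synchronous \texttt{dml::execute} per thread, all targeting the same \gls{dsa:swq}. \par
\begin{lstlisting}[language=C++]
#include <dml/dml.hpp>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Every thread submits to the same shared work queue. Submission via
// ENQCMD is implicitly synchronized by the hardware, so the threads
// compete for the bandwidth of one device with no fairness guarantees.
void copy_mt(uint8_t* src, uint8_t* dst, size_t size, unsigned threads) {
    size_t const chunk = size / threads;  // assume threads divides size
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; t++) {
        pool.emplace_back([=]() {
            auto result = dml::execute<dml::hardware>(
                dml::mem_move,
                dml::make_view(src + t * chunk, chunk),
                dml::make_view(dst + t * chunk, chunk));
            if (result.status != dml::status_code::ok) { /* handle error */ }
        });
    }
    for (auto& worker : pool) worker.join();
}
\end{lstlisting}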
\todo{write this section}
\section{Data Movement from DDR to HBM} \label{sec:perf:datacopy}
\todo{write this section}
\begin{itemize}
    \item present the two copy methods, smart and brute force (a sketch of the smart method follows this list)
    \item show graphs for DDR-to-HBM intranode, intrasocket and intersocket copies
    \item conclude which option makes more sense (smart), because brute force costs 4x or 2x the utilization for only 1.5x or 1.25x speedup respectively
    \item maybe benchmark the intersocket smart-copy in parallel with two intrasocket smart-copies vs. the same task with brute force
\end{itemize}
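A sketch of the smart copy method is given below. It assumes, and this is an assumption of the sketch rather than behaviour confirmed here, that a hardware-path submission is serviced by a \gls{dsa} on the NUMA node of the calling thread, which is why the submitting threads are pinned via libnuma. \par
\begin{lstlisting}[language=C++]
#include <dml/dml.hpp>
#include <cstddef>
#include <cstdint>
#include <numa.h>  // libnuma, for pinning the submitting threads
#include <thread>

// Smart copy: split the transfer between the DSA on the source node and
// the DSA on the destination node. Assumes submissions are routed to a
// DSA on the calling thread's NUMA node (see lead-in text).
void smart_copy(uint8_t* src, uint8_t* dst, size_t size,
                int src_node, int dst_node) {
    size_t const half = size / 2;
    auto copy_on_node = [](int node, uint8_t* s, uint8_t* d, size_t n) {
        return std::thread([=]() {
            numa_run_on_node(node);  // bind this thread to the given node
            auto result = dml::execute<dml::hardware>(
                dml::mem_move, dml::make_view(s, n), dml::make_view(d, n));
            if (result.status != dml::status_code::ok) { /* handle error */ }
        });
    };
    // One half is "pushed" by the source node's DSA, the other half is
    // "pulled" by the destination node's DSA.
    auto pusher = copy_on_node(src_node, src, dst, half);
    auto puller = copy_on_node(dst_node, src + half, dst + half,
                               size - half);
    pusher.join();
    puller.join();
}
\end{lstlisting}
Brute force would instead pin one submitting thread per available \gls{dsa} and spread chunks over all of them, trading higher device utilization for the modest speedup noted above. \par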
\section{Analysis}
\todo{write this section}
\begin{itemize}
    \item summarize the conclusions and define the point at which the \gls{dsa} makes sense
    \item minimum transfer size for batch/non-batch operation
    \item effect of multithreaded submission: no fairness guarantees
    \item usage of multiple engines: no effect
    \item smart copy method as the middle ground between peak throughput and utilization
    \item lower utilization of the \gls{dsa} is good when it is shared between threads/processes
\end{itemize}
\cleardoublepage
%%% Local Variables:
%%% TeX-master: "diplom"
%%% End: