\caption{Throughput for different Submission Methods and Sizes}
\label{fig:perf-submitmethod}
\end{figure}
\todo{split graphic into multiple parts for the three submission types}
\section{Submission Method}
\todo{write this section}
With each submission, descriptors must be prepared and sent off to the underlying hardware. This is expected to come at a cost which affects throughput, and the impact may differ between submission methods and transfer sizes. By submitting transfers of different sizes and comparing batch submission, single submission and multi submission, which keeps multiple individually submitted descriptors outstanding in the work queue of the \gls{dsa}, we evaluate for which data sizes each submission method is sensible. \par
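The three methods can be illustrated with a short sketch. It assumes the high-level C++ interface of the Intel Data Mover Library (DML); the names \texttt{dml::submit}, \texttt{dml::sequence} and \texttt{dml::batch} follow the library documentation, and the transfer size and descriptor count are placeholder values, so the listing is not taken from the benchmark implementation. \par

\begin{lstlisting}[language=C++]
#include <dml/dml.hpp>

#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

constexpr std::size_t chunk = 4096; // assumed transfer size per descriptor
constexpr std::size_t count = 50;   // assumed number of descriptors per task set

int main() {
    std::vector<std::uint8_t> src(chunk * count, 0xAB);
    std::vector<std::uint8_t> dst(chunk * count, 0x00);

    // (1) single submission: one descriptor is prepared, submitted and waited on
    auto single = dml::submit<dml::hardware>(
        dml::mem_move,
        dml::make_view(src.data(), chunk),
        dml::make_view(dst.data(), chunk));
    auto single_result = single.get();

    // (2) multi submission: each descriptor is submitted on its own, but all of
    //     them are kept outstanding in the work queue before waiting for completion
    std::vector<decltype(single)> handlers;
    for (std::size_t i = 1; i < count; ++i) {
        handlers.push_back(dml::submit<dml::hardware>(
            dml::mem_move,
            dml::make_view(src.data() + i * chunk, chunk),
            dml::make_view(dst.data() + i * chunk, chunk)));
    }
    for (auto &handler : handlers) {
        handler.get();
    }

    // (3) batch submission: descriptors are collected in a sequence and the whole
    //     task set is handed to the hardware with a single submission
    auto sequence = dml::sequence(count, std::allocator<dml::byte_t>());
    for (std::size_t i = 0; i < count; ++i) {
        sequence.add(dml::mem_move,
                     dml::make_view(src.data() + i * chunk, chunk),
                     dml::make_view(dst.data() + i * chunk, chunk));
    }
    auto batch_result = dml::execute<dml::hardware>(dml::batch, sequence);

    return (single_result.status == dml::status_code::ok &&
            batch_result.status == dml::status_code::ok) ? 0 : 1;
}
\end{lstlisting}

Single and multi submission pay the submission cost once per descriptor, while the batch pays it only once per task set, which is the difference evaluated in this section. \par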
\begin{itemize}
\item submit cost analysis: best method and, for a subset, the point at which the submit cost is lower than the time savings
\item display the full opt-submitmethod graph
\item maybe remeasure with a higher number of small copies? results look somewhat weird for 1 KiB and 4 KiB
\item display the stacked bar of submit and complete time for single@1k, single@4k and single@1MiB for HW-path and SW-path (see the timing sketch after this list)
\item display the stacked bar of submit and complete time for batch50@1k, batch50@4k and batch50@1MiB for HW-path and SW-path
\item show batch because we care about the minimum task set size for a single producer (multi submit would be used for different task sets)
\item conclude at which point using the \gls{dsa} makes sense
\end{itemize}
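For the planned stacked bars, the time spent in submission has to be separated from the time spent waiting for completion. A minimal sketch of such a measurement is given below, again assuming the DML C++ interface; the helper name and the microsecond unit are illustrative choices. \par

\begin{lstlisting}[language=C++]
#include <dml/dml.hpp>

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical helper (not part of the benchmark): returns {submit time, complete time}
// in microseconds for one copy, the split required for the planned stacked bars.
template <typename execution_path>
std::pair<double, double> timed_copy(const std::uint8_t *src, std::uint8_t *dst, std::size_t size) {
    using clock = std::chrono::steady_clock;
    const auto to_us = [](auto duration) {
        return std::chrono::duration<double, std::micro>(duration).count();
    };

    const auto before_submit = clock::now();
    auto handler = dml::submit<execution_path>(
        dml::mem_move, dml::make_view(src, size), dml::make_view(dst, size));
    const auto after_submit = clock::now();  // descriptor prepared and enqueued

    handler.get();                           // block until the operation has completed
    const auto after_complete = clock::now();

    return {to_us(after_submit - before_submit), to_us(after_complete - after_submit)};
}

// usage: timed_copy<dml::hardware>(src, dst, 1 << 20) for the HW-path and
//        timed_copy<dml::software>(src, dst, 1 << 20) for the SW-path
\end{lstlisting}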
\todo{maybe remeasure with 8 KiB and 16 KiB as sizes}
From Figure \ref{fig:perf-submitmethod} we conclude that for transfers of 1 MiB and above, the submission method makes no noticeable difference. For smaller transfers the performance varies greatly, with batch operations leading in throughput. We assume that the high submission cost of the \gls{dsa:swq} penalizes all methods except the batch, which only performs one submission for its many descriptors. This is aligned with the finding that \enquote{SWQ observes lower throughput between 1-8 KB [transfer size]}\cite[p. 6 and 7]{intel:analysis}. \par
Another limitation can be observed in these results: the inherent throughput limit of close to 30 GiB/s per \gls{dsa} chip, which is apparently caused by I/O fabric limitations \cite[p. 5]{intel:analysis}. \par
\section{Multithreaded Submission}
\todo{write this section}
\section{Multiple Engines in a Group}
\todo{write this section}
\begin{itemize}
\item assumed from the architecture specification that multiple engines lead to greater performance
\item the reason is that page faults and access latency can be overlapped with preparing the next operation
\item in the given scenario we observe the opposite, a slight performance decrease
\item show multi submit with 50 copies for both one and four engines (see the sketch after this list)
\item maybe remeasure with each submission accessing a different memory region?
\item conclusion?
\end{itemize}
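A sketch of the multi-submission pattern referenced above is given below. It keeps a bounded window of copies outstanding, so that multiple engines in a group can pick up independent descriptors and overlap page faults or memory access latency of one operation with the processing of others. It again assumes the DML C++ interface, and the window size and buffer layout are illustrative assumptions rather than the benchmark configuration. \par

\begin{lstlisting}[language=C++]
#include <dml/dml.hpp>

#include <cstddef>
#include <cstdint>
#include <deque>

// Sketch only: keeps a bounded number of copies outstanding so that, with more than
// one engine configured in the group, independent descriptors are available to the
// engines. Window size and buffer layout are illustrative assumptions.
void copy_windowed(const std::uint8_t *src, std::uint8_t *dst,
                   std::size_t chunk, std::size_t count, std::size_t window) {
    using handler_t = decltype(dml::submit<dml::hardware>(
        dml::mem_move, dml::make_view(src, chunk), dml::make_view(dst, chunk)));
    std::deque<handler_t> in_flight;

    for (std::size_t i = 0; i < count; ++i) {
        if (in_flight.size() == window) {
            in_flight.front().get();         // retire the oldest copy first
            in_flight.pop_front();
        }
        in_flight.push_back(dml::submit<dml::hardware>(
            dml::mem_move,
            dml::make_view(src + i * chunk, chunk),
            dml::make_view(dst + i * chunk, chunk)));
    }
    while (!in_flight.empty()) {             // drain the remaining completions
        in_flight.front().get();
        in_flight.pop_front();
    }
}
\end{lstlisting}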
\section{Data Movement from DDR to HBM}\label{sec:perf:datacopy}