\caption{Throughput for different Submission Methods and Sizes}
\label{fig:perf-submitmethod}
\end{figure}
\todo{split graphic into multiple parts for the three submission types}
\section{Submission Method}
\todo{write this section}
With each submission, descriptors must be prepared and sent off to the underlying hardware. This is expected to come at a cost which affects throughput, and the impact may differ between submission methods and transfer sizes. By submitting transfers of different sizes and comparing batch submission, single submission and multi submission, which keeps multiple individually submitted descriptors outstanding in the work queue of the \gls{dsa}, we evaluate for which data sizes each submission method is sensible. \par
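The three methods can be illustrated with a short sketch. It assumes the high-level C++ interface of the Intel Data Mover Library (DML); the names \texttt{dml::submit}, \texttt{dml::sequence} and \texttt{dml::batch} follow the library documentation, and the transfer size and descriptor count are placeholder values, so the listing is not taken from the benchmark implementation. \par

\begin{lstlisting}[language=C++]
#include <dml/dml.hpp>

#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

constexpr std::size_t chunk = 4096; // assumed transfer size per descriptor
constexpr std::size_t count = 50;   // assumed number of descriptors per task set

int main() {
    std::vector<std::uint8_t> src(chunk * count, 0xAB);
    std::vector<std::uint8_t> dst(chunk * count, 0x00);

    // (1) single submission: one descriptor is prepared, submitted and waited on
    auto single = dml::submit<dml::hardware>(
        dml::mem_move,
        dml::make_view(src.data(), chunk),
        dml::make_view(dst.data(), chunk));
    auto single_result = single.get();

    // (2) multi submission: each descriptor is submitted on its own, but all of
    //     them are kept outstanding in the work queue before waiting for completion
    std::vector<decltype(single)> handlers;
    for (std::size_t i = 1; i < count; ++i) {
        handlers.push_back(dml::submit<dml::hardware>(
            dml::mem_move,
            dml::make_view(src.data() + i * chunk, chunk),
            dml::make_view(dst.data() + i * chunk, chunk)));
    }
    for (auto &handler : handlers) {
        handler.get();
    }

    // (3) batch submission: descriptors are collected in a sequence and the whole
    //     task set is handed to the hardware with a single submission
    auto sequence = dml::sequence(count, std::allocator<dml::byte_t>());
    for (std::size_t i = 0; i < count; ++i) {
        sequence.add(dml::mem_move,
                     dml::make_view(src.data() + i * chunk, chunk),
                     dml::make_view(dst.data() + i * chunk, chunk));
    }
    auto batch_result = dml::execute<dml::hardware>(dml::batch, sequence);

    return (single_result.status == dml::status_code::ok &&
            batch_result.status == dml::status_code::ok) ? 0 : 1;
}
\end{lstlisting}

Single and multi submission pay the submission cost once per descriptor, while the batch pays it only once per task set, which is the difference evaluated in this section. \par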
\begin{itemize}
\item submit cost analysis: best method and, for a subset, the point at which the submit cost is lower than the time savings
\item display the full opt-submitmethod graph
\item maybe remeasure with a higher number of small copies? results look somewhat weird for 1 KiB and 4 KiB
\item display the stacked bar of submit and complete time for single@1k, single@4k and single@1MiB for HW-path and SW-path (see the timing sketch after this list)
\item display the stacked bar of submit and complete time for batch50@1k, batch50@4k and batch50@1MiB for HW-path and SW-path
\item show batch because we care about the minimum task set size for a single producer (multi submit would be used for different task sets)
\item conclude at which point using the \gls{dsa} makes sense
\end{itemize}
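For the planned stacked bars, the time spent in submission has to be separated from the time spent waiting for completion. A minimal sketch of such a measurement is given below, again assuming the DML C++ interface; the helper name and the microsecond unit are illustrative choices. \par

\begin{lstlisting}[language=C++]
#include <dml/dml.hpp>

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical helper (not part of the benchmark): returns {submit time, complete time}
// in microseconds for one copy, the split required for the planned stacked bars.
template <typename execution_path>
std::pair<double, double> timed_copy(const std::uint8_t *src, std::uint8_t *dst, std::size_t size) {
    using clock = std::chrono::steady_clock;
    const auto to_us = [](auto duration) {
        return std::chrono::duration<double, std::micro>(duration).count();
    };

    const auto before_submit = clock::now();
    auto handler = dml::submit<execution_path>(
        dml::mem_move, dml::make_view(src, size), dml::make_view(dst, size));
    const auto after_submit = clock::now();  // descriptor prepared and enqueued

    handler.get();                           // block until the operation has completed
    const auto after_complete = clock::now();

    return {to_us(after_submit - before_submit), to_us(after_complete - after_submit)};
}

// usage: timed_copy<dml::hardware>(src, dst, 1 << 20) for the HW-path and
//        timed_copy<dml::software>(src, dst, 1 << 20) for the SW-path
\end{lstlisting}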
\todo{maybe remeasure with 8 KiB and 16 KiB as sizes}
From Figure \ref{fig:perf-submitmethod} we conclude that for transfers of 1 MiB and above, the submission method makes no noticeable difference. For smaller transfers the performance varies greatly, with batch operations leading in throughput. We assume that the high submission cost of the \gls{dsa:swq} penalizes all methods except the batch, which only performs one submission for its many descriptors. This is aligned with the finding that \enquote{SWQ observes lower throughput between 1-8 KB [transfer size]}\cite[p. 6 and 7]{intel:analysis}. \par
Another limitation can be observed in these results: the inherent throughput limit of close to 30 GiB/s per \gls{dsa} chip, which is apparently caused by I/O fabric limitations \cite[p. 5]{intel:analysis}. \par
\section{Multithreaded Submission}
\todo{write this section}
\section{Multiple Engines in a Group}
\todo{write this section}
\begin{itemize}
\item assumed from the architecture specification that multiple engines lead to greater performance
\item the reason is that page faults and access latency can be overlapped with preparing the next operation
\item in the given scenario we observe the opposite, a slight performance decrease
\item show multi submit with 50 copies for both one and four engines (see the sketch after this list)
\item maybe remeasure with each submission accessing a different memory region?
\item conclusion?
\end{itemize}
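A sketch of the multi-submission pattern referenced above is given below. It keeps a bounded window of copies outstanding, so that multiple engines in a group can pick up independent descriptors and overlap page faults or memory access latency of one operation with the processing of others. It again assumes the DML C++ interface, and the window size and buffer layout are illustrative assumptions rather than the benchmark configuration. \par

\begin{lstlisting}[language=C++]
#include <dml/dml.hpp>

#include <cstddef>
#include <cstdint>
#include <deque>

// Sketch only: keeps a bounded number of copies outstanding so that, with more than
// one engine configured in the group, independent descriptors are available to the
// engines. Window size and buffer layout are illustrative assumptions.
void copy_windowed(const std::uint8_t *src, std::uint8_t *dst,
                   std::size_t chunk, std::size_t count, std::size_t window) {
    using handler_t = decltype(dml::submit<dml::hardware>(
        dml::mem_move, dml::make_view(src, chunk), dml::make_view(dst, chunk)));
    std::deque<handler_t> in_flight;

    for (std::size_t i = 0; i < count; ++i) {
        if (in_flight.size() == window) {
            in_flight.front().get();         // retire the oldest copy first
            in_flight.pop_front();
        }
        in_flight.push_back(dml::submit<dml::hardware>(
            dml::mem_move,
            dml::make_view(src + i * chunk, chunk),
            dml::make_view(dst + i * chunk, chunk)));
    }
    while (!in_flight.empty()) {             // drain the remaining completions
        in_flight.front().get();
        in_flight.pop_front();
    }
}
\end{lstlisting}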
\section{Data Movement from DDR to HBM}\label{sec:perf:datacopy}