\item submit cost analysis: determine the best submission method and, for a subset, the point at which submit cost $<$ time savings
\item display the full opt-submitmethod graph
\item maybe remeasure with a higher number of small copies? results look somewhat odd for 1k and 4k
\item display the stacked bar of submit and complete time for single@1k, single@4k, single@1mib for HW-path and SW-path
\item display the stacked bar of submit and complete time for batch50@1k, batch50@4k, batch50@1mib for HW-path and SW-path
\item show batch because we care about the minimum task set size for a single producer (multi submit would be used for different task sets)
\item conclude at which point using the DSA makes sense
\end{itemize}
\end{itemize}
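The break-even point named above can be stated as a condition on the transfer size. This is a sketch only: the symbols $t_{\mathrm{sub}}$ (submission cost), $t_{\mathrm{hw}}$ (completion time on the DSA) and $t_{\mathrm{sw}}$ (duration of the same copy on the CPU) are notation assumed here, not taken from the measurements. Offloading a transfer of size $s$ pays off once
\[
  t_{\mathrm{sub}}(s) < t_{\mathrm{sw}}(s) - t_{\mathrm{hw}}(s),
\]
that is, once the submit cost is smaller than the time saved. The minimum useful transfer size would then be
\[
  s^{*} = \min \{\, s \mid t_{\mathrm{sub}}(s) + t_{\mathrm{hw}}(s) < t_{\mathrm{sw}}(s) \,\},
\]
evaluated separately for single and batch submission.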
\section{Multithreaded Submission}
\begin{itemize}
\item effect of multithreaded submission is low: the \gls{dsa:swq} is implicitly synchronized and bandwidth is shared between submitters
\item show results for all available core counts
\item only display the 1engine tests
\item show combined total throughput
\item conclude that, because of the implicit synchronization, the synchronization cost also applies to single-threaded submission, so the thread count makes no difference; bandwidth is shared, with no guarantees on fairness
\end{itemize}
\section{Multiple Engines in a Group}
\begin{itemize}
\item assumed from the architecture specification that multiple engines lead to greater performance
\item reason is that page faults and access latency will be overlapped with preparing the next operation
\item in the given scenario we observe the opposite: a slight performance decrease
\item show multisubmit 50 for both 1e and 4e
\item maybe remeasure with each submission accessing a different memory region?
\item conclusion?
\end{itemize}
\section{Data Movement from DDR to HBM}
\begin{itemize}
\item present two copy methods: smart and brute force
\item show graph for DDR $\rightarrow$ HBM intranode, intrasocket, and intersocket
\item conclude which option makes more sense (smart)
\item because brute force takes 4x or 2x the utilization for only 1.5x or 1.25x speedup, respectively
\item maybe benchmark smart-copy intersocket in parallel with two smart-copies intrasocket vs.\ the same task with brute force
\end{itemize}
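The utilization argument above can be made explicit with a short calculation. As an assumed figure of merit (not defined in the measurements), let $\eta = S / U$ be the speedup $S$ of brute force over smart copy divided by its relative utilization $U$, so that smart copy itself has $\eta = 1$:
\[
  \eta_{4\times} = \frac{1.5}{4} = 0.375, \qquad \eta_{2\times} = \frac{1.25}{2} = 0.625 .
\]
Under this measure, brute force delivers only 37.5\% or 62.5\% of the smart copy's throughput per unit of utilization.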
\section{Analysis}
\begin{itemize}
\begin{itemize}
\item summarize the conclusions and define the point at which using the DSA makes sense
\item minimum transfer size for batch/nonbatch operation
\item effect of multithreaded submission $\rightarrow$ no fairness guarantees
\item usage of multiple engines $\rightarrow$ no effect
\item smart copy method as the middle ground between peak throughput and utilization
\item lower utilization of the DSA is beneficial when it is shared between threads/processes