\chapter{Performance Microbenchmarks}
\label{sec:perf}

% This is the central chapter of the thesis. It presents the goal as well
% as the author's own ideas, assessments, and design decisions. It can be
% worthwhile to play through several options and then explicitly justify
% why a particular one was chosen. This chapter should be sketched, at
% least in keywords, as soon as the first design decisions are made.
% In a normally progressing thesis, however, it will change continually.
% The chapter must not become too detailed, or the reader will get bored.
% It is very important to find the right level of abstraction. When
% writing, pay attention to the reusability of the text.

% If a publication is planned from the thesis, parts of this chapter can
% be reused for it. The chapter will usually have at least 8 pages; more
% than 20 can be a sign that the level of abstraction was missed.
\begin{itemize}
\item
\end{itemize}
\section{Benchmarking Methodology}
\begin{itemize}
\item
\end{itemize}
\section{Submission Method}
\begin{itemize}
\item submit cost analysis: determine the best submission method and, for a subset of configurations, the point at which submit cost $<$ time savings
\item display the full opt-submitmethod graph
\item possibly re-measure with a larger number of small copies; the results for 1k and 4k look somewhat odd
\item display stacked bars of submit and completion time for single@1k, single@4k, and single@1mib on the HW path and the SW path
\item display stacked bars of submit and completion time for batch50@1k, batch50@4k, and batch50@1mib on the HW path and the SW path
\item show batch because we care about the minimum task-set size for a single producer (multi-submit would be used for different task sets)
\item conclude at which point using the DSA makes sense
\end{itemize}
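
One possible way to state the break-even point named in the list above is sketched here. The symbols are introduced for illustration only and are not taken from the measurements: for a transfer of size $s$, let $t_{\mathrm{submit}}$ be the submission overhead, $t_{\mathrm{dsa}}(s)$ the time until the DSA completes the copy, and $t_{\mathrm{cpu}}(s)$ the time the same copy takes on the CPU. Under these assumptions, offloading pays off once
\begin{equation}
  t_{\mathrm{submit}} + t_{\mathrm{dsa}}(s) < t_{\mathrm{cpu}}(s)
  \quad\Longleftrightarrow\quad
  t_{\mathrm{submit}} < t_{\mathrm{cpu}}(s) - t_{\mathrm{dsa}}(s),
\end{equation}
where the right-hand side of the second form is the time saved. The smallest $s$ that satisfies the inequality would then be the minimum transfer size for the given submission method.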
\section{Multithreaded Submission}
\begin{itemize}
\item effect of multithreaded submission (mt-submit): low, because the \gls{dsa:swq} is implicitly synchronized and bandwidth is shared
\item show results for all available core counts
\item only display the 1engine tests
\item show the combined total throughput
\item conclude that, because of the implicit synchronization, the synchronization cost also applies to a single thread (1t), so the thread count makes no difference; bandwidth is shared, with no guarantees on fairness
\end{itemize}
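
The combined total throughput mentioned above can be written down as a simple sum. As an illustrative notation that is not part of the measurements, let $B_i$ denote the throughput observed by thread $i$ out of $n$ submitting threads, and $B_{\mathrm{total}}$ the combined throughput:
\[
  B_{\mathrm{total}} = \sum_{i=1}^{n} B_i .
\]
If the bandwidth were shared evenly, each thread would see $B_i = B_{\mathrm{total}} / n$. Since there are no fairness guarantees, this even split is only a reference point, and individual $B_i$ may deviate from it.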
\section{Multiple Engines in a Group}
\begin{itemize}
\item assumed from the architecture specification that multiple engines lead to greater performance
\item the reason is that page faults and access latency are overlapped with preparing the next operation
\item in the given scenario we observe the opposite: a slight performance decrease
\item show multisubmit 50 for both 1e and 4e
\item possibly re-measure with each submission accessing a different memory region?
\item conclusion?
\end{itemize}
\section{Data Movement from DDR to HBM}
\begin{itemize}
\item present two copy methods: smart and brute force
\item show graphs for DDR$\rightarrow$HBM intranode, DDR$\rightarrow$HBM intrasocket, and DDR$\rightarrow$HBM intersocket
\item conclude which option makes more sense (smart)
\item because 4$\times$ or 2$\times$ the utilization buys only a 1.5$\times$ or 1.25$\times$ speedup, respectively
\item possibly benchmark a smart copy intersocket in parallel with two smart copies intrasocket vs.\ the same task with brute force
\end{itemize}
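
A short worked calculation makes the trade-off in the list above explicit. It assumes that ``utilization'' counts the relative amount of DSA resources occupied, which is an interpretation and not stated in the notes. Dividing each speedup by the utilization it requires gives the speedup per unit of utilization:
\[
  \frac{1.5}{4} = 0.375, \qquad \frac{1.25}{2} = 0.625 .
\]
Both values are well below $1$, so under this reading the additional resources are used far less efficiently than a proportional scaling would suggest.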
\section{Analysis}
\begin{itemize}
\item summarize the conclusions and define the point at which the DSA makes sense
\item minimum transfer size for batch and non-batch operation
\item effect of mt-submit $\rightarrow$ no fairness guarantees
\item use of multiple engines $\rightarrow$ no effect
\item smart copy method as the middle ground between peak throughput and utilization
\item lower utilization of the DSA is good when it is shared between threads/processes
\end{itemize}
\cleardoublepage
%%% Local Variables:
%%% TeX-master: "diplom"
%%% End: