% This is the central chapter of the thesis. It presents the goal as well as
% the author's own ideas, assessments and design decisions. It should
% comprise at least 8 pages; more than 20 may be a hint that the level of
% abstraction has been missed.
\todo{write introductory paragraph}
\section{Benchmarking Methodology}
\todo{write this section}
\section{Submission Method}
\todo{write this section}
\begin{itemize}
\item submit cost analysis: determine the best submission method and, for a subset, the point at which the submission cost is outweighed by the time saved (see the sketch following this list)
\item display the full opt-submitmethod graph
\end{itemize}
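
The following is a minimal sketch of the two submission styles offered by \gls{intel:dml}\cite{intel:dmldoc}, not the benchmark code itself: a blocking call via \texttt{dml::execute} and an asynchronous call via \texttt{dml::submit} whose handler is waited on later. The function names, buffer parameters and the omitted error handling are illustrative assumptions.

\begin{verbatim}
#include <dml/dml.hpp>
#include <cstdint>
#include <cstddef>

// Blocking submission: the calling thread waits until the copy finished.
void copy_blocking(uint8_t* src, uint8_t* dst, std::size_t n) {
    auto result = dml::execute<dml::hardware>(
        dml::mem_copy, dml::make_view(src, n), dml::make_view(dst, n));
    (void)result; // error handling omitted in this sketch
}

// Asynchronous submission: the submit cost only pays off if useful work
// can be overlapped between submit and the later call to get().
void copy_async(uint8_t* src, uint8_t* dst, std::size_t n) {
    auto handler = dml::submit<dml::hardware>(
        dml::mem_copy, dml::make_view(src, n), dml::make_view(dst, n));
    // ... overlap other work here ...
    auto result = handler.get();
    (void)result; // error handling omitted in this sketch
}
\end{verbatim}

The asynchronous variant only pays off when the submitting thread can perform useful work between \texttt{dml::submit} and \texttt{handler.get()}, which is the trade-off the submit cost analysis aims to quantify.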
\section{Multithreaded Submission}
\todo{write this section}
\begin{itemize}
\item effect of multithreaded submission is low because the \gls{dsa:swq} is implicitly synchronized and the bandwidth is shared between threads (see the sketch following this list)
\item show results for all available core counts
\item conclude that, due to the implicit synchronization, the synchronization cost also affects single-threaded submission and therefore makes no difference; bandwidth is shared and there are no fairness guarantees
\end{itemize}
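
To make the list above concrete, the following sketch shows how multithreaded submission could be exercised, assuming one large buffer is split into per-thread chunks; the function and parameter names are illustrative and not the benchmark code used here.

\begin{verbatim}
#include <dml/dml.hpp>
#include <thread>
#include <vector>
#include <cstdint>
#include <cstddef>

// Each thread submits its own chunk to the shared work queue. Since the
// queue is implicitly synchronized and the memory bandwidth is shared,
// aggregate throughput is not expected to scale with the thread count.
void copy_with_threads(uint8_t* src, uint8_t* dst,
                       std::size_t n, unsigned threads) {
    std::vector<std::thread> workers;
    const std::size_t chunk = n / threads;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([=]() {
            auto result = dml::execute<dml::hardware>(
                dml::mem_copy,
                dml::make_view(src + t * chunk, chunk),
                dml::make_view(dst + t * chunk, chunk));
            (void)result; // error handling omitted in this sketch
        });
    }
    for (auto& w : workers) w.join();
}
\end{verbatim}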
\section{Multiple Engines in a Group}
\todo{write this section}
\begin{itemize}
\item assumed from the architecture specification that multiple engines lead to greater performance
\item the reason is that page faults and access latency of one operation can be overlapped with the preparation of the next (see the sketch following this list)
\item conclusion?
\end{itemize}
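
The overlap argument from the list above can be sketched as follows: a single copy is split into several descriptors which are all submitted before any is waited on, so that the latency of one chunk (for example due to page faults) can overlap with the submission and execution of the others. The chunk count, the function name and the omitted error and remainder handling are illustrative assumptions.

\begin{verbatim}
#include <dml/dml.hpp>
#include <vector>
#include <cstdint>
#include <cstddef>

// Split one copy into several descriptors and wait only after all of
// them have been submitted, allowing engine-side latency of one chunk
// to overlap with the handling of the next.
void copy_in_chunks(uint8_t* src, uint8_t* dst,
                    std::size_t n, std::size_t chunks) {
    using handler_t = decltype(dml::submit<dml::hardware>(
        dml::mem_copy, dml::make_view(src, n), dml::make_view(dst, n)));

    std::vector<handler_t> handlers;
    const std::size_t step = n / chunks;
    for (std::size_t c = 0; c < chunks; ++c) {
        handlers.push_back(dml::submit<dml::hardware>(
            dml::mem_copy,
            dml::make_view(src + c * step, step),
            dml::make_view(dst + c * step, step)));
    }
    for (auto& h : handlers) {
        auto result = h.get();
        (void)result; // error handling omitted in this sketch
    }
}
\end{verbatim}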
\section{Data Movement from DDR to HBM}\label{sec:perf:datacopy}
\todo{write this section}
\begin{itemize}
\item present two copy methods: smart and brute force (a basic DDR-to-HBM copy is sketched after this list)
\end{itemize}
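
As a baseline for this section, the following sketch shows a plain copy from a buffer residing in DDR into a buffer allocated on an HBM node via libnuma, with \gls{intel:dml}\cite{intel:dmldoc} performing the transfer. The function name and the node id parameter are illustrative assumptions and depend on the system topology.

\begin{verbatim}
#include <dml/dml.hpp>
#include <numa.h>
#include <cstdint>
#include <cstddef>

// Copy a DDR-resident buffer into memory allocated on an HBM node.
// The node id depends on the system topology and is passed in here.
uint8_t* copy_to_hbm(uint8_t* src, std::size_t n, int hbm_node) {
    auto* dst = static_cast<uint8_t*>(numa_alloc_onnode(n, hbm_node));
    if (dst == nullptr) return nullptr;

    auto result = dml::execute<dml::hardware>(
        dml::mem_copy, dml::make_view(src, n), dml::make_view(dst, n));
    if (result.status != dml::status_code::ok) {
        numa_free(dst, n);
        return nullptr;
    }
    return dst;
}
\end{verbatim}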
\section{Analysis}
\todo{write this section}
\begin{itemize}
\item summarize the conclusions and define the point at which using the \gls{dsa} makes sense
\item minimum transfer size for batch/non-batch operation
\item lower utilization of the \gls{dsa} is beneficial when it is shared between threads/processes
\end{itemize}
\todo{write introductory paragraph}
\section{Detailed Task Description}
\todo{write this section}
\begin{itemize}
\item give a slightly more detailed task description
\item perspective of "what problems have to be solved"
\end{itemize}

To allow rapid integration and ease developer workload, a simple interface was chosen. \par
As caching is performed asynchronously, the user may wish to wait on the operation. This is beneficial if other threads can make progress in parallel while the current thread waits for its data to become available in the faster cache, speeding up local computation. To achieve this, \texttt{Cache::Access} returns an instance of an object which will hereinafter be referred to as \texttt{CacheData}. Through \texttt{CacheData::GetDataLocation}, a pointer to the cached data can be retrieved, while \texttt{CacheData::WaitOnCompletion} only returns once the caching operation has completed, putting the current thread to sleep in the meantime and thereby allowing other threads to progress. \par
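
A consumer-side usage of this interface might then look as follows. This is an illustrative sketch only: the parameter list of \texttt{Cache::Access} (here a raw pointer and a size) and the value semantics of the returned object are assumptions, while the member names are those introduced above.

\begin{verbatim}
#include <cstdint>
#include <cstddef>

// Hypothetical consumer of the caching interface described above.
template <typename Cache>
void process(Cache& cache, uint8_t* data, std::size_t size) {
    auto entry = cache.Access(data, size);    // copy is started asynchronously

    // ... unrelated work proceeds while the copy is in flight ...

    entry.WaitOnCompletion();                 // sleeps until caching finished
    uint8_t* local = entry.GetDataLocation(); // pointer to the cached copy
    // ... computation now reads from the faster memory through local ...
}
\end{verbatim}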
When multiple consumers wish to access the same memory block through the \texttt{Cache}, we could either provide each with its own entry or share one entry among all consumers. The first option may cause high load on the accelerator due to multiple copy operations being submitted and also increases the memory footprint of the system. The latter option requires synchronization and a more complex design. As the cache size is restrictive, the latter was chosen. The already existing \texttt{CacheData} will be extended in scope to handle this by allowing copies of it to be created, which must synchronize with each other for \texttt{CacheData::WaitOnCompletion} and \texttt{CacheData::GetDataLocation}. \par
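
The shared-entry behaviour can be illustrated with two consumers requesting the same block: both receive a \texttt{CacheData} instance referring to one underlying entry, so only a single copy is submitted and both waits return once it has completed. Signatures are again assumptions for illustration.

\begin{verbatim}
#include <thread>
#include <cstdint>
#include <cstddef>

// Two consumers of the same memory block: the cache hands out copies of
// one shared CacheData entry instead of submitting two copy operations.
template <typename Cache>
void shared_access(Cache& cache, uint8_t* data, std::size_t size) {
    auto consumer = [&cache, data, size]() {
        auto entry = cache.Access(data, size);
        entry.WaitOnCompletion();   // both threads wait on the same copy
        // ... use entry.GetDataLocation() ...
    };

    std::thread a(consumer);
    std::thread b(consumer);
    a.join();
    b.join();
}
\end{verbatim}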
Due to its reliance on libnuma for NUMA awareness, \texttt{Cache} will only work on systems where this library is available. \par
\subsection{Thread Safety Guarantees}
After initialization, all available operations on \texttt{Cache} and \texttt{CacheData} are fully thread-safe, but may use locks internally to achieve this. In \ref{chap:implementation} we will go into more detail on how these guarantees are provided and how the cache can be optimized for specific use cases that may warrant less restrictive locking. \par
Compared with the challenges of ensuring correct entry lifetime and thread safety, applying the \gls{dsa} to the task of duplicating data is simple, thanks in part to \gls{intel:dml}\cite{intel:dmldoc}. Upon a call to \texttt{Cache::Access} which determines that the given memory pointer is not present in the cache, work is submitted to the accelerator. Beforehand, however, the desired location must be determined, which the user-defined cache placement policy function handles. With the desired placement obtained, the copy policy function then determines which nodes should take part in the copy operation, which is equivalent to selecting the accelerators following \ref{subsection:dsa-hwarch}. The work is thereby split across the selected accelerators, to which the work descriptors are submitted at this point. The handlers that \gls{intel:dml}\cite{intel:dmldoc} provides are then moved to the \texttt{CacheData} instance to permit the caller to wait upon caching completion. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed in \ref{chap:implementation}.
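
The submission flow just described may be summarized in the following sketch. The signatures of the policy functions, the helper name and the per-node device selection are assumptions made for illustration; the actual definitions are part of \ref{chap:implementation}.

\begin{verbatim}
#include <dml/dml.hpp>
#include <numa.h>
#include <functional>
#include <vector>
#include <utility>
#include <cstdint>
#include <cstddef>

// Assumed policy signatures: the placement policy chooses the destination
// node, the copy policy chooses the nodes whose engines perform the copy.
using PlacementPolicy = std::function<int(int src_node, std::size_t size)>;
using CopyPolicy = std::function<std::vector<int>(int src_node, int dst_node)>;

// Sketch of the flow: determine placement, select participating nodes,
// split the copy and submit one descriptor per node. The handlers are
// returned so that they can be moved into the CacheData instance.
auto submit_cache_copy(uint8_t* src, std::size_t size, int src_node,
                       const PlacementPolicy& place, const CopyPolicy& select) {
    const int dst_node = place(src_node, size);
    auto* dst = static_cast<uint8_t*>(numa_alloc_onnode(size, dst_node));
    const std::vector<int> nodes = select(src_node, dst_node);

    using handler_t = decltype(dml::submit<dml::hardware>(
        dml::mem_copy, dml::make_view(src, size), dml::make_view(dst, size)));
    std::vector<handler_t> handlers;

    // Remainder bytes from the integer division are ignored in this sketch.
    const std::size_t chunk = size / nodes.size();
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        // Selecting the engine belonging to nodes[i] is elided here; by
        // default the hardware path uses a device local to the caller.
        handlers.push_back(dml::submit<dml::hardware>(
            dml::mem_copy,
            dml::make_view(src + i * chunk, chunk),
            dml::make_view(dst + i * chunk, chunk)));
    }
    return std::make_pair(dst, std::move(handlers));
}
\end{verbatim}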