diff --git a/thesis/content/02_abstract.tex b/thesis/content/02_abstract.tex
index 7eafcc9..d19afb3 100644
--- a/thesis/content/02_abstract.tex
+++ b/thesis/content/02_abstract.tex
@@ -10,6 +10,8 @@
 
 \ldots abstract \ldots
 
+\todo{write the abstract}
+
 %%% Local Variables:
 %%% TeX-master: "diplom"
 %%% End:
diff --git a/thesis/content/10_introduction.tex b/thesis/content/10_introduction.tex
index c20e592..249b04b 100644
--- a/thesis/content/10_introduction.tex
+++ b/thesis/content/10_introduction.tex
@@ -1,5 +1,5 @@
 \chapter{Introduction}
-\label{sec:intro}
+\label{chap:intro}
 
 % The introduction is written last, once the thesis is by and large
 % finished. (If you begin with the introduction - a
@@ -12,28 +12,8 @@
 % the rest of the thesis. Usually at least 4 pages are needed for it;
 % nobody reads more than 10 pages.
 
-\section{Introduction to Querry driven Prefetching}
-\begin{itemize}
- \item database context where we have an execution plan for a querry to be executed
- \item use knowledge about the querries sub-tasks to determine part of the table which is worth to cache (used multiple times)
- \item refer to whitepaper for more information
-\end{itemize}
-
-\section{Introduction to Intel Data Streaming Accelerator}
-
-\begin{itemize}
- \item
-\end{itemize}
-
-\section{Goal Definition}
-
-\begin{itemize}
- \item use DSA to offload asynchronous prefetching tasks
- \item prefetch into HBM which is smaller (cant hold all data) but faster (used as large cache)
- \item effect is lower cpu utilization for copy
- \item this allows to focus on actual pipeline execution
-\end{itemize}
+\todo{write this chapter}
 
 \cleardoublepage
diff --git a/thesis/content/30_performance.tex b/thesis/content/30_performance.tex
index 508d9d0..5a6f536 100644
--- a/thesis/content/30_performance.tex
+++ b/thesis/content/30_performance.tex
@@ -1,5 +1,5 @@
 \chapter{Performance Microbenchmarks}
-\label{sec:perf}
+\label{chap:perf}
 
 % This is the central chapter of the thesis. Here, the goal as well as
 % one's own ideas, assessments, and design decisions are presented. It can
@@ -18,18 +18,17 @@
 % probably have at least 8 pages; more than 20 can be a hint
 % that the level of abstraction was missed.
 
-\begin{itemize}
- \item
-\end{itemize}
+
+\todo{write introductory paragraph}
 
 \section{Benchmarking Methodology}
 
-\begin{itemize}
- \item
-\end{itemize}
+\todo{write this section}
 
 \section{Submission Method}
 
+\todo{write this section}
+
 \begin{itemize}
  \item submit cost analysis: best method, and for a subset, the point at which submit cost < time savings
  \item display the full opt-submitmethod graph
@@ -42,6 +41,8 @@
 
 \section{Multithreaded Submission}
 
+\todo{write this section}
+
 \begin{itemize}
  \item effect of mt-submit is low because the \gls{dsa:swq} is implicitly synchronized and bandwidth is shared
  \item show results for all available core counts
@@ -50,8 +51,10 @@
  \item conclude that, due to the implicit synchronization, the sync cost also affects 1t and therefore it makes no difference; bandwidth is shared with no guarantees on fairness
 \end{itemize}
 
 \section{Multiple Engines in a Group}
 
+\todo{write this section}
+
 \begin{itemize}
  \item assumed from arch spec that multiple engines lead to greater performance
  \item reason is that page faults and access latency will be overlapped with preparing the next operation
@@ -61,7 +66,9 @@
  \item conclusion?
 \end{itemize}
 
-\section{Data Movement from DDR to HBM}
+\section{Data Movement from DDR to HBM} \label{sec:perf:datacopy}
+
+\todo{write this section}
 
 \begin{itemize}
  \item present two copy methods: smart and brute force
@@ -73,6 +80,8 @@
 
 \section{Analysis}
 
+\todo{write this section}
+
 \begin{itemize}
  \item summarize the conclusions and define the point at which dsa makes sense
  \item minimum transfer size for batch/nonbatch operation
@@ -82,7 +91,6 @@
  \item lower utilization of dsa is good when it will be shared between threads/processes
 \end{itemize}
 
-
 \cleardoublepage
 
 %%% Local Variables:
diff --git a/thesis/content/40_design.tex b/thesis/content/40_design.tex
index d52cbb4..58c59fb 100644
--- a/thesis/content/40_design.tex
+++ b/thesis/content/40_design.tex
@@ -1,5 +1,5 @@
 \chapter{Design}
-\label{sec:design}
+\label{chap:design}
 
 % This is the central chapter of the thesis. Here, the goal as well as
 % one's own ideas, assessments, and design decisions are presented. It can
@@ -18,8 +18,12 @@
 % probably have at least 8 pages; more than 20 can be a hint
 % that the level of abstraction was missed.
 
+\todo{write introductory paragraph}
+
 \section{Detailed Task Description}
 
+\todo{write this section}
+
 \begin{itemize}
  \item give a slightly more detailed task description
  \item perspective of "what problems have to be solved"
@@ -36,7 +40,27 @@ To allow rapid integration and ease developer workload, a simple interface was c
 
 As caching is performed asynchronously, the user may wish to wait on the operation. This is beneficial when other threads can make progress in parallel while the current thread waits on its data becoming available in the faster cache, speeding up local computation. To achieve this, \texttt{Cache::Access} returns an instance of an object which will hereinafter be referred to as \texttt{CacheData}. Through \texttt{CacheData::GetDataLocation}, a pointer to the cached data can be retrieved, while \texttt{CacheData::WaitOnCompletion} returns only once the caching operation has completed, putting the current thread to sleep in the meantime and thereby allowing other threads to progress. \par
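+
+A minimal sketch of the intended usage follows. It is illustrative only: the exact signature of \texttt{Cache::Access}, the value semantics of \texttt{CacheData}, and the helper \texttt{Compute} are assumptions at this stage of the design.
+
+\begin{lstlisting}[language=C++]
+// hypothetical usage sketch; signatures are design assumptions
+void Process(Cache& cache, uint8_t* data, size_t size) {
+  // request the block; the copy into the faster memory is
+  // performed asynchronously by the accelerator
+  CacheData cached = cache.Access(data, size);
+
+  // the current thread may perform other useful work here
+  // while the caching operation is still in flight
+
+  // returns only once the caching operation has completed
+  cached.WaitOnCompletion();
+
+  // retrieve a pointer to the cached copy and operate on it
+  Compute(cached.GetDataLocation(), size);
+}
+\end{lstlisting}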
 
-\subsection{Cache Entry Reuse}
+\subsection{Cache Entry Reuse} \label{subsec:design:cache-entry-reuse}
 
 When multiple consumers wish to access the same memory block through the \texttt{Cache}, we could either provide each with its own entry or share one entry among all consumers. The first option may cause high load on the accelerator due to multiple copy operations being submitted, and it also increases the memory footprint of the system. The latter option requires synchronization and a more complex design. As the cache size is restrictive, the latter was chosen. The already existing \texttt{CacheData} will be extended in scope to handle this by allowing copies of it to be created, which must synchronize with each other for \texttt{CacheData::WaitOnCompletion} and \texttt{CacheData::GetDataLocation}. \par
 
@@ -56,11 +60,37 @@ Due to its reliance on libnuma for numa awareness, \texttt{Cache} will only work
 
 \subsection{Thread Safety Guarantees}
 
-After initialization, all available operations for \texttt{Cache} and \texttt{CacheData} are fully threadsafe but may use locks internally to achieve this. In \ref{sec:implementation} we will go into more detail on how these guarantees are provided and how to optimize the cache for specific use cases that may warrant less restrictive locking. \par
+After initialization, all available operations for \texttt{Cache} and \texttt{CacheData} are fully thread-safe but may use locks internally to achieve this. In \ref{chap:implementation} we will go into more detail on how these guarantees are provided and how to optimize the cache for specific use cases that may warrant less restrictive locking. \par
 
-\subsection{Accelerator Usage}
+\subsection{Accelerator Usage} \label{subsec:design:accel-usage}
 
-Compared with the challenges of ensuring correct entry lifetime and thread safety, the application of \gls{dsa} for the task of duplicating data is simple, thanks partly to \gls{intel:dml} \cite{intel:dmldoc}. Upon a call to \texttt{Cache::Access} and determining that the given memory pointer is not present in cache, work will be submitted to the Accelerator. Before, however, the desired location must be determined which the user-defined cache placement policy function handles. With the desired placement obtained, the copy policy function then determines, which nodes should take part in the copy operation which is equivalent to selecting the Accelerators following \ref{subsection:dsa-hwarch}. This causes the work to be split upon the available accelerators to which the work descriptors are submitted at this time. The handlers that \gls{intel:dml} \cite{intel:dmldoc} provides will then be moved to the \texttt{CacheData} instance to permit the callee to wait upon caching completion. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed in \ref{sec:implementation}.
+Compared with the challenges of ensuring correct entry lifetime and thread safety, applying the \gls{dsa} to the task of duplicating data is simple, thanks in part to \gls{intel:dml} \cite{intel:dmldoc}. Upon a call to \texttt{Cache::Access} that determines the given memory pointer not to be present in the cache, work is submitted to the accelerator. First, however, the desired location must be determined, which is handled by the user-defined cache placement policy function. With the placement obtained, the copy policy function then determines which nodes should take part in the copy operation, which is equivalent to selecting the accelerators following \ref{subsection:dsa-hwarch}. The work is thereby split across the available accelerators, to which the work descriptors are submitted. The handlers that \gls{intel:dml} \cite{intel:dmldoc} provides are then moved into the \texttt{CacheData} instance to permit the caller to wait upon caching completion. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed in \ref{chap:implementation}.
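+
+To make the submission pattern concrete, the following self-contained sketch shows an asynchronous copy via the C++ API of \gls{intel:dml} \cite{intel:dmldoc}: submission returns a handler immediately, and completion is awaited later, mirroring what \texttt{Cache::Access} and \texttt{CacheData::WaitOnCompletion} are intended to do internally. The buffer size and the use of a single accelerator are illustrative; running on the hardware path requires a configured \gls{dsa}, otherwise \texttt{dml::software} may be substituted.
+
+\begin{lstlisting}[language=C++]
+#include <dml/dml.hpp>
+#include <cstdint>
+#include <vector>
+
+int main() {
+  std::vector<uint8_t> src(4096, 42); // stand-in for source data in DDR
+  std::vector<uint8_t> dst(4096, 0);  // stand-in for the cache slot in HBM
+
+  // submit the copy on the hardware path (the DSA); this call returns
+  // immediately, handing back a handler for the in-flight operation
+  auto handler = dml::submit<dml::hardware>(
+      dml::mem_move,
+      dml::make_view(src.data(), src.size()),
+      dml::make_view(dst.data(), dst.size()));
+
+  // ... the caller is free to perform other work here ...
+
+  // wait for completion, as CacheData::WaitOnCompletion would do
+  auto result = handler.get();
+  return result.status == dml::status_code::ok ? 0 : 1;
+}
+\end{lstlisting}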
 
 \cleardoublepage
 
diff --git a/thesis/content/60_evaluation.tex b/thesis/content/60_evaluation.tex
index 6f34dec..4670fcf 100644
--- a/thesis/content/60_evaluation.tex
+++ b/thesis/content/60_evaluation.tex
@@ -1,5 +1,5 @@
 \chapter{Evaluation}
-\label{sec:evaluation}
+\label{chap:evaluation}
 
 % Every thesis in our field includes a performance evaluation. From
 % this chapter it should become clear which methods were applied,
@@ -12,6 +12,8 @@
 
 \ldots evaluation \ldots
 
+\todo{write this chapter}
+
 \cleardoublepage
 
 %%% Local Variables: