
write section 4.2 on cache design

master
Constantin Fürst 11 months ago
parent
commit
f910efd7e8
  1. BIN
      thesis/bachelor.pdf
  2. 7
      thesis/content/10_introduction.tex
  3. 2
      thesis/content/20_state.tex
  4. 41
      thesis/content/40_design.tex
  5. 7
      thesis/own.gls
  6. 1
      thesis/preamble/packages.tex

BIN
thesis/bachelor.pdf

7
thesis/content/10_introduction.tex

@ -15,19 +15,22 @@
\section{Introduction to Query Driven Prefetching}
\begin{itemize}
\item database context where we have an execution plan for a query to be executed
\item use knowledge about the query's sub-tasks to determine parts of the table which are worth caching (used multiple times)
\item refer to the whitepaper for more information
\end{itemize}
\section{Introduction to Intel Data Streaming Accelerator}
\begin{itemize}
\item
\item
\end{itemize}
\section{Goal Definition}
\begin{itemize}
\item use DSA to offload asynchronous prefetching tasks
\item prefetch into HBM which is smaller (can't hold all data) but faster (used as a large cache)
\item effect is lower CPU utilization for copy
\item this allows to focus on actual pipeline execution
\end{itemize}

2
thesis/content/20_state.tex

@ -39,7 +39,7 @@ Introduced with the 4th generation of Intel Xeon Scalable Processors \cite{intel
To be able to utilize the hardware optimally, knowledge of its workings is required to make educated decisions. Therefore, this section describes both the workings of the \gls{dsa} engine itself and the view that is presented through software interfaces. All statements are based on Chapter 3 of the Architecture Specification by Intel \cite{intel:dsaspec}. \par
\subsection{Hardware Architecture}
\subsection{Hardware Architecture} \label{subsection:dsa-hwarch}
\begin{figure}[H]
\centering

41
thesis/content/40_design.tex

@ -18,7 +18,6 @@
% wohl mindestens 8 Seiten haben, mehr als 20 können ein Hinweis darauf
% sein, daß das Abstraktionsniveau verfehlt wurde.
\section{Detailed Task Description}
\begin{itemize}
@ -27,13 +26,41 @@
\item not "what is query driven prefetching"
\end{itemize}
\section{Design Choices}
\section{Cache Design}
\begin{itemize}
\item explain the design choices made to solve the problems
\item this should go into theoretical details - no code
\item when we copy, how we submit, who submits, which DSA are used
\end{itemize}
The task of prefetching is closely aligned with that of a cache. As a cache is more generic and allows use beyond Query Driven Prefetching, the choice was made to solve the prefetching offload by implementing an offloading \texttt{Cache}. When referring to the provided implementation, \texttt{Cache} will be used from now on. The interface of \texttt{Cache} must provide three basic functions: requesting a memory block to be cached, accessing a cached memory block and synchronizing the cache with the source memory. The latter operation comes into play when the cached data may also be modified, requiring the entry to be updated from the source or the other way around. Due to the many possible setups and use cases, the user should also be responsible for choosing the cache placement and the copy method. As re-caching is resource intensive, data should remain in the cache for as long as possible, being removed only when memory pressure due to the restrictive memory size drives the \texttt{Cache} to flush unused entries. \par
\subsection{Interface}
To allow rapid integration and ease developer workload, a simple interface was chosen. As this work primarily focuses on caching static data, the choice was made to provide only cache invalidation and not synchronization. Given a memory address, \texttt{Cache::Invalidate} will remove all entries for it. The other two operations are provided in one single function, which we shall call \texttt{Cache::Access} henceforth. Receiving a data pointer and size, it takes care of either submitting a caching operation, if the pointer received is not yet cached, or returning the cache entry if it is. The cache placement and the assignment of the task to accelerators are controlled by the user. In addition to the two basic operations outlined before, the user is also given the option to manually flush unused elements from the cache using \texttt{Cache::Flush} or to clear it completely with \texttt{Cache::Clear}. \par
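The interface described above might be sketched as follows. This is a minimal illustrative stand-in, not the thesis implementation: the map-based storage, the synchronous \texttt{std::memcpy} in place of the offloaded copy, and all signatures are assumptions; only the operation names follow the text.

```cpp
// Hypothetical sketch of the Cache interface described above. The
// synchronous memcpy stands in for the accelerator-offloaded copy.
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <unordered_map>

class CacheData {
public:
    CacheData(const void* src, std::size_t size)
        : data_(std::malloc(size), [](void* p) { std::free(p); }) {
        std::memcpy(data_.get(), src, size); // placeholder for the offloaded copy
    }
    void* GetDataLocation() const { return data_.get(); }
    void WaitOnCompletion() { /* copy is synchronous in this sketch */ }
private:
    std::shared_ptr<void> data_;
};

class Cache {
public:
    // submits a caching operation if src is unknown, else returns the entry
    std::shared_ptr<CacheData> Access(void* src, std::size_t size) {
        auto it = entries_.find(src);
        if (it != entries_.end()) return it->second;
        auto entry = std::make_shared<CacheData>(src, size);
        entries_.emplace(src, entry);
        return entry;
    }
    // removes all entries with an equivalent source pointer
    void Invalidate(void* src) { entries_.erase(src); }
    // evicts all entries
    void Clear() { entries_.clear(); }
private:
    std::unordered_map<void*, std::shared_ptr<CacheData>> entries_;
};
```

A second \texttt{Access} with the same pointer returns the existing entry instead of submitting another copy, matching the reuse behaviour described later.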
As caching is performed asynchronously, the user may wish to wait on the operation. This is beneficial if other threads can make progress in parallel while the current thread waits on its data becoming available in the faster cache, speeding up local computation. To achieve this, \texttt{Cache::Access} will return an instance of an object which will hereinafter be referred to as \texttt{CacheData}. Through \texttt{CacheData::GetDataLocation}, a pointer to the cached data can be retrieved, while \texttt{CacheData::WaitOnCompletion} returns only once the caching operation has completed, putting the current thread to sleep in the meantime and thereby allowing other threads to progress. \par
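The asynchronous access pattern can be sketched as below; \texttt{std::async} stands in for the work submission to the accelerator, and everything beyond the names taken from the text is an assumption.

```cpp
// Sketch of asynchronous caching: the constructor launches the copy in the
// background; WaitOnCompletion blocks the calling thread until it finishes.
#include <cstddef>
#include <cstring>
#include <future>
#include <vector>

class CacheData {
public:
    CacheData(const void* src, std::size_t size) : buffer_(size) {
        // a real implementation would submit a DSA work descriptor here
        done_ = std::async(std::launch::async, [this, src, size] {
            std::memcpy(buffer_.data(), src, size);
        }).share();
    }
    // puts the calling thread to sleep until the copy has completed
    void WaitOnCompletion() { done_.wait(); }
    // only valid once WaitOnCompletion has returned
    void* GetDataLocation() { return buffer_.data(); }
private:
    std::vector<unsigned char> buffer_;
    std::shared_future<void> done_;
};
```

While the copy is in flight, other threads may continue their own computation, which is the benefit argued for above.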
\subsection{Cache Entry Reuse}
When multiple consumers wish to access the same memory block through the \texttt{Cache}, we could either provide each with their own entry or share one entry among all consumers. The first option may cause high load on the accelerator due to multiple copy operations being submitted and also increases the memory footprint of the system. The latter option requires synchronization and a more complex design. As the cache size is restrictive, the latter was chosen. The already existing \texttt{CacheData} will be extended in scope to handle this by allowing copies of it to be created which must synchronize with each other for \texttt{CacheData::WaitOnCompletion} and \texttt{CacheData::GetDataLocation}. \par
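One way to realize the shared-entry design is for copies of a \texttt{CacheData} handle to refer to the same underlying buffer and completion state, so that all consumers of one memory block synchronize on a single copy operation. The following sketch assumes \texttt{std::shared\_ptr} and \texttt{std::shared\_future} as the sharing mechanism; this is an illustration, not the actual implementation.

```cpp
// Sketch of cache entry reuse: copied handles share one buffer and one
// completion future, so only a single copy operation is ever submitted.
#include <cstddef>
#include <cstring>
#include <future>
#include <memory>
#include <vector>

class CacheData {
public:
    CacheData(const void* src, std::size_t size)
        : buffer_(std::make_shared<std::vector<unsigned char>>(size)) {
        auto buf = buffer_;
        done_ = std::async(std::launch::async, [buf, src, size] {
            std::memcpy(buf->data(), src, size); // one copy for all handles
        }).share();
    }
    // copies share buffer_ and done_, synchronizing implicitly
    CacheData(const CacheData&) = default;
    void WaitOnCompletion() { done_.wait(); }
    void* GetDataLocation() { return buffer_->data(); }
private:
    std::shared_ptr<std::vector<unsigned char>> buffer_;
    std::shared_future<void> done_;
};
```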
\subsection{Cache Entry Lifetime}
By allowing multiple references to the same entry, memory management becomes a concern. Freeing the allocated block must only take place when all copies of a \texttt{CacheData} instance are destroyed, thereby tying the cache entry lifetime to the lifetime of the longest-living copy of the original instance. This makes access to the entry legal during the lifetime of any \texttt{CacheData} instance, while also guaranteeing that \texttt{Cache::Clear} will not have any unforeseen side effects, as deallocation only takes place when the last consumer lets its \texttt{CacheData} go out of scope or deletes it manually. \par
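This lifetime rule maps naturally onto reference counting; a brief sketch, assuming \texttt{std::shared\_ptr} ownership (the flag set by the custom deleter only serves to make the deallocation point observable):

```cpp
// Sketch of the lifetime rule: the cached block is freed only when the
// last CacheData copy is destroyed, never while any consumer holds one.
#include <cstddef>
#include <cstdlib>
#include <memory>

static bool g_freed = false; // set when the block is actually deallocated

class CacheData {
public:
    explicit CacheData(std::size_t size)
        : data_(std::malloc(size), [](void* p) { g_freed = true; std::free(p); }) {}
    CacheData(const CacheData&) = default; // copies share ownership
    void* GetDataLocation() const { return data_.get(); }
private:
    std::shared_ptr<void> data_;
};
```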
\subsection{Usage Restrictions}
As cache invalidation applies mainly to non-static data which this work does not focus on, two restrictions are placed on the invalidation operation. This permits drastically simpler cache design, as a fully coherent cache would require developing a thread safe coherence scheme which is outside our scope. \par
Firstly, overlapping regions in the cache will cause undefined behaviour during invalidation of any one of them. Only the entries with an equivalent source data pointer will be invalidated, while other entries with differing source pointers which, due to their size, still cover the now-invalidated region will not be invalidated, and therefore the cache may continue to contain invalid elements at this point. \par
Secondly, invalidation is to be performed manually, requiring the programmer to keep track of which data is cached at any given point in time and to invalidate the respective entries upon modification. No ordering guarantees are given for this situation, possibly leading to threads still holding a pointer to a now-outdated entry and continuing to work with it. \par
Due to its reliance on libnuma for NUMA awareness, \texttt{Cache} will only work on systems where this library is present, excluding, most notably, Windows from the compatibility list. \par
\subsection{Thread Safety Guarantees}
After initialization, all available operations on \texttt{Cache} and \texttt{CacheData} are fully thread-safe, though locks may be used internally to achieve this. In \ref{sec:implementation} we will go into more detail on how these guarantees are provided and how the cache may be optimized for specific use cases that warrant less restrictive locking. \par
\subsection{Accelerator Usage}
Compared with the challenges of ensuring correct entry lifetime and thread safety, applying the \gls{dsa} to the task of duplicating data is simple, thanks partly to \gls{intel:dml} \cite{intel:dmldoc}. Upon a call to \texttt{Cache::Access} that determines the given memory pointer to be absent from the cache, work will be submitted to the accelerator. Beforehand, however, the desired cache location must be determined, which the user-defined cache placement policy function handles. With the desired placement obtained, the copy policy function then determines which nodes should take part in the copy operation, which is equivalent to selecting the accelerators following \ref{subsection:dsa-hwarch}. The work is thereby split across the available accelerators, to which the work descriptors are submitted at this time. The handlers that \gls{intel:dml} \cite{intel:dmldoc} provides are then moved to the \texttt{CacheData} instance to permit the caller to wait for caching completion. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed in \ref{sec:implementation}.
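The split-and-submit step can be sketched as follows. Each chunk's \texttt{std::async} task stands in for a work descriptor that would be submitted via Intel DML to one \gls{dsa} engine; the even chunking scheme, the function name and its signature are assumptions made for illustration.

```cpp
// Sketch of splitting one caching operation across participating nodes:
// the source range is divided into per-node chunks, each copied by its
// own asynchronous task (stand-in for a per-accelerator work descriptor).
#include <cstddef>
#include <cstring>
#include <future>
#include <vector>

std::vector<std::future<void>> submit_split_copy(
    void* dst, const void* src, std::size_t size, std::size_t num_nodes) {
    std::vector<std::future<void>> handlers; // handed to CacheData for waiting
    std::size_t chunk = size / num_nodes;
    for (std::size_t i = 0; i < num_nodes; ++i) {
        std::size_t off = i * chunk;
        std::size_t len = (i + 1 == num_nodes) ? size - off : chunk;
        handlers.push_back(std::async(std::launch::async, [=] {
            std::memcpy(static_cast<char*>(dst) + off,
                        static_cast<const char*>(src) + off, len);
        }));
    }
    return handlers;
}
```

The returned handlers play the role of the \gls{intel:dml} handlers moved into \texttt{CacheData}: completion of the whole caching operation means waiting on all of them.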
\cleardoublepage

7
thesis/own.gls

@ -87,4 +87,11 @@
long={Process Address Space ID},
first={Process Address Space ID (PASID)},
description={... desc ...}
}
\newglossaryentry{intel:dml}{
name={Intel DML},
long={Intel Data Mover Library},
first={Intel Data Mover Library (Intel DML)},
description={... desc ...}
}

1
thesis/preamble/packages.tex

@ -35,6 +35,7 @@
\usepackage{lastpage} % enables the usage of the label "LastPage" to get the
% number of pages with \pageref{LastPage}
\usepackage[nopostdot,nonumberlist]{glossaries}
\usepackage[all]{nowidow}
% use this one last
% (redefines some macros for compatibility with KOMAScript)
