rewrite with todos for design chapter

10 months ago · d61284f4f1
1 changed files with 23 additions and 30 deletions
--- a/thesis/content/40_design.tex
+++ b/thesis/content/40_design.tex
@@ -18,13 +18,9 @@
 % wohl mindestens 8 Seiten haben, mehr als 20 können ein Hinweis darauf
 % sein, daß das Abstraktionsniveau verfehlt wurde.

-In this chapter we design a class interface for use as a general purpose cache. We will present challenges and then detail the solutions employed to face them, finalizing the architecture with each step. Details on the implementation of this blueprint will be omitted, as we discuss a selection of relevant aspects in Chapter \ref{chap:implementation}. \par
+In this chapter, we formulate a class interface for a general-purpose cache. We outline the requirements and explain the solutions employed to address them, arriving at the final architecture. Implementation details of this blueprint are deferred to Chapter \ref{chap:implementation}, which discusses a selection of relevant aspects. \par

-\section{Cache Design} \label{sec:design:cache}
-
-The task of prefetching is somewhat aligned with that of a cache. As a cache is more generic and allows use beyond \gls{qdp}, the decision was made to address the prefetching in \gls{qdp} by implementing an offloading \texttt{Cache}. Henceforth, when referring to the provided implementation, we will use \texttt{Cache}. \todo{reformulate} \par
-
-The interface of \texttt{Cache} must provide three basic functions: (1) requesting a memory block to be cached, (2) accessing a cached memory block and (3) synchronizing cache with the source memory. The latter operation comes in to play when the data that is cached may also be modified, necessitating synchronization. Due various setups and use cases for this cache, the user should also be responsible for choosing cache placement and the copy method. As re-caching is resource intensive, data should remain in the cache for as long as possible. We only flush entries, when lack of free cache memory requires it. \par
+The target application of the code contributed by this work is the acceleration of \glsentrylong{qdp} by offloading copy operations to the \gls{dsa}. Prefetching is inherently related to cache functionality. Given that an application providing the latter is useful beyond \gls{qdp}, we opted to implement an offloading \texttt{Cache}. \par

 \begin{figure}[!t]
    \centering
@@ -33,45 +29,42 @@ The interface of \texttt{Cache} must provide three basic functions: (1) requesti
    \label{fig:impl-design-interface}
 \end{figure}

-\subsection{Interface}
-
-\todo{mention custom malloc for outsourcing allocation to user}
-\todo{give better intro for this subsection}
+\section{Interface}
+\label{sec:design:interface}

-Given that this work primarily focuses on caching static data, we only provide cache invalidation and not synchronization. The \texttt{Cache::Invalidate} function, given a memory address, will remove all entries for it from the cache. The other two operations, caching and access, are provided in one single function, which we shall henceforth call \texttt{Cache::Access}. This function receives a data pointer and size as parameters and takes care of either submitting a caching operation if the pointer received is not yet cached or returning the cache entry if it is. The user retains control over cache placement and the assignment of tasks to accelerators through mechanisms outlined in \ref{sec:design:accel-usage}. This interface is represented on the right block of Figure \ref{fig:impl-design-interface} labelled \enquote{Cache} and includes some additional operations beyond the basic requirements. \par
+The interface of \texttt{Cache} must provide three basic functions: (1) requesting a memory block to be cached, (2) accessing a cached memory block and (3) synchronizing the cache with the source memory. The last operation comes into play when the cached data may also be modified, necessitating synchronization. Due to the various setups and use cases for this cache, the user should also be responsible for choosing the cache placement and the copy method. As re-caching is resource-intensive, data should remain in the cache for as long as possible. We only flush entries when a lack of free cache memory requires it. \par

-Given the asynchronous nature of caching operations, users may opt to await their completion. This proves particularly beneficial when parallel threads are actively processing, and the current thread strategically pauses until its data becomes available in faster memory, thereby optimizing access speeds for local computations. \par
+Given that this work primarily focuses on caching static data, we only provide cache invalidation and not synchronization. The \texttt{Cache::Invalidate} function, given a memory address, will remove all entries for it from the cache. The other two operations, caching and access, are provided in one single function, which we shall henceforth call \texttt{Cache::Access}. This function receives a data pointer and size as parameters and takes care of either submitting a caching operation if the pointer received is not yet cached or returning the cache entry if it is. \par

-To facilitate this process \todo{not only for this, cachedata is also needed to allow async operation}, the \texttt{Cache::Access} method returns an instance of an object referred to as \texttt{CacheData}.  Figure \ref{fig:impl-design-interface} documents the public interface for \texttt{CacheData} on the left block labelled as such. Invoking \texttt{CacheData::GetDataLocation} provides access to a pointer to the location of the cached data. Additionally, the \texttt{CacheData::WaitOnCompletion} method is available, designed to return only upon the completion of the caching operation. During this period, the current thread will sleep, allowing unimpeded progress for other threads. To ensure that only pointers to valid memory regions are returned, this function must be called in order to update the cache pointer. It queries the completion state of the operation, and, on success, updates the cache pointer to the then available memory region. \par
+Given the asynchronous nature of caching operations, users may opt to await their completion. This is particularly beneficial when parallel threads are actively processing: the current thread pauses until its data becomes available in faster memory, which speeds up access for its local computations. To facilitate this, the \texttt{Cache::Access} method returns an instance of an object referred to as \texttt{CacheData}. Figure \ref{fig:impl-design-interface} documents the public interface for \texttt{CacheData} on the left block labelled as such. Invoking \texttt{CacheData::GetDataLocation} provides access to a pointer to the location of the cached data. Additionally, the \texttt{CacheData::WaitOnCompletion} method is available, designed to return only upon the completion of the caching operation. During this period, the current thread will sleep, allowing unimpeded progress for other threads. To ensure that only pointers to valid memory regions are returned, this function must be called in order to update the cache pointer, which otherwise has an undefined value. It queries the completion state of the operation and, on success, updates the cache pointer to the then available memory region. \par
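
The interplay of \texttt{WaitOnCompletion} and \texttt{GetDataLocation} can be illustrated with a minimal sketch. It is not the implementation of this work: the class name \texttt{CacheDataSketch} is hypothetical, \texttt{std::async} merely stands in for submitting a copy to an accelerator, and returning \texttt{nullptr} before the wait is a choice of the sketch, whereas the text above leaves the value undefined. \par

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <future>
#include <memory>
#include <vector>

// Hypothetical stand-in for CacheData: the copy runs asynchronously and the
// cache pointer only becomes visible after WaitOnCompletion().
class CacheDataSketch {
public:
    CacheDataSketch(const std::uint8_t* src, std::size_t size)
        : buffer_(std::make_shared<std::vector<std::uint8_t>>(size)) {
        auto buf = buffer_;
        // Assumption: std::async replaces the accelerator submission.
        copy_ = std::async(std::launch::async, [buf, src, size] {
            std::memcpy(buf->data(), src, size);
        }).share();
    }

    // Blocks until the copy has finished, then publishes the cache pointer.
    void WaitOnCompletion() {
        copy_.wait();
        location_ = buffer_->data();
    }

    // nullptr until WaitOnCompletion() was called (a choice of this sketch).
    std::uint8_t* GetDataLocation() const { return location_; }

private:
    std::shared_ptr<std::vector<std::uint8_t>> buffer_;
    std::shared_future<void> copy_;
    std::uint8_t* location_ = nullptr;
};
```

A caller would construct the object, continue with unrelated work, and call \texttt{WaitOnCompletion} immediately before the first read through \texttt{GetDataLocation}. \par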

-\subsection{Cache Entry Reuse} \label{subsec:design:cache-entry-reuse}
+\subsection{Policy Functions}
+\label{subsec:design:policy-functions}

-When multiple consumers wish to access the same memory block through the \texttt{Cache}, we face a choice between providing each with their own entry or sharing one for all consumers. The first option may lead to high load on the accelerator due to multiple copy operations being submitted and also increases the memory footprint of the cache. The latter option, although more complex, was chosen to address these concerns. To implement this, the existing \texttt{CacheData} will be extended in scope to handle multiple consumers. Copies of it can be created, and they must synchronize with each other for \texttt{CacheData::WaitOnCompletion} and \texttt{CacheData::GetDataLocation}. This is illustrated by the green markings, indicating thread safety guarantees for access, in Figure \ref{fig:impl-design-interface}. \par
+As stated in Section \ref{sec:design:interface}, the user is responsible for choosing the cache placement and for selecting the \gls{dsa}s that participate in a copy operation. As we will find out in Section \ref{sec:impl:application}, allocating memory inside the cache is not feasible due to the delays this may incur. Therefore, the user is also required to provide dynamic memory management functionality to the \texttt{Cache}. Placement and participant selection are realized by what we will call \enquote{Policy Functions}, which are function pointers passed on initialization, as visible in \texttt{Cache::Init} in Figure \ref{fig:impl-design-interface}. We use the same mechanism for memory management, requiring that function pointers performing allocation and deallocation be passed. As the choice of cache placement and copy policy is user-defined, one possible choice is discussed in Chapter \ref{chap:implementation}, while we detail the required behaviour here. \par

-\subsection{Cache Entry Lifetime} \label{subsec:design:cache-entry-lifetime}
+The policy functions receive parameters deemed sensible for determining placement and participation selection. Both are informed of the source \glsentryshort{node}, the \glsentryshort{node} requesting caching, and the data size. The cache placement policy then returns a \glsentryshort{node}-ID on which the data is to be cached, while the copy policy will provide the cache with a list of \glsentryshort{node}-IDs, detailing which \gls{dsa}s should participate in the operation. \par
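
As an illustration, the two policy functions might look as follows. The signatures, the parameter order, and the names \texttt{PlaceOnRequester} and \texttt{CopyWithBothEnds} are assumptions made for this sketch; the text above only fixes which inputs the policies receive and what they return. \par

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical signatures: both policies receive the source node, the node
// requesting caching, and the data size.
using PlacementPolicy = int (*)(int source_node, int requesting_node,
                                std::size_t size);
using CopyPolicy = std::vector<int> (*)(int source_node, int requesting_node,
                                        std::size_t size);

// Example placement policy: cache the data on the node that requested it.
int PlaceOnRequester(int /*source_node*/, int requesting_node,
                     std::size_t /*size*/) {
    return requesting_node;
}

// Example copy policy: the accelerators on the source node and on the
// requesting node participate; a single node is listed only once.
std::vector<int> CopyWithBothEnds(int source_node, int requesting_node,
                                  std::size_t /*size*/) {
    if (source_node == requesting_node) {
        return {source_node};
    }
    return {source_node, requesting_node};
}
```

Passing plain function pointers keeps the cache independent of any particular system topology, at the cost of the policies having to be stateless or to manage their own state. \par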

-Allowing multiple references to the same entry introduces concerns regarding memory management. The allocated block should only be freed when all copies of a \texttt{CacheData} instance are destroyed, thereby tying the cache entry's lifetime to the longest living copy of the original instance. This ensures that access to the entry is legal during the lifetime of any \texttt{CacheData} instance. Therefore, deallocation only occurs when the last copy of a \texttt{CacheData} instance is destroyed. \par
-
-\subsection{Usage Restrictions} \label{subsec:design:restrictions}
+For memory management, two functions are required, providing allocation (\texttt{malloc}) and deallocation (\texttt{free}) capabilities to the \texttt{Cache}. As the names suggest, these two functions must adhere to the thread safety guarantees and behaviour that the C++ standard sets forth for their namesakes. Most notably, \texttt{malloc} must never return an address that is still in use by an earlier or concurrent allocation. \par
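
A minimal sketch of such a pair of memory management functions is given below. The type names and signatures are assumptions; the sketch simply forwards to the standard allocator, which already satisfies the stated requirements, where a real setup would presumably allocate on a specific \glsentryshort{node}. \par

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical signatures for the user-provided memory management functions.
using MallocFn = void* (*)(std::size_t size);
using FreeFn = void (*)(void* ptr);

// Forwarding to std::malloc and std::free: both are thread-safe, and two
// live allocations never share an address.
void* SketchMalloc(std::size_t size) { return std::malloc(size); }
void SketchFree(void* ptr) { std::free(ptr); }

// Bundle as it might be handed over on initialization (assumed layout).
struct MemoryFunctions {
    MallocFn malloc_fn;
    FreeFn free_fn;
};
```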

-The cache, in the context of this work, primarily handles static data. Therefore, two restrictions are placed on the invalidation operation. This decision results in a drastically simpler cache design, as implementing a fully coherent cache would require developing a thread-safe coherence scheme, which is beyond the scope of our work. \par 
+\subsection{Cache Entry Reuse}
+\label{subsec:design:cache-entry-reuse}

-Firstly, overlapping areas in the cache will result in undefined behaviour during the invalidation of any one of them. Only the entries with the equivalent source pointer will be invalidated, while other entries with differing source pointers, which due to their size, still cover the now invalidated region, will remain unaffected. At this point, the cache may or may not continue to contain invalid elements. \par
+When multiple consumers wish to access the same memory block through the \texttt{Cache}, we face a choice between providing each with their own entry or sharing one for all consumers. The first option may lead to high load on the accelerator due to multiple copy operations being submitted and also increases the memory footprint of the cache. The latter option, although more complex, was chosen to address these concerns. To implement this, the existing \texttt{CacheData} will be extended in scope to handle multiple consumers. Copies of it can be created, and they must synchronize with each other for \texttt{CacheData::WaitOnCompletion} and \texttt{CacheData::GetDataLocation}. This is illustrated by the green markings, indicating thread safety guarantees for access, in Figure \ref{fig:impl-design-interface}. The \texttt{Cache} must therefore also ensure that, on concurrent access to the same resource, only one thread creates \texttt{CacheData} while the others are provided with a copy. \par
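
The sharing of one entry among all consumers can be sketched as a lookup guarded by a mutex. This is an illustration under simplifying assumptions, not the implementation of this work: \texttt{ReuseCacheSketch} and \texttt{CopiesSubmitted} are hypothetical names, the copy is performed synchronously with \texttt{std::memcpy} instead of being submitted to a \gls{dsa}, and entries are keyed by source pointer only. \par

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <vector>

using Entry = std::shared_ptr<std::vector<unsigned char>>;

class ReuseCacheSketch {
public:
    // Returns the existing entry for src or creates it exactly once.
    Entry Access(const void* src, std::size_t size) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = entries_.find(src);
        if (it != entries_.end()) {
            return it->second;  // later consumers receive a copy of the handle
        }
        auto entry = std::make_shared<std::vector<unsigned char>>(size);
        std::memcpy(entry->data(), src, size);  // stands in for the DSA copy
        entries_.emplace(src, entry);
        ++copies_submitted_;
        return entry;
    }

    // Removes the entry for src; handles already handed out stay valid.
    void Invalidate(const void* src) {
        std::lock_guard<std::mutex> lock(mutex_);
        entries_.erase(src);
    }

    int CopiesSubmitted() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return copies_submitted_;
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<const void*, Entry> entries_;
    int copies_submitted_ = 0;
};
```

Holding the lock across lookup and insertion is what guarantees that only one thread creates the entry; a finer-grained scheme is possible but not shown here. \par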

-Secondly, invalidation is a manual process, requiring the programmer to remember which points of data are currently cached and to invalidate them upon modification. No ordering guarantees are provided in this situation, potentially leading to threads still holding pointers to now-outdated entries and continuing their progress with this data. \par
+\subsection{Cache Entry Lifetime}
+\label{subsec:design:cache-entry-lifetime}

-Due to its reliance on libnuma for memory allocation, \texttt{Cache} is exclusively compatible with systems where this library is available. It is important to note that Windows platforms use their own API for this purpose, which is incompatible with libnuma, rendering the code non-executable on such systems \cite{microsoft:numa-malloc}. \par
+Allowing multiple references to the same entry introduces concerns regarding memory management. The allocated block should only be freed when all copies of a \texttt{CacheData} instance are destroyed, thereby tying the cache entry's lifetime to the longest living copy of the original instance. This ensures that access to the entry is legal during the lifetime of any \texttt{CacheData} instance. Therefore, deallocation only occurs when the last copy of a \texttt{CacheData} instance is destroyed. \par
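
The lifetime rule corresponds to reference counting. The following sketch, which assumes a hypothetical \texttt{EntryHandle} class and uses a counter purely for illustration, relies on \texttt{std::shared\_ptr} with a custom deleter so that the block is freed only when the last copy is destroyed. \par

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

// Counts deallocations so the behaviour can be observed; illustration only.
int g_free_calls = 0;

void CountingFree(void* ptr) {
    ++g_free_calls;
    std::free(ptr);
}

// Hypothetical handle: all copies share ownership of one allocated block.
class EntryHandle {
public:
    explicit EntryHandle(std::size_t size)
        : block_(std::malloc(size), &CountingFree) {}

    void* Data() const { return block_.get(); }
    long Copies() const { return block_.use_count(); }

private:
    std::shared_ptr<void> block_;  // the deleter runs when the last copy dies
};
```

Copying an \texttt{EntryHandle} only increments the reference count, so access through any living copy remains legal regardless of the order in which the copies are destroyed. \par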

-\todo{remove the libnuma restriction}
-\todo{mention undefined behaviour when writing to cache source before having waited for caching operation completion}
-\todo{mention undefined behaviour through read-after-write pass according to ordering guarantees by dsa}
+\section{Usage Restrictions}
+\label{sec:design:restrictions}

-\section{Accelerator Usage}
-\label{sec:design:accel-usage}
+In the context of this work, the cache primarily manages static data. This leads to two restrictions on the invalidation operation, which in turn allow significant reductions in design complexity. Firstly, overlapping areas in the cache will lead to undefined behaviour during the invalidation of any one of them. Only entries with the same source pointer will be invalidated, while entries with a different source pointer that, due to their size, still cover the now-invalidated region remain unaffected. Consequently, the cache may or may not continue to contain invalid elements. Secondly, invalidation is a manual process, requiring the programmer to keep track of which data is currently cached and to invalidate it upon modification. No ordering guarantees are provided in this scenario, so threads may still hold pointers to now-outdated entries and continue their progress with this data. \par

-Compared with the challenges of ensuring correct entry lifetime and thread safety, the application of \gls{dsa} for the task of duplicating data is relatively straightforward, thanks in part to \gls{intel:dml} \cite{intel:dmldoc}. Upon a call to \texttt{Cache::Access} and determining that the given memory pointer is not present in the cache, work is submitted to the accelerator. However, before proceeding, the desired location for the cache entry must be determined which the user-defined cache placement policy function handles. Once the desired placement is obtained, the copy policy then determines, which nodes should participate in the copy operation. Following Section \ref{subsection:dsa-hwarch}, this is equivalent to selecting the accelerators. The copy tasks are distributed across the participating nodes. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed in Chapter \ref{chap:implementation}. \par
+Additionally, the cache inherits some restrictions from its use of the \gls{dsa}. As mentioned in Section \ref{subsubsec:state:ordering-guarantees}, only write-ordering is guaranteed, and only under specific circumstances. Although we configured the necessary parameters for this guarantee in Section \ref{sec:state:setup-and-config}, load balancing over multiple \gls{dsa}s, as described in Section \ref{sec:impl:application}, can introduce scenarios where writes to the same address are submitted to different accelerators. As the ordering guarantee holds only within a single \gls{dsa}, undefined behaviour can occur in \enquote{multiple-writers} scenarios, in addition to \enquote{read-after-write} scenarios. However, the constraints outlined in Section \ref{sec:design:interface} prevent the \enquote{multiple-writers} scenario by ensuring that only one thread can perform the caching task for a given datum. Moreover, the requirement for user-provided memory management functions to be thread-safe (Section \ref{subsec:design:policy-functions}) ensures that two concurrent cache accesses will never receive the same memory region for their task. These two guarantees in conjunction secure the cache's integrity. Hence, the only relevant scenario is \enquote{read-after-write}, which is also accounted for, since the cache pointer is updated by \texttt{CacheData::WaitOnCompletion} only when all operations have concluded. Situations where a caching task (read) depends on the results of another task (write) are thereby prevented. \par

-\todo{modify naming, possible outsource policy functions, mention memory policy}
+Despite accounting for the complications in operation ordering, one potential situation may still lead to undefined behaviour. Since the \gls{dsa} operates asynchronously, modifying data for a region present in the cache before ensuring that all caching operations have completed through a call to \texttt{CacheData::WaitOnCompletion} will result in an undefined state for the cached region. Therefore, it is imperative to explicitly wait for data present in the cache to avoid such scenarios. \par

 %%% Local Variables:
 %%% TeX-master: "diplom"