From 1021197009562b10c755e040dabd86c93d1d3230 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Constantin=20F=C3=BCrst?= <c@fuersten.info>
Date: Mon, 5 Feb 2024 00:45:03 +0100
Subject: [PATCH] do not use the float-force parameter H but move to a more
 dynmic one (h!tb), also dont use Subsection anywhere instead use Section, add
 a small paragraph to the implementation chapter stating how we used the cache
 in qdp

---
 thesis/content/20_state.tex          | 12 ++++++------
 thesis/content/40_design.tex         |  4 ++--
 thesis/content/50_implementation.tex | 10 +++++-----
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/thesis/content/20_state.tex b/thesis/content/20_state.tex
index f0dc8e8..e1618e3 100644
--- a/thesis/content/20_state.tex
+++ b/thesis/content/20_state.tex
@@ -45,7 +45,7 @@ This chapter introduces the relevant technologies and concepts for this thesis.
     \label{fig:qdp-simple-query}
 \end{figure}
 
-\gls{qdp} introduces a targeted strategy for optimizing database performance by intelligently prefetching relevant data. To achieve this, \gls{qdp} analyses queries, splitting them into distinct sub-tasks, resulting in the so-called query execution plan. An example of a query and a corresponding plan is depicted in Figure \ref{fig:qdp-simple-query}. From this plan, \gls{qdp} determines columns in the database used in subsequent tasks. Once identified, the system proactively copies these columns into faster memory during the execution of the pipeline. For the example (Figure \ref{fig:qdp-simple-query}), column \texttt{b} is accessed in \(SCAN_b\) and \(PROJECT_{b<-a}\) and column \texttt{a} is only accessed for \(SCAN_a\). Therefore, only column \texttt{b} will be chosen for prefetching in this scenario. \cite{dimes-prefetching} \par
+\gls{qdp} introduces a targeted strategy for optimizing database performance by intelligently prefetching relevant data. To achieve this, \gls{qdp} analyses queries, splitting them into distinct sub-tasks, resulting in the so-called query execution plan. An example of a query and a corresponding plan is depicted in Figure \ref{fig:qdp-simple-query}. From this plan, \gls{qdp} determines columns in the database used in subsequent tasks. Once identified, the system proactively copies these columns into faster memory during the execution of the pipeline. For the example (Figure \ref{fig:qdp-simple-query}), column \texttt{b} is accessed in \(SCAN_b\) and \(G_{sum(b)}\) and column \texttt{a} is only accessed for \(SCAN_a\). Therefore, only column \texttt{b} will be chosen for prefetching in this scenario. \cite{dimes-prefetching} \par
 
 \section{\glsentrylong{dsa}}
 
@@ -68,7 +68,7 @@ The \gls{dsa} chip is directly integrated into the processor and attaches via th
 
 \textsc{Component \rom{1}, \glsentrylong{dsa:wq}:} \gls{dsa:wq}s provide the means to submit tasks to the device and will be described in more detail shortly. They are marked yellow in Figure \ref{fig:dsa-internal-block}. A \gls{dsa:wq} is accessible through so-called portals, light blue in Figure \ref{fig:dsa-internal-block}, which are mapped memory regions. Submission of work is done by writing a descriptor to one of these. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue. There are two possible queue types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes and each group may only have one attached. A \gls{pcie-dmr}, which guarantees implicit synchronization, is generated via \gls{x86:enqcmd} and communicates with the device before writing \cite[Sec. 3.3.1]{intel:dsaspec}. This may result in higher submission cost, compared to the \gls{dsa:dwq} to which a descriptor is submitted via \gls{x86:movdir64b} \cite[Sec. 3.3.2]{intel:dsaspec}. \par
 
-\textsc{Component \rom{2}, Engine:} An Engine is the processing-block that connects to memory and performs the described task. To handle the different descriptors, each Engine has two internal execution paths. One for a task and the other for a batch descriptor. Processing a task descriptor is straightforward, as all information required to complete the operation are contained within \todo{cite this}. For a batch, the \gls{dsa} reads the batch descriptor, then fetches all task descriptors from memory and processes them \cite[Sec. 3.8]{intel:dsaspec}. An Engine can coordinate with the operating system in case it encounters a page fault, waiting on its resolution, if configured to do so, while otherwise, an error will be generated in this scenario \cite[Sec. 2.2, Block on Fault]{intel:dsaspec}. \par
+\textsc{Component \rom{2}, Engine:} An Engine is the processing-block that connects to memory and performs the described task. To handle the different descriptors, each Engine has two internal execution paths. One for a task and the other for a batch descriptor. Processing a task descriptor is straightforward, as all information required to complete the operation are contained within \cite[Sec. 3.2]{intel:dsaspec}. For a batch, the \gls{dsa} reads the batch descriptor, then fetches all task descriptors from memory and processes them \cite[Sec. 3.8]{intel:dsaspec}. An Engine can coordinate with the operating system in case it encounters a page fault, waiting on its resolution, if configured to do so, while otherwise, an error will be generated in this scenario \cite[Sec. 2.2, Block on Fault]{intel:dsaspec}. \par
 
 \textsc{Component \rom{3}, Groups:} Groups tie Engines and \glsentrylong{dsa:wq}s together, indicated by the dotted blue line around the components of Group 0 in Figure \ref{fig:dsa-internal-block}. This means, that tasks from one \gls{dsa:wq} may be processed from multiple Engines and vice-versa, depending on the configuration. This flexibility is achieved through the Group Arbiter, represented by the orange block in Figure \ref{fig:dsa-internal-block}, which connects the two components according to the user-defined configuration. \par
 
@@ -88,7 +88,7 @@ Ordering of operations is only guaranteed for a configuration with one \gls{dsa:
 \subsection{Software View}
 \label{subsec:state:dsa-software-view}
 
-\begin{figure}[h]
+\begin{figure}[h!tb]
     \centering
     \includegraphics[width=0.5\textwidth]{images/block-dsa-swarch.pdf}
     \caption{\glsentrylong{dsa} Software View \cite[Fig. 1 (b)]{intel:analysis}. Illustrating the software stack and internal interactions from user applications, through the driver to the portal for work submission.}
@@ -104,9 +104,9 @@ With some limitations, like lacking support for \gls{dsa:dwq} submission, this l
 \section{Programming Interface for \glsentrylong{dsa}}
 \label{sec:state:dml}
 
-As mentioned in Subsection \ref{subsec:state:dsa-software-view}, \gls{intel:dml} offers a high level interface for interacting with the hardware accelerator, specifically Intel \gls{dsa}. Opting for the C++ interface, we will now demonstrate its usage by example of a simple memcopy implementation for the \gls{dsa}. \par
+As mentioned in Section \ref{subsec:state:dsa-software-view}, \gls{intel:dml} offers a high level interface for interacting with the hardware accelerator, specifically Intel \gls{dsa}. Opting for the C++ interface, we will now demonstrate its usage by example of a simple memcopy implementation for the \gls{dsa}. \par
 
-\begin{figure}[h]
+\begin{figure}[h!tb]
     \centering
     \includegraphics[width=0.9\textwidth]{images/nsd-dsamemcpy.pdf}
     \caption{\glsentrylong{dml} Memcpy Implementation Pseudocode. Performs copy operation of a block of memory from source to destination. The \glsentryshort{dsa} executing this copy can be selected with the parameter \texttt{node}, and the template parameter \texttt{path} elects whether to run on hardware (Intel \glsentryshort{dsa}) or software (CPU).}
@@ -117,7 +117,7 @@ In the function header of Figure \ref{fig:dml-memcpy} two differences from stand
 
 The \texttt{path} parameter allows the selection of the executing device, which can be either the CPU or \gls{dsa}. The options include \texttt{dml::software} (CPU), \texttt{dml::hardware} (\gls{dsa}), and \texttt{dml::automatic}, where the latter dynamically selects the device at runtime, favoring \gls{dsa} over CPU execution \cite[Sec. Quick Start]{intel:dmldoc}. \par
 
-Choosing the engine which carries out the copy might be advantageous for performance, as we can see in Subsection \ref{subsec:perf:datacopy}. With the engine directly tied to the processing node, as observed in Subsection \ref{subsection:dsa-hwarch}, the node ID is equivalent to the ID of the \gls{dsa}. \par
+Choosing the engine which carries out the copy might be advantageous for performance, as we can see in Section \ref{subsec:perf:datacopy}. With the engine directly tied to the processing node, as observed in Section \ref{subsection:dsa-hwarch}, the node ID is equivalent to the ID of the \gls{dsa}. \par
 
 \gls{intel:dml} operates on data views, which we create from the given pointers to source and destination and size. This is done using \texttt{dml::make\_view(uint8\_t* ptr, size\_t size)}, visible in Figure \ref{fig:dml-memcpy}, where these views are labelled \texttt{src\_view} and \texttt{dst\_view}. \cite[Sec. High-level C++ API, Make view]{intel:dmldoc} \par
 
diff --git a/thesis/content/40_design.tex b/thesis/content/40_design.tex
index c91f8ba..70a2646 100644
--- a/thesis/content/40_design.tex
+++ b/thesis/content/40_design.tex
@@ -26,7 +26,7 @@ The task of prefetching is somewhat aligned with that of a cache. As a cache is
 
 The interface of \texttt{Cache} must provide three basic functions: (1) requesting a memory block to be cached, (2) accessing a cached memory block and (3) synchronizing cache with the source memory. The latter operation comes in to play when the data that is cached may also be modified, necessitating an update either from the source or vice versa. Due various setups and use cases for this cache, the user should also be responsible for choosing cache placement and the copy method. As re-caching is resource intensive, data should remain in the cache for as long as possible. We only flush entries, when lack of free cache memory requires it. \par
 
-\begin{figure}[H]
+\begin{figure}[h!tb]
     \centering
     \includegraphics[width=0.9\textwidth]{images/uml-cache-and-cachedata.pdf}
     \caption{Public Interface of \texttt{CacheData} and \texttt{Cache} Classes. Colour coding for thread safety. Grey denotes impossibility for threaded access. Green indicates full safety guarantees only relying on atomics to achieve this. Yellow may use locking but is still safe for use. Red must be called from a single threaded context.}
@@ -62,7 +62,7 @@ Due to its reliance on libnuma for memory allocation, \texttt{Cache} is exclusiv
 \section{Accelerator Usage}
 \label{sec:design:accel-usage}
 
-Compared with the challenges of ensuring correct entry lifetime and thread safety, the application of \gls{dsa} for the task of duplicating data is relatively straightforward, thanks in part to \gls{intel:dml} \cite{intel:dmldoc}. Upon a call to \texttt{Cache::Access} and determining that the given memory pointer is not present in the cache, work is submitted to the accelerator. However, before proceeding, the desired location for the cache entry must be determined which the user-defined cache placement policy function handles. Once the desired placement is obtained, the copy policy then determines, which nodes should participate in the copy operation. Following Subsection \ref{subsection:dsa-hwarch}, this is equivalent to selecting the accelerators. The copy tasks are distributed across the participating nodes. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed in Chapter \ref{chap:implementation}. \par
+Compared with the challenges of ensuring correct entry lifetime and thread safety, the application of \gls{dsa} for the task of duplicating data is relatively straightforward, thanks in part to \gls{intel:dml} \cite{intel:dmldoc}. Upon a call to \texttt{Cache::Access} and determining that the given memory pointer is not present in the cache, work is submitted to the accelerator. However, before proceeding, the desired location for the cache entry must be determined which the user-defined cache placement policy function handles. Once the desired placement is obtained, the copy policy then determines, which nodes should participate in the copy operation. Following Section \ref{subsection:dsa-hwarch}, this is equivalent to selecting the accelerators. The copy tasks are distributed across the participating nodes. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed in Chapter \ref{chap:implementation}. \par
 
 %%% Local Variables:
 %%% TeX-master: "diplom"
diff --git a/thesis/content/50_implementation.tex b/thesis/content/50_implementation.tex
index 93abb1c..d538311 100644
--- a/thesis/content/50_implementation.tex
+++ b/thesis/content/50_implementation.tex
@@ -28,7 +28,7 @@ The usage of locking and atomics has proven to be challenging. Their use is perf
 
 \subsection{Cache State Lock} \label{subsec:implementation:cache-state-lock}
 
-To keep track of the current cache state the \texttt{Cache} will hold a reference to each currently existing \texttt{CacheData} instance. The reason for this is twofold: In Section \ref{sec:design:cache} we decided to keep elements in the cache until forced by memory pressure to remove them. Secondly in Subsection \ref{subsec:design:cache-entry-reuse} we decided to reuse one cache entry for multiple consumers. The second part requires access to the structure holding this reference to be thread safe when accessing and modifying the cache state in \texttt{Cache::Access}, \texttt{Cache::Flush} and \texttt{Cache::Clear}. The latter two both require unique locking, preventing other calls to \texttt{Cache} from making progress while the operation is being processed. For \texttt{Cache::Access} the use of locking depends upon the caches state. At first, only a shared lock is acquired for checking whether the given address already resides in cache, allowing other \texttt{Cache::Access}-operations to also perform this check. If no entry for the region is present, a unique lock is required as well when adding the newly created entry to cache. \par
+To keep track of the current cache state the \texttt{Cache} will hold a reference to each currently existing \texttt{CacheData} instance. The reason for this is twofold: In Section \ref{sec:design:cache} we decided to keep elements in the cache until forced by memory pressure to remove them. Secondly in Section \ref{subsec:design:cache-entry-reuse} we decided to reuse one cache entry for multiple consumers. The second part requires access to the structure holding this reference to be thread safe when accessing and modifying the cache state in \texttt{Cache::Access}, \texttt{Cache::Flush} and \texttt{Cache::Clear}. The latter two both require unique locking, preventing other calls to \texttt{Cache} from making progress while the operation is being processed. For \texttt{Cache::Access} the use of locking depends upon the caches state. At first, only a shared lock is acquired for checking whether the given address already resides in cache, allowing other \texttt{Cache::Access}-operations to also perform this check. If no entry for the region is present, a unique lock is required as well when adding the newly created entry to cache. \par
 
 A map-datastructure was chosen to represent the current cache state with the key being the memory address of the entry and as value the \texttt{CacheData} instance. As the caching policy is controlled by the user, one datum may be requested for caching in multiple locations. To accommodate this, one map is allocated for each available \glsentrylong{numa:node} of the system. This can be exploited to reduce lock contention by separately locking each \gls{numa:node}'s state instead of utilizing a global lock. This ensures that \texttt{Cache::Access} and the implicit \texttt{Cache::Flush} it may cause can not hinder progress of caching operations on other \gls{numa:node}s. Both \texttt{Cache::Clear} and a complete \texttt{Cache::Flush} as callable by the user will now iteratively perform their respective task per \gls{numa:node} state, also allowing other \gls{numa:node} to progress.\par
 
@@ -44,7 +44,7 @@ Using \texttt{std::shared\_ptr<T>} also introduces uncertainty, relying on the i
 
 Therefore, the decision was made to implement atomic reference counting for \texttt{CacheData}. This involves providing a custom constructor and destructor wherein a shared atomic integer is either incremented or decremented using atomic fetch sub and add operations \cite{cppreference:atomic-operations} to modify the reference count. In the case of a decrease to zero, the destructor was called for the last reference and then performs the actual destruction. \par
 
-\begin{figure}[H]
+\begin{figure}[h!tb]
     \centering
     \includegraphics[width=0.9\textwidth]{images/seq-blocking-wait.pdf}
     \caption{Sequence for Blocking Scenario. Observable in first draft implementation. Scenario where \(T_1\) performed first access to a datum followed \(T_2\) and \(T_3\). Then \(T_1\) holds the handlers exclusively, leading to the other threads having to wait for \(T_1\) to perform the work submission and waiting before they can access the datum through the cache.}
@@ -55,9 +55,9 @@ Due to the possibility of access by multiple threads, the implementation of \tex
 
 To illustrate this, an exemplary scenario is used, as seen in the sequence diagram Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}. Assume that three threads \(T_1\), \(T_2\) and \(T_3\) wish to access the same resource. \(T_1\)  is the first to call \texttt{CacheData::Access} and therefore adds it to the cache state and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted and \(T_2\) and \(T_3\) obtain access to the incomplete \texttt{CacheData} on which they wait, causing them to see a nullptr for the handlers but invalid cache pointer, leading to atomic wait on the cache pointer (marked blue lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). \(T_1\) submits the work and sets the handlers (marked red lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) and \(T_3\) continue to wait. Therefore, only \(T_1\) can trigger the waiting and is therefore capable of keeping \(T_2\) and \(T_3\) from progressing. This is undesirable as it can lead to deadlocking if by some reason \(T_1\) does not wait and at the very least may lead to unnecessary delay for \(T_2\) and \(T_3\) if \(T_1\) does not wait immediately. \par
 
-\begin{figure}[H]
+\begin{figure}[h!tb]
     \centering
-    \includegraphics[width=0.9\textwidth]{images/nsd-cachedata-waitoncompletion.pdf}
+    \includegraphics[width=1.0\textwidth]{images/nsd-cachedata-waitoncompletion.pdf}
     \caption{\texttt{CacheData::WaitOnCompletion} Pseudocode. Final rendition of the implementation for a fair wait function.}
     \label{fig:impl-cachedata-waitoncompletion}
 \end{figure}
@@ -105,7 +105,7 @@ After \ref{subsec:implementation:accel-usage} the implementation of \texttt{Cach
 
 \section{Application to \glsentrylong{qdp}}
 
-\todo{write this section or consider putting it in evaluation}
+Applying the \texttt{Cache} to \gls{qdp} is straightforward. We adapted the benchmarking code developed by Anna Bartuschka and André Berthold \cite{dimes-prefetching}, calling \texttt{Cache::Access} for both prefetching and cache access. \par
 
 %%% Local Variables:
 %%% TeX-master: "diplom"