
finalize first stage of reformulation and begin configuring layout for proper figure position and typography

master
Constantin Fürst, 3 months ago
commit 0c8727b807
  1. thesis/bachelor.pdf (BIN)
  2. thesis/bachelor.tex (4)
  3. thesis/content/02_abstract.tex (2)
  4. thesis/content/20_state.tex (24)
  5. thesis/content/30_performance.tex (54)
  6. thesis/content/40_design.tex (12)
  7. thesis/content/50_implementation.tex (8)
  8. thesis/content/60_evaluation.tex (40)
  9. thesis/content/70_conclusion.tex (8)

thesis/bachelor.pdf (BIN)

thesis/bachelor.tex (4)

@@ -75,6 +75,10 @@ plainpages=false,pdfpagelabels=true]{hyperref}
\pagenumbering{arabic}
\clubpenalty = 10000 % prevent orphan lines ("Schusterjungen")
\widowpenalty = 10000 % prevent widow lines ("Hurenkinder")
\displaywidowpenalty = 10000
% use \input for small stuff (like a list you include twice or a tikz figure)
% and \include for large latex compilation workloads (like a chapter) to get faster builds.
\include{content/10_introduction}

thesis/content/02_abstract.tex (2)

@@ -8,7 +8,7 @@
% give (the supervisors have to be there for something, after all).
This bachelor's thesis explores data locality in heterogeneous memory systems, characterized by advancements in main memory technologies such as Non-Volatile RAM (NVRAM) and High Bandwidth Memory (HBM). Systems equipped with more than one type of main memory or employing a Non-Uniform Memory Architecture (NUMA) necessitate strategic decisions regarding data placement to take advantage of the properties of the different storage tiers. In response to this challenge, Intel has introduced the Data Streaming Accelerator (DSA), which offloads data operations, offering a potential avenue for enhancing efficiency in data-intensive applications. The primary objective of this thesis is to provide a comprehensive analysis and characterization of the architecture and performance of the DSA, along with its application to a domain-specific prefetching methodology aimed at accelerating database queries within heterogeneous memory systems. We contribute a versatile library, capable of performing caching, data replication and prefetching asynchronously, accelerated by the \gls{dsa}.
This bachelor's thesis explores data locality in heterogeneous memory systems, characterized by advancements in main memory technologies such as Non-Volatile RAM (NVRAM) and High Bandwidth Memory (HBM). Systems equipped with more than one type of main memory or employing a Non-Uniform Memory Architecture (NUMA) necessitate strategic decisions regarding data placement to take advantage of the properties of the different storage tiers. In response to this challenge, Intel has introduced the Data Streaming Accelerator (DSA), which offloads data operations, offering a potential avenue for enhancing efficiency in data-intensive applications. The primary objective of this thesis is to provide a comprehensive analysis and characterization of the architecture and performance of the DSA, along with its application to a domain-specific prefetching methodology aimed at accelerating database queries within heterogeneous memory systems. We contribute a versatile library, capable of performing caching, data replication and prefetching asynchronously, accelerated by the DSA.
%%% Local Variables:
%%% TeX-master: "diplom"

thesis/content/20_state.tex (24)

@@ -29,7 +29,7 @@
% chapter is usually written first and is the easiest
% (or the hardest, because it is the first).
This chapter introduces the relevant technologies and concepts for this thesis. The goal of this thesis is to apply the \glsentrylong{dsa} to the concept of \glsentrylong{qdp}, therefore we will familiarize ourselves with both. We also give background on \glsentrylong{hbm}, which is a secondary memory technology to the \glsentryshort{dram} used in current computers.
This chapter introduces the technologies and concepts relevant to the understanding of this thesis. The goal of this thesis is to apply the \glsentrylong{dsa} to the concept of \glsentrylong{qdp}; therefore, we will familiarize ourselves with both. We also give background on \glsentrylong{hbm}, which is a secondary memory technology to the \glsentryshort{dram} used in current computers.
\section{\glsentrylong{hbm}}
\label{sec:state:hbm}
@@ -55,7 +55,7 @@ This chapter introduces the relevant technologies and concepts for this thesis.
\gls{qdp} introduces a targeted strategy for optimizing database performance by intelligently prefetching relevant data. To achieve this, \gls{qdp} analyses queries, splitting them into distinct sub-tasks, resulting in the so-called query execution plan. An example of a query and a corresponding plan is depicted in Figure \ref{fig:qdp-simple-query}. From this plan, \gls{qdp} determines columns in the database used in subsequent tasks. Once identified, the system proactively copies these columns into faster memory. For the example (Figure \ref{fig:qdp-simple-query}), column \texttt{b} is accessed in \(SCAN_b\) and \(G_{sum(b)}\) and column \texttt{a} is only accessed for \(SCAN_a\). Therefore, only column \texttt{b} will be chosen for prefetching in this scenario. \cite{dimes-prefetching} \par
Applying pipelining, \gls{qdp} processes tasks in parallel and in chunks. Therefore, a high degree of concurrency may be observed, resulting in demand for CPU cycles and memory bandwidth. As prefetching takes place in parallel with query processing, it creates additional CPU load, potentially diminishing gains from the acceleration of subsequent steps through the cached data. For this reason, we intend to offload copy operations to the \gls{dsa} in this work, reducing the CPU impact and thereby increasing the performance gains offered by prefetching. \cite{dimes-prefetching} \par
Applying pipelining, \gls{qdp} processes tasks in parallel and in chunks, resulting in a high degree of concurrency. This increases demand for processing cycles and memory bandwidth. As prefetching takes place in parallel with query processing, it creates additional CPU load, potentially diminishing gains from the acceleration of subsequent steps through the cached data. Hence, our objective in this work is to offload the underlying copy operations to the \gls{dsa}, reducing the CPU impact and thereby increasing the performance gains offered by prefetching. \cite{dimes-prefetching} \par
\section{\glsentrylong{dsa}}
\label{sec:state:dsa}
@@ -77,9 +77,9 @@ The \gls{dsa} chip is directly integrated into the processor and attaches via th
\subsubsection{Architectural Components}
\label{subsec:state:dsa-arch-comp}
\textsc{Component \rom{1}, \glsentrylong{dsa:wq}:} \glsentryshort{dsa:wq}s provide the means to submit tasks to the device and will be described in more detail shortly. They are marked yellow in Figure \ref{fig:dsa-internal-block}. A \gls{dsa:wq} is accessible through so-called portals, light blue in Figure \ref{fig:dsa-internal-block}, which are mapped memory regions. Submission of work is done by writing a descriptor to one of these. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue. There are two possible queue types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes and each group may only have one attached. The method used to achieve this guarantee may result in higher submission cost \cite[Sec. 3.3.1]{intel:dsaspec}, compared to the \gls{dsa:dwq} to which a descriptor is submitted via a regular write \cite[Sec. 3.3.2]{intel:dsaspec}. \par
\textsc{Component \rom{1}, \glsentrylong{dsa:wq}:} \glsentryshort{dsa:wq}s provide the means to submit tasks to the device and are marked yellow in Figure \ref{fig:dsa-internal-block}. A \gls{dsa:wq} is accessible through so-called portals, light blue in Figure \ref{fig:dsa-internal-block}, which are mapped memory regions to which a descriptor is written, facilitating task submission. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue. There are two possible queue types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes. The method used to achieve this guarantee may result in higher submission cost \cite[Sec. 3.3.1]{intel:dsaspec}, compared to the \gls{dsa:dwq} to which a descriptor is submitted via a regular write \cite[Sec. 3.3.2]{intel:dsaspec}. \par
\textsc{Component \rom{2}, Engine:} An Engine is the processing-block that connects to memory and performs the described task. To handle the different descriptors, each Engine has two internal execution paths. One for a task and the other for a batch descriptor. Processing a task descriptor is straightforward, as all information required to complete the operation are contained within \cite[Sec. 3.2]{intel:dsaspec}. For a batch, the \gls{dsa} reads the batch descriptor, then fetches all task descriptors from memory and processes them \cite[Sec. 3.8]{intel:dsaspec}. An Engine can coordinate with the operating system in case it encounters a page fault, waiting on its resolution, if configured to do so, while otherwise, an error will be generated in this scenario \cite[Sec. 2.2, Block on Fault]{intel:dsaspec}. \par
\textsc{Component \rom{2}, Engine:} An Engine is the processing block that connects to memory and performs the described task. To handle the different descriptors, each Engine has two internal execution paths: one for a task and one for a batch descriptor. Processing a task descriptor is straightforward, as all information required to complete the operation is contained within \cite[Sec. 3.2]{intel:dsaspec}. For a batch, the \gls{dsa} reads the batch descriptor, then fetches all task descriptors from memory and processes them \cite[Sec. 3.8]{intel:dsaspec}. If configured to do so, an Engine can coordinate with the operating system when it encounters a page fault, waiting on its resolution; otherwise, an error is generated in this scenario \cite[Sec. 2.2, Block on Fault]{intel:dsaspec}. \par
\textsc{Component \rom{3}, Groups:} Groups tie Engines and \glsentrylong{dsa:wq}s together, indicated by the dotted blue line around the components of Group 0 in Figure \ref{fig:dsa-internal-block}. This means, that tasks from one \gls{dsa:wq} may be processed from multiple Engines and vice-versa, depending on the configuration. This flexibility is achieved through the Group Arbiter, represented by the orange block in Figure \ref{fig:dsa-internal-block}, which connects the two components according to the user-defined configuration. \par
@@ -108,7 +108,7 @@ Ordering of operations is only guaranteed for a configuration with one \gls{dsa:
\label{fig:dsa-software-arch}
\end{figure}
Since the Linux Kernel version 5.10, a driver for the \gls{dsa} has been available, which currently lacks a counterpart on Windows Operating Systems \cite[Sec. Installation]{intel:dmldoc}. As a result, accessing the \gls{dsa} is only possible under Linux. To interact with the driver and perform configuration operations, Intel provides the accel-config user-space application \cite{intel:libaccel-config-repo}. This toolset offers a command-line interface and can read configuration files to configure the device, as mentioned in Section \ref{subsection:dsa-hwarch}. The interaction is illustrated in the upper block labelled \enquote{User space} in Figure \ref{fig:dsa-software-arch}, where it communicates with the kernel driver, depicted in light green and labelled \enquote{IDXD} in Figure \ref{fig:dsa-software-arch}. Once successfully configured, each \gls{dsa:wq} is exposed as a character device through \texttt{mmap} of the associated portal \cite[Sec. 3.3]{intel:analysis}. \par
Lastly, we give an overview of the available software stack for interacting with the \gls{dsa}. Driver support is limited to Linux, where a driver for the \gls{dsa} has been available since Kernel version 5.10 \cite[Sec. Installation]{intel:dmldoc}. As a result, accessing the \gls{dsa} is impossible under other operating systems. To interact with the driver and perform configuration operations, Intel provides the accel-config user-space application and library \cite{intel:libaccel-config-repo}. This toolset offers a command-line interface and can read configuration files to configure the device, as mentioned in Section \ref{subsection:dsa-hwarch}, while also facilitating hardware discovery. The interaction is illustrated in the upper block labelled \enquote{User space} in Figure \ref{fig:dsa-software-arch}, where it communicates with the kernel driver, depicted in light green and labelled \enquote{IDXD} in Figure \ref{fig:dsa-software-arch}. Once successfully configured, each \gls{dsa:wq} is exposed as a character device through \texttt{mmap} of the associated portal \cite[Sec. 3.3]{intel:analysis}. \par
While a process could theoretically submit work to the \gls{dsa} by manually preparing descriptors and submitting them via special instructions, this approach can be cumbersome. Hence, \gls{intel:dml} exists to streamline this process. Despite some limitations, such as the lack of support for \gls{dsa:dwq} submission, this library offers an interface that manages the creation and submission of descriptors, as well as error handling and reporting. The high-level abstraction offered, enables compatibility measures, allowing code developed for the \gls{dsa} to also execute on machines without the required hardware \cite[Sec. High-level C++ API, Advanced usage]{intel:dmldoc}. \par
@@ -120,21 +120,13 @@ As mentioned in Section \ref{subsec:state:dsa-software-view}, \gls{intel:dml} of
\begin{figure}[!t]
\centering
\includegraphics[width=0.9\textwidth]{images/nsd-dsamemcpy.pdf}
\caption{\glsentrylong{dml} Memcpy Implementation Pseudocode. Performs copy operation of a block of memory from source to destination. The \glsentryshort{dsa} executing this copy can be selected with the parameter \texttt{node}, and the template parameter \texttt{path} elects whether to run on hardware (Intel \glsentryshort{dsa}) or software (CPU).}
\caption{\glsentrylong{dml} Memcpy Implementation Pseudocode. Performs copy operation of a block of memory from source to destination. The \glsentryshort{dsa} executing this copy can be selected with the parameter \texttt{node}. The template parameter \texttt{path} selects between hardware offloading (Intel \glsentryshort{dsa}) or software execution (CPU).}
\label{fig:dml-memcpy}
\end{figure}
In the function header of Figure \ref{fig:dml-memcpy} two differences from standard memcpy are notable. Firstly, there is the template parameter named \texttt{path}, and secondly, an additional parameter \texttt{int node}. Both will be discussed in the following paragraphs. \par
In the function header of Figure \ref{fig:dml-memcpy} two differences from standard memcpy are notable. Firstly, there is the template parameter named \texttt{path}, and secondly, an additional parameter \texttt{int node}. The \texttt{path} allows selection of the executing device, which can be either the CPU or \gls{dsa}. The options include \texttt{dml::software} (CPU), \texttt{dml::hardware} (\gls{dsa}), and \texttt{dml::automatic}, where the latter dynamically selects the device at runtime, favouring \gls{dsa} where available \cite[Sec. Quick Start]{intel:dmldoc}. Choosing the engine which carries out the copy might be advantageous for performance, as we can see in Section \ref{subsec:perf:datacopy}. This can either be achieved by pinning the current thread to the \gls{numa:node} that the device is located on, or by using optional parameters of \texttt{dml::submit} \cite[Sec. High-level C++ API, NUMA support]{intel:dmldoc}. As evident from Figure \ref{fig:dml-memcpy}, we chose the former option for this example, using \texttt{numa\_run\_on\_node} to restrict the current thread to run on the \gls{numa:node} chosen by \texttt{int node}. As this is only an example, potential side effects of calling this pseudocode, arising from modification of the \glsentryshort{numa}-assignment, are not relevant. \par
The \texttt{path} parameter allows the selection of the executing device, which can be either the CPU or \gls{dsa}. The options include \texttt{dml::software} (CPU), \texttt{dml::hardware} (\gls{dsa}), and \texttt{dml::automatic}, where the latter dynamically selects the device at runtime, favoring \gls{dsa} over CPU execution \cite[Sec. Quick Start]{intel:dmldoc}. \par
Choosing the engine which carries out the copy might be advantageous for performance, as we can see in Section \ref{subsec:perf:datacopy}. This can either be achieved by pinning the current thread to the \gls{numa:node} that the device is located on, or, or by using optional parameters of \texttt{dml::submit} \cite[Sec. High-level C++ API, NUMA support]{intel:dmldoc}. As evident from Figure \ref{fig:dml-memcpy}, we chose the former option for this example, using \texttt{numa\_run\_on\_node} to restrict the current thread to run on the given node. With it only being an example, potential side effects, arising from modification of \glsentryshort{numa}-assignment, of calling this pseudocode are not relevant. \par
\gls{intel:dml} operates on data views, which we create from the given pointers to source and destination and size. This is done using \texttt{dml::make\_view(uint8\_t* ptr, size\_t size)}, visible in Figure \ref{fig:dml-memcpy}, where these views are labelled \texttt{src\_view} and \texttt{dst\_view}. \cite[Sec. High-level C++ API, Make view]{intel:dmldoc} \par
In Figure \ref{fig:dml-memcpy}, we submit a single descriptor using the asynchronous operation from \gls{intel:dml}. This uses the function \texttt{dml::submit<path>}, which takes an operation type and parameters specific to the selected type and returns a handler to the submitted task. For the copy operation, we pass the two views created previously. The provided handler can later be queried for the completion of the operation. After submission, we poll for the task completion with \texttt{handler.get()} and check whether the operation completed successfully. \par
A noteworthy addition to the submission-call is the use of \texttt{.block\_on\_fault()}, enabling the \gls{dsa} to manage a page fault by coordinating with the operating system. It's essential to highlight that this functionality only operates if the device is configured to accept this flag. \cite[Sec. High-level C++ API, How to Use the Library]{intel:dmldoc} \cite[Sec. High-level C++ API, Page Fault handling]{intel:dmldoc}. \par
\gls{intel:dml} operates on data views, which we create from the given pointers to source and destination and size \cite[Sec. High-level C++ API, Make view]{intel:dmldoc}. This is done using \texttt{dml::make\_view(uint8\_t* ptr, size\_t size)}, visible in Figure \ref{fig:dml-memcpy}, where these views are labelled \texttt{src\_view} and \texttt{dst\_view}. Following this preparation, we submit a single descriptor using the asynchronous operation from \gls{intel:dml}. For submission, the function \texttt{dml::submit<path>} is used, which takes an operation type and parameters specific to the selected type and returns a handler to the submitted task. For the copy operation, we pass the two views created previously. The provided handler can later be queried for the completion of the operation. After submission, we poll for the task completion with \texttt{handler.get()} and check whether the operation completed successfully. A noteworthy addition to the submission-call is the use of \texttt{.block\_on\_fault()}, enabling the \gls{dsa} to manage a page fault by coordinating with the operating system. It's essential to highlight that this functionality only operates if the device is configured to accept this flag. \cite[Sec. High-level C++ API, How to Use the Library]{intel:dmldoc} \cite[Sec. High-level C++ API, Page Fault handling]{intel:dmldoc}. \par
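To make the preceding description more tangible, the following sketch shows how such a \texttt{memcpy} wrapper could look when written against the high-level C++ API of \gls{intel:dml} together with libnuma. It mirrors the pseudocode of Figure \ref{fig:dml-memcpy} but is only an illustration: the function name \texttt{dsa\_memcpy} and the error handling are our own and do not stem from the thesis implementation. \par

\begin{verbatim}
// Illustrative sketch of the memcpy wrapper described above, using the
// high-level C++ API of Intel DML and libnuma; an approximation of the
// pseudocode in the figure, not the exact implementation.
#include <dml/dml.hpp>
#include <numa.h>
#include <cstdint>
#include <cstddef>
#include <stdexcept>

template <typename path>
void dsa_memcpy(uint8_t* dst, uint8_t* src, size_t size, int node) {
    // pin the calling thread to the NUMA node of the chosen DSA, so
    // that the submission targets the accelerator on that node
    numa_run_on_node(node);

    // wrap the raw pointers into DML data views
    auto src_view = dml::make_view(src, size);
    auto dst_view = dml::make_view(dst, size);

    // asynchronously submit a single copy descriptor; block_on_fault
    // lets the DSA resolve page faults together with the OS
    auto handler = dml::submit<path>(dml::mem_copy.block_on_fault(),
                                     src_view, dst_view);

    // poll for completion and check the resulting status code
    auto result = handler.get();
    if (result.status != dml::status_code::ok) {
        throw std::runtime_error("DSA copy operation failed");
    }
}
\end{verbatim}

A call such as \texttt{dsa\_memcpy<dml::hardware>(dst, src, size, 2)} would then attempt to offload the copy to the \gls{dsa} on \gls{numa:node} 2, while passing \texttt{dml::software} as the path falls back to CPU execution. \par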
\section{System Setup and Configuration} \label{sec:state:setup-and-config}

thesis/content/30_performance.tex (54)

@@ -28,7 +28,9 @@ Timing in the outer loop may display lower throughput than actual. This is the c
\label{fig:benchmark-function:outer}
\end{figure}
To get accurate results, the benchmark is repeated \(10\) times. Each iteration is timed from beginning to end, marked by yellow in Figure \ref{fig:benchmark-function:outer}. For small task sizes, the iterations complete in a very short amount of time, which can have adverse effects on the results. Therefore, we repeat the code of the inner loop for a configurable amount, virtually extending the duration of a single iteration for these cases. The chosen internal repetition count is \(10.000\) for transfers in the range of \(1-8\ KiB\), \(1.000\) for \(1\ MiB\) and one for larger instances. \par
\todo{add more references to figure}
To get accurate results, the benchmark is repeated \(10\) times. Each iteration is timed from beginning to end, marked by yellow in Figure \ref{fig:benchmark-function:outer}. For small task sizes, the iterations complete in under \(100\ ms\). This short execution window can have adverse effects on the timings. Therefore, we repeat the code of the inner loop a configurable number of times, virtually extending the duration of a single iteration for these cases. The chosen internal repetition count is \(10.000\) for transfers in the range of \(1-8\ KiB\), \(1.000\) for \(1\ MiB\) and one for larger instances. \par
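A compact version of this outer measurement loop is sketched below. It is not the benchmark code itself: the helper \texttt{repetitions\_for}, the callable parameter and the per-repetition normalization are our own simplifications of the mechanism described above. \par

\begin{verbatim}
// Minimal sketch of the outer benchmark loop: ten timed iterations,
// with the inner code repeated to stretch very short iterations.
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// repetition thresholds as given in the text above
inline size_t repetitions_for(size_t transfer_size) {
    if (transfer_size <= 8 * 1024) return 10000;    // 1 KiB - 8 KiB
    if (transfer_size <= 1024 * 1024) return 1000;  // up to 1 MiB
    return 1;                                       // larger transfers
}

// returns one timing (seconds per inner repetition) for each of the
// ten iterations; the inner routine is passed as a callable
std::vector<double> outer_benchmark(size_t transfer_size,
                                    const std::function<void()>& inner) {
    std::vector<double> timings;
    const size_t reps = repetitions_for(transfer_size);
    for (int iteration = 0; iteration < 10; ++iteration) {
        const auto begin = std::chrono::steady_clock::now();
        for (size_t r = 0; r < reps; ++r) inner();
        const auto end = std::chrono::steady_clock::now();
        timings.push_back(
            std::chrono::duration<double>(end - begin).count() / reps);
    }
    return timings;
}
\end{verbatim}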
\begin{figure}[!t]
\centering
@@ -37,7 +39,9 @@ To get accurate results, the benchmark is repeated \(10\) times. Each iteration
\label{fig:benchmark-function:inner}
\end{figure}
For all \gls{dsa}s used in the benchmark, a submission thread executing the inner benchmark routine is spawned. The launch is synchronized by use of a barrier for each iteration. The behaviour in the inner function then differs depending on the submission method selected which can be a single submission or a batch of given size. This can be seen in Figure \ref{fig:benchmark-function:inner} at the switch statement for \enquote{mode}. Single submission follows the example given in Section \ref{sec:state:dml}, and we therefore do not go into detail explaining it here. Batch submission works unlike the former. A sequence with specified size is created which tasks are then added to. This sequence is submitted to the engine similar to the submission of a single descriptor. \par
\todo{add more references to figure}
For all \gls{dsa}s used in the benchmark, a submission thread executing the inner benchmark routine is spawned. The launch is synchronized by use of a barrier for each iteration. The behaviour in the inner function then differs depending on the selected submission method, which can be a single submission or a batch of given size. This selection is displayed in Figure \ref{fig:benchmark-function:inner} at the switch statement for \enquote{mode}. Single submission follows the example given in Section \ref{sec:state:dml}, and we therefore do not go into detail explaining it here. Batch submission differs from the former: a sequence of the specified size is created, to which the tasks are then added. This sequence is submitted to the engine similarly to the submission of a single descriptor. \par
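The following sketch illustrates the structure of such an inner routine. The barrier synchronization uses C++20 \texttt{std::barrier}; the batch construction via \texttt{dml::sequence} follows our reading of the \gls{intel:dml} documentation and should be taken as an approximation rather than the exact benchmark implementation. \par

\begin{verbatim}
// Sketch of the inner routine: one submission thread per DSA, launch
// synchronized by a barrier, then single or batch submission.
#include <barrier>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <dml/dml.hpp>

enum class submit_mode { single, batch };

void inner_benchmark(std::barrier<>& sync, submit_mode mode,
                     uint8_t* src, uint8_t* dst,
                     size_t size, uint32_t batch_size) {
    sync.arrive_and_wait();  // synchronize launch across all threads

    if (mode == submit_mode::single) {
        // single submission, as in the example of the previous chapter
        auto handler = dml::submit<dml::hardware>(
            dml::mem_copy,
            dml::make_view(src, size), dml::make_view(dst, size));
        handler.get();
    } else {
        // batch submission: create a sequence of the requested size and
        // add one copy task per chunk (API usage is our approximation)
        auto sequence = dml::sequence(batch_size,
                                      std::allocator<std::uint8_t>());
        const size_t chunk = size / batch_size;
        for (uint32_t i = 0; i < batch_size; ++i) {
            sequence.add(dml::mem_copy,
                         dml::make_view(src + i * chunk, chunk),
                         dml::make_view(dst + i * chunk, chunk));
        }
        // the whole sequence is submitted like a single descriptor
        auto handler = dml::submit<dml::hardware>(dml::batch, sequence);
        handler.get();
    }
}
\end{verbatim}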
\section{Benchmarks}
\label{sec:perf:bench}
@@ -58,7 +62,7 @@ We anticipate that single submissions will consistently yield poorer performance
\label{fig:perf-submitmethod}
\end{figure}
In Figure \ref{fig:perf-submitmethod} we conclude that with transfers of \(1\ MiB\) and upwards, the cost of single submission drops. As there is still a slight difference, datum size should be even larger. For smaller transfers the performance varies greatly, with batch operations leading in throughput. Reese Kuper et al. observed that \enquote{SWQ observes lower throughput between \(1-8\ KB\) [transfer size]} \cite[pp. 6]{intel:analysis}. We however observe a much higher point of equalization, pointing to additional delays introduced by programming the \gls{dsa} through \gls{intel:dml}. Another limitation may be observed in this result, namely the inherent throughput limit per \gls{dsa} chip of close to \(30\ GiB/s\). This is apparently caused by I/O fabric limitations \cite[p. 5]{intel:analysis}. \par
In Figure \ref{fig:perf-submitmethod} we observe that with transfers of \(1\ MiB\) and upwards, the cost of single submission drops. As a slight difference remains, even larger transfers would be required for single submission to fully catch up. For smaller transfers the performance varies greatly, with batch operations leading in throughput. Reese Kuper et al. noted that \enquote{SWQ observes lower throughput between \(1-8\ KB\) [transfer size]} \cite[pp. 6]{intel:analysis}. We however measured a much higher point of equalization, pointing to additional delays introduced by programming the \gls{dsa} through \gls{intel:dml}. Another limitation is visible in our result, namely the inherent throughput limit per \gls{dsa} chip of close to \(30\ GiB/s\). This is caused by I/O fabric limitations \cite[p. 5]{intel:analysis}. \par
\subsection{Multithreaded Submission}
\label{subsec:perf:mtsubmit}
@@ -77,19 +81,13 @@ In Figure \ref{fig:perf-mtsubmit}, we note that threading has no discernible neg
\subsection{Data Movement from \glsentryshort{dram} to \glsentryshort{hbm}}
\label{subsec:perf:datacopy}
Moving data from \glsentryshort{dram} to \gls{hbm} is most relevant to the rest of this work, as it is the target application. With \gls{hbm} offering higher bandwidth than the \glsentryshort{dram} of our system, we will be restricted by the available bandwidth of the source. To determine the upper limit achievable, we must calculate the available peak bandwidth. For each \gls{numa:node}, the test system is configured with two DIMMs of DDR5-4800. The naming scheme contains the data rate in Megatransfers per second, however the processor specification notes that, for dual channel operation, the maximum supported speed drops to \(4400\ MT/s\) \cite{intel:xeonmax-ark}. We calculate the transfers performed per second for one \gls{numa:node}, followed by the bytes per transfer \cite{kingston:ddr5-spec-overview} and at last combine these two for the theoretical peak bandwidth per \gls{numa:node} on the system. \par
\[2\ DIMM \times \frac{4400\ MT}{s\ \times\ DIMM} = 8800\ MT/s\]
\[\frac{64b}{8b/B}\ /\ T = 8\ B/T\]
Moving data from \glsentryshort{dram} to \gls{hbm} is most relevant to the rest of this work, as it is the target application. With \gls{hbm} offering higher bandwidth than the \glsentryshort{dram} of our system, we will be restricted by the available bandwidth of the source. To determine the upper limit achievable, we must calculate the available peak bandwidth. For each \gls{numa:node}, the test system is configured with two DIMMs of DDR5-4800. The naming scheme contains the data rate in Megatransfers (MT) per second; however, the processor specification notes that, for dual channel operation, the maximum supported speed drops to \(4400\ MT/s\) \cite{intel:xeonmax-ark}. We calculate the transfers performed per second for one \gls{numa:node} in (1), followed by the bytes per transfer \cite{kingston:ddr5-spec-overview} in (2), and finally combine these two for the theoretical peak bandwidth per \gls{numa:node} on the system in (3). \par
\[8800\ MT/s \times 8B/T = 70400 \times 10^6 B/s = 65.56\ GiB/s\]
From the observed bandwidth limitation of a single \gls{dsa} situated at about \(30\ GiB/s\) (see Section \ref{subsec:perf:submitmethod}) and the available memory bandwidth of \(65.56/ GiB/s\), we conclude that a copy task has to be split across multiple \gls{dsa}s to achieve peak throughput. Different methods of splitting will be evaluated. Given that our system consists of multiple sockets, communication crossing between sockets could introduce latency and bandwidth disadvantages \cite{bench:heterogeneous-communication}, which we will also evaluate. Beyond two \gls{dsa}, marginal gains are to be expected, due to the throughput limitation of the available memory. \par
To determine the optimal amount of \gls{dsa}s, we will measure throughput for one, two, four, and eight participating in the copy operations. We name the utilization of two \gls{dsa}s \enquote{Push-Pull}, as with two accelerators, we utilize the ones found on data source and destination \gls{numa:node}. As eight \gls{dsa}s is the maximum available on our system, this configuration will be referred to as \enquote{brute-force}. \par
For this benchmark, we transfer \(1\ GiB\)ibyte of data from \gls{numa:node} 0 to the destination \gls{numa:node}. We present data for \gls{numa:node}s 8, 11, 12, and 15. To understand the selection, see Figure \ref{fig:perf-xeonmaxnuma}, which illustrates the \gls{numa:node} IDs of the configured systems and the corresponding storage technology. \gls{numa:node} 8 accesses the \gls{hbm} on \gls{numa:node} 0, making it the physically closest possible destination. \gls{numa:node} 11 is located diagonally on the chip, representing the farthest intra-socket operation benchmarked. \gls{numa:node}s 12 and 15 lie diagonally on the second socket's CPU, making them representative of inter-socket transfer operations. \par
\begin{align}
2\ DIMM \times \frac{4400\ MT}{s\ \times\ DIMM} &= 8800\ MT/s \\
\frac{64\ b/T}{8\ b/B} &= 8\ B/T \\
8800\ MT/s \times 8B/T = 70400 \times 10^6 B/s &= 65.56\ GiB/s
\end{align}
\begin{figure}[!t]
\centering
@@ -124,12 +122,16 @@ For this benchmark, we transfer \(1\ GiB\)ibyte of data from \gls{numa:node} 0 t
\label{fig:perf-dsa}
\end{figure}
From the observed bandwidth limitation of a single \gls{dsa} situated at about \(30\ GiB/s\) (see Section \ref{subsec:perf:submitmethod}) and the available memory bandwidth of \(65.56\ GiB/s\), we conclude that a copy task has to be split across multiple \gls{dsa}s to achieve peak throughput. Different methods of splitting will be evaluated. Given that our system consists of multiple sockets, communication crossing the socket boundary could introduce latency and bandwidth disadvantages \cite{bench:heterogeneous-communication}, which we will also evaluate. Beyond two \gls{dsa}s, only marginal gains are to be expected, due to the throughput limitation of the available memory. \par
To determine the optimal number of \gls{dsa}s, we will measure throughput for one, two, four, and eight accelerators participating in the copy operations. We name the utilization of two \gls{dsa}s \enquote{Push-Pull}, as with two accelerators, we utilize the ones found on the data source and destination \gls{numa:node}s. As eight \gls{dsa}s is the maximum available on our system, this configuration will be referred to as \enquote{brute-force}. \par
For this benchmark, we transfer \(1\ GiB\) of data from \gls{numa:node} 0 to the destination \gls{numa:node}. We present data for \gls{numa:node}s 8, 11, 12, and 15. To understand the selection, see Figure \ref{fig:perf-xeonmaxnuma}, which illustrates the \gls{numa:node} IDs of the configured systems and the corresponding storage technology. \gls{numa:node} 8 accesses the \gls{hbm} on \gls{numa:node} 0, making it the physically closest possible destination. \gls{numa:node} 11 is located diagonally on the chip, representing the farthest intra-socket operation benchmarked. \gls{numa:node}s 12 and 15 lie diagonally on the second socket's CPU, making them representative of inter-socket transfer operations. \par
We begin by examining the common behaviour of load balancing techniques depicted in Figure \ref{fig:perf-dsa}. The real-world peak throughput of \(64\ GiB/s\) approaches the calculated available bandwidth. In Figure \ref{fig:perf-dsa:1}, a notable hard bandwidth limit is observed, just below the \(30\ GiB/s\) mark, reinforcing what was encountered in Section \ref{subsec:perf:submitmethod}: a single \gls{dsa} is constrained by I/O-Fabric limitations. \par
Unexpected throughput differences are evident for all configurations, except the bandwidth-bound single \gls{dsa}. Notably, \gls{numa:node} 8 performs worse than copying to \gls{numa:node} 11. As \gls{numa:node} 8 serves as the \gls{hbm} accessor for the data source \gls{numa:node}, it should have the shortest data path. This suggests that the \gls{dsa} may suffer from sharing parts of the data path for reading and writing. Another interesting observation is that, contrary to our assumption, the physically more distant \gls{numa:node} 15 achieves higher throughput than the closer \gls{numa:node} 12. We lack an explanation for this anomaly and will further examine this behaviour in the analysis of the CPU throughput results in Section \ref{subsec:perf:cpu-datacopy}. \par
For the results of the Brute-Force approach illustrated in Figure \ref{fig:perf-dsa:8}, we observe peak speeds when copying across sockets from \gls{numa:node} 0 to \gls{numa:node} 15. This contradicts our assumption that peak bandwidth would be limited by the interconnect. However, for intra-node copies, there is an observable penalty for using the off-socket \gls{dsa}s. We will analyse this behaviour by comparing the different benchmarked configurations and summarize our findings on scalability. \par
\begin{figure}[!t]
\centering
\begin{subfigure}[t]{0.35\textwidth}
@@ -142,24 +144,21 @@ For the results of the Brute-Force approach illustrated in Figure \ref{fig:perf-
\begin{subfigure}[t]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-dsa-throughput-scaling.pdf}
\caption{Scaling Factor for different amounts of participating \gls{dsa}. Calculated by dividing the throughput for a configuration by the throughput achieved with one \gls{dsa}. Linear scaling is desirable to not waste resources, therefore only the configurations with one and two \gls{dsa}s are determined to be effective.}
\caption{Scaling Behaviour for different counts of participating \gls{dsa}s. Displays the performance factor relative to utilizing one \gls{dsa}.}
\label{fig:perf-dsa-analysis:scaling}
\end{subfigure}
\caption{Scalability Analysis for different amounts of participating \gls{dsa}s. Displays the average throughput and the derived scaling factor. Shows that, although the throughput does increase with adding more accelerators, beyond two, the gained speed drops significantly. Calculated over the results from Figure \ref{fig:perf-dsa} and therefore applies to copies from \glsentryshort{dram} to \glsentryshort{hbm}.}
\label{fig:perf-dsa-analysis}
\end{figure}
For the results of the Brute-Force approach illustrated in Figure \ref{fig:perf-dsa:8}, we observe peak speeds when copying across sockets from \gls{numa:node} 0 to \gls{numa:node} 15. This contradicts our assumption that peak bandwidth would be limited by the interconnect. However, for intra-node copies, there is an observable penalty for using the off-socket \gls{dsa}s. We will analyse this behaviour by comparing the different benchmarked configurations and summarize our findings on scalability. \par
When comparing the Brute-Force approach with Push-Pull in Figure \ref{fig:perf-dsa-analysis:scaling}, average performance decreases by utilizing four times more resources over a longer duration. As shown in Figure \ref{fig:perf-dsa:2}, using Brute-Force still leads to a slight increase in throughput for inter-socket operations, although far from scaling linearly. Therefore, we conclude that, although data movement across the interconnect incurs additional cost, no hard bandwidth limit is observable, reaching the same peak speed also observed for intra-socket with four \gls{dsa}s. This might point to an architectural advantage, as we will encounter the expected speed reduction for copies crossing the socket boundary when executed on the CPU in Section \ref{subsec:perf:cpu-datacopy}. \par
From the average throughput and scaling factors in Figure \ref{fig:perf-dsa-analysis}, it becomes evident that splitting tasks over more than two \gls{dsa}s yields only marginal gains. This could be due to increased congestion of the overall interconnect, however, as no hard limit is encountered, this is not a definitive answer. \par
The choice of a load balancing method is not trivial. Consulting Figure \ref{fig:perf-dsa-analysis:average}, the highest throughput is achieved by using four \gls{dsa}s. At the same time, this causes high system utilization, making it unsuitable for situations where resources are to be distributed among multiple control flows. For this case, Push-Pull achieves performance close to the real-world peak while also not wasting resources due to poor scaling (see Figure \ref{fig:perf-dsa-analysis:scaling}). \par
\subsection{Data Movement using CPU}
\label{subsec:perf:cpu-datacopy}
For evaluating CPU copy performance we use the benchmark code from the previous Section (Section \ref{subsec:perf:datacopy}), selecting the software instead of hardware execution path (see Section \ref{subsec:state:dsa-software-view}). Colleagues performed extensive benchmarking of the peak throughput on CPU for the test system \cite{xeonmax-peakthroughput}, from which we will present results as well. We compare expectations and results from the previous Section with the measurements.\par
\begin{figure}[!t]
\centering
\begin{subfigure}[t]{0.35\textwidth}
@@ -179,6 +178,11 @@ For evaluating CPU copy performance we use the benchmark code from the previous
\label{fig:perf-cpu}
\end{figure}
\subsection{Data Movement using CPU}
\label{subsec:perf:cpu-datacopy}
For evaluating CPU copy performance we use the benchmark code from the previous Section (Section \ref{subsec:perf:datacopy}), selecting the software instead of hardware execution path (see Section \ref{subsec:state:dsa-software-view}). Colleagues performed extensive benchmarking of the peak throughput on CPU for the test system \cite{xeonmax-peakthroughput}, from which we will present results as well. We compare expectations and results from the previous Section with the measurements.\par
As evident from Figure \ref{fig:perf-cpu:swpath}, the observed throughput of the software path is less than half of the theoretical bandwidth. Therefore, the software path is to be treated as a compatibility measure, and not as a means of providing high performance data copy operations. In Figure \ref{fig:perf-cpu:andrepeak}, peak throughput is achieved for intra-node operation, validating the assumption that there is a cost for communicating across sockets, which was not as directly observable with the \gls{dsa}. The same disadvantage for \gls{numa:node} 12, as observed in Section \ref{subsec:perf:datacopy}, can be seen in Figure \ref{fig:perf-cpu}. This points to an architectural anomaly which we could not explain with our knowledge or benchmarks. Further benchmarks were conducted for this, not yielding conclusive results, as the source and copy-thread location did not seem to affect the observed speed delta. \par
\section{Analysis}
@@ -190,9 +194,11 @@ In this section we summarize the conclusions from the performed benchmarks, outl
\item From \ref{subsec:perf:submitmethod} we conclude that small copies under \(1\ MiB\) in size require batching and still do not reach peak performance. Task size should therefore be at or above \(1\ MiB\). Otherwise, offloading might prove more expensive than performing the copy on CPU.
\item Section \ref{subsec:perf:mtsubmit} assures that access from multiple threads does not negatively affect the performance when using \glsentrylong{dsa:swq} for work submission. Due to the lack of \glsentrylong{dsa:dwq} support, we have no data to determine the cost of submission to the \gls{dsa:swq}.
\item In \ref{subsec:perf:datacopy}, we found that using more than two \gls{dsa}s results in only marginal gains. The choice of a load balancer therefore is the Push-Pull configuration, as it achieves fair throughput with low utilization.
\item Combining the result from Sections \ref{subsec:perf:datacopy} and \ref{subsec:perf:submitmethod}, we posit that for situations with smaller transfer sizes and a high amount of tasks, splitting a copy might prove disadvantageous. Due to incurring more delay from submission and overall throughput still remaining high without the split due to queue filling (see Section \ref{subsec:perf:mtsubmit}), the split might reduce overall effectiveness. To still utilize the available resources effectively, distributing tasks across the available \gls{dsa}s is still desirable. This finding lead us to implement round-robin balancing in Section \ref{sec:impl:application}.
\item Combining the results from Sections \ref{subsec:perf:datacopy} and \ref{subsec:perf:submitmethod}, we posit that for situations with smaller transfer sizes and a high number of tasks, splitting a copy might prove disadvantageous. Due to incurring more delay from submission and overall throughput still remaining high without the split due to queue filling (see Section \ref{subsec:perf:mtsubmit}), the split might reduce overall effectiveness. To utilize the available resources effectively, distributing tasks across the available \gls{dsa}s remains desirable. This finding led us to implement round-robin balancing in Section \ref{sec:impl:application}, of which a minimal sketch follows this list.
\end{itemize}
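The round-robin balancing mentioned in the last item can be illustrated by the following minimal sketch. The class name and members are purely illustrative and do not correspond to the implementation referenced above; the sketch only demonstrates the distribution idea. \par

\begin{verbatim}
// Illustrative round-robin selection of the next DSA node for task
// distribution; not the actual implementation.
#include <atomic>
#include <cstdint>
#include <vector>

class RoundRobinBalancer {
    std::vector<int> dsa_nodes_;        // NUMA nodes with a usable DSA
    std::atomic<uint64_t> counter_{0};  // monotonically increasing ticket

public:
    explicit RoundRobinBalancer(std::vector<int> nodes)
        : dsa_nodes_(std::move(nodes)) {}

    // each call returns the next node in cyclic order, thread-safely
    int next_node() {
        const uint64_t ticket =
            counter_.fetch_add(1, std::memory_order_relaxed);
        return dsa_nodes_[ticket % dsa_nodes_.size()];
    }
};
\end{verbatim}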
\pagebreak
Once again, we refer to Figures \ref{fig:perf-dsa} and \ref{fig:perf-cpu}, both representing the maximum throughput achieved with the utilization of either \gls{dsa} for the former and CPU for the latter. Noticeably, the \gls{dsa} does not seem to suffer from inter-socket overhead like the CPU. The \gls{dsa} performs similarly to the CPU for intra-node data movement, while outperforming it in inter-node scenarios. The latter, as mentioned in Section \ref{subsec:perf:datacopy}, might point to an architectural advantage of the \gls{dsa}. The performance observed in the above benchmarks demonstrates potential for rapid data movement while simultaneously relieving the CPU of this task and thereby freeing capacity that is then available for computation. \par
We encountered an anomaly on \gls{numa:node} 12 for which we were unable to find an explanation. Since this behaviour is also observed on the CPU, identifying the root cause falls beyond the scope of this work. Despite being unable to account for all measurements, this chapter still offers valuable insights into the performance of the \gls{dsa}, highlighting both its strengths and weaknesses. It provides data-driven guidance for a complex architecture, aiding in determining the best approach for optimal utilization of the \gls{dsa}. \par

thesis/content/40_design.tex (12)

@@ -22,21 +22,25 @@ In this chapter, we formulate a class interface for a general-purpose cache. We
The target application of the code contributed by this work is to accelerate \glsentrylong{qdp} by offloading copy operations to the \gls{dsa}. Prefetching is inherently related to cache functionality. Given that an application providing the latter offers a broader scope of utility beyond \gls{qdp}, we opted to implement an offloading \texttt{Cache}. \par
\begin{figure}[!t]
\begin{figure}[!b]
\centering
\includegraphics[width=0.9\textwidth]{images/uml-cache-and-cachedata.pdf}
\caption{Public Interface of \texttt{CacheData} and \texttt{Cache} Classes. Colour coding for thread safety. Grey denotes impossibility for threaded access. Green indicates full safety guarantees only relying on atomics to achieve this. Yellow may use locking but is still safe for use. Red must be called from a single threaded context.}
\label{fig:impl-design-interface}
\end{figure}
\todo{add more references to the figure}
\pagebreak
\section{Interface}
\label{sec:design:interface}
The interface of \texttt{Cache} must provide three basic functions: (1) requesting a memory block to be cached, (2) accessing a cached memory block and (3) synchronizing cache with the source memory. The latter operation comes in to play when the data that is cached may also be modified, necessitating synchronization. Due to various setups and use cases for this cache, the user should also be responsible for choosing cache placement and the copy method. As re-caching is resource intensive, data should remain in the cache for as long as possible. We only flush entries, when lack of free cache memory requires it. \par
The interface of \texttt{Cache} must provide three basic functions: (1) requesting a memory block to be cached, (2) accessing a cached memory block and (3) synchronizing the cache with the source memory. The latter operation is required when the data that is cached may also be modified, necessitating synchronization. Due to various setups and use cases for this cache, the user should also be responsible for choosing cache placement and the copy method. As re-caching is resource intensive, data should remain in the cache for as long as possible. We only flush entries when the lack of free cache memory demands it. \par
Given that this work primarily focuses on caching static data, we only provide cache invalidation and not synchronization. The \texttt{Cache::Invalidate} function, given a memory address, will remove all entries for it from the cache. The other two operations, caching and access, are provided in one single function, which we shall henceforth call \texttt{Cache::Access}. This function receives a data pointer and size as parameters and takes care of either submitting a caching operation if the pointer received is not yet cached or returning the cache entry if it is. \par
Given the asynchronous nature of caching operations, users may opt to await their completion. This proves particularly beneficial when parallel threads are actively processing, and the current thread strategically pauses until its data becomes available in faster memory, thereby optimizing access speeds for local computations. To facilitate this process, the \texttt{Cache::Access} method returns an instance of an object referred to as \texttt{CacheData}. Figure \ref{fig:impl-design-interface} documents the public interface for \texttt{CacheData} on the left block labelled as such. Invoking \texttt{CacheData::GetDataLocation} provides access to a pointer to the location of the cached data. Additionally, the \texttt{CacheData::WaitOnCompletion} method is available, designed to return only upon the completion of the caching operation. During this period, the current thread will sleep, allowing unimpeded progress for other threads. To ensure that only pointers to valid memory regions are returned, this function must be called in order to update the cache pointer which otherwise has an undefined value. It queries the completion state of the operation, and, on success, updates the cache pointer to the then available memory region. \par
Given the asynchronous nature of caching operations, users may opt to await their completion. This proves particularly beneficial when parallel threads are actively processing, and the current thread strategically pauses until its data becomes available in faster memory, thereby optimizing access speeds for local computations. To facilitate this process, the \texttt{Cache::Access} method returns an instance of a class referred to as \texttt{CacheData}. Figure \ref{fig:impl-design-interface} documents the public interface for \texttt{CacheData} in the left block labelled as such. Invoking \texttt{CacheData::GetDataLocation} provides access to a pointer to the location of the cached data. Additionally, the \texttt{CacheData::WaitOnCompletion} method is available, designed to return only upon the completion of the caching operation. During this period, the current thread will sleep, allowing unimpeded progress for other threads. To ensure that only pointers to valid memory regions are returned, this function must be called in order to update the cache pointer, which otherwise has an undefined value. It queries the completion state of the operation, and, on success, updates the cache pointer to the then available memory region. \par
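Expressed as C++ declarations, the interface described above could be approximated as follows. This is only a sketch derived from the description and Figure \ref{fig:impl-design-interface}; the exact signatures, return types and ownership semantics of the actual classes may differ. \par

\begin{verbatim}
// Approximate public interfaces of Cache and CacheData as described in
// the text; signatures are assumptions of this sketch.
#include <cstddef>
#include <cstdint>
#include <memory>

class CacheData {
public:
    // pointer to the cached copy; only valid after WaitOnCompletion
    // has observed the completed caching operation
    uint8_t* GetDataLocation() const;

    // sleeps until all caching operations for this entry have
    // concluded, then updates the internal cache pointer
    void WaitOnCompletion();
};

class Cache {
public:
    // (1) and (2): returns the existing entry for the given region or
    // submits an asynchronous caching operation for it; returning a
    // shared handle is an assumption of this sketch
    std::shared_ptr<CacheData> Access(uint8_t* data, size_t size);

    // (3) reduced to invalidation: removes all entries for the address
    void Invalidate(uint8_t* data);
};
\end{verbatim}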
\subsection{Policy Functions}
\label{subsec:design:policy-functions}
@@ -73,6 +77,8 @@ In the context of this work, the cache primarily manages static data, leading to
Additionally, the cache inherits some restrictions due to its utilization of the \gls{dsa}. As mentioned in Section \ref{subsubsec:state:ordering-guarantees}, only write-ordering is guaranteed under specific circumstances. Although we configured the necessary parameters for this guarantee in Section \ref{sec:state:setup-and-config}, load balancing over multiple \gls{dsa}s, as described in Section \ref{sec:impl:application}, can introduce scenarios where writes to the same address may be submitted on different accelerators. As the ordering guarantee is provided on only one \gls{dsa}, undefined behaviour can occur in \enquote{multiple-writers}, in addition to the \enquote{read-after-write} scenarios. However, due to the constraints outlined in Section \ref{sec:design:interface}, the \enquote{multiple-writers} scenario is prevented by ensuring that only one thread can perform the caching task for a given datum. Moreover, the requirement for user-provided memory management functions to be thread-safe (Section \ref{subsec:design:policy-functions}) ensures that two concurrent cache accesses will never receive the same memory region for their task. These two guarantees in conjunction secure the cache's integrity. Hence, the only relevant scenario is \enquote{read-after-write}, which is also accounted for since the cache pointer is updated by \texttt{CacheData::WaitOnCompletion} only when all operations have concluded. Situations where a caching task (read) depends on the results of another task (write) are thereby prevented. \par
\todo{consider listing possible hazards before dropping them out of thin air here}
Despite accounting for the complications in operation ordering, one potential situation may still lead to undefined behaviour. Since the \gls{dsa} operates asynchronously, modifying data for a region present in the cache before ensuring that all caching operations have completed through a call to \texttt{CacheData::WaitOnCompletion} will result in an undefined state for the cached region. Therefore, it is imperative to explicitly wait for data present in the cache to avoid such scenarios. \par
%%% Local Variables:

thesis/content/50_implementation.tex (8)

@@ -20,11 +20,11 @@
% have no place here, not even in the appendix; they belong on computers
% where they can be viewed.
In this chapter, we concentrate on specific implementation details, offering an in-depth view of how the design promises outlined in Chapter \ref{chap:design} are realized. Firstly, we delve into the usage of locking and atomics to achieve thread safety. Finally, we apply the cache to \glsentrylong{qdp}, detailing the policies mentioned in Section \ref{subsec:design:policy-functions} and presenting solutions for the challenges encountered.\par
In this chapter, we concentrate on specific implementation details, offering an in-depth view of how the design promises outlined in Chapter \ref{chap:design} are realized. First, we delve into the usage of locking and atomics to achieve thread safety and then apply the cache to \glsentrylong{qdp}, thereby detailing the policies mentioned in Section \ref{subsec:design:policy-functions} and presenting solutions for the challenges encountered.\par
\section{Synchronization for Cache and CacheData}
The usage of locking and atomics to achieve safe concurrent access has proven to be challenging. Their use is performance-critical, and mistakes may lead to deadlock. Consequently, these aspects constitute the most interesting part of the implementation, which is why this chapter will extensively focus on the details of their implementation. \par
The usage of locking and atomics to achieve safe concurrent access has proven to be challenging. Their use is performance-critical, and mistakes may lead to deadlock. Consequently, these aspects constitute the most interesting part of the implementation, which is why this chapter will focus on the synchronization techniques employed. \par
Throughout the following sections we will use the term \enquote{handler}, which was coined by \gls{intel:dml}, referring to an object associated with an operation on the accelerator. Through it, the state of a task may be queried, making the handler our connection to the asynchronously executed task. Use of a handler is also displayed in the \texttt{memcpy}-function for the \gls{dsa} as shown in Figure \ref{fig:dml-memcpy}. As we may split up one single copy into multiple distinct tasks for submission to multiple \gls{dsa}s, \texttt{CacheData} internally contains a vector of multiple of these handlers. \par
@@ -59,6 +59,8 @@ Upon call to \texttt{Cache::Access}, coordination must again take place to only
In the first implementation, a thread would check if the handlers are available and atomically wait on a value change from \texttt{nullptr}, if they are not. As the handlers are only available after submission, a situation could arise where only one copy of \texttt{CacheData} is capable of actually waiting on them. To illustrate this, an exemplary scenario is used, as seen in the sequence diagram Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}. Assume that three threads \(T_1\), \(T_2\) and \(T_3\) wish to access the same resource. \(T_1\) is the first to call \texttt{CacheData::Access} and therefore adds it to the cache state and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted and \(T_2\) and \(T_3\) obtain access to the incomplete \texttt{CacheData} on which they wait, causing them to see \texttt{nullptr} for the cache pointer and handlers. Therefore, \(T_2\) and \(T_3\) wait on the cache pointer becoming valid (marked blue lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). Then \(T_1\) submits the work and sets the handlers (marked red lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) and \(T_3\) continue to wait on the cache pointer. Therefore, only \(T_1\) can trigger the waiting on the handlers and is therefore capable of keeping \(T_2\) and \(T_3\) from progressing. This is undesirable as it can lead to deadlocking if \(T_1\) does not wait and at the very least may lead to unnecessary delay for \(T_2\) and \(T_3\). \par
\pagebreak
\begin{figure}[!t]
\centering
\includegraphics[width=1.0\textwidth]{images/nsd-cachedata-waitoncompletion.pdf}
@@ -66,7 +68,7 @@ In the first implementation, a thread would check if the handlers are available
\label{fig:impl-cachedata-waitoncompletion}
\end{figure}
As a solution for this, a more intricate implementation is required, for which Figure \ref{fig:impl-cachedata-waitoncompletion} shows pseudocode. When waiting, the threads now immediately check whether the cache pointer contains a valid value and return if it does. We will use the same example as before to illustrate the second part of the waiting procedure. Both \(T_2\) and \(T_3\) arrive in this latter section as the cache was invalid at the point in time when waiting was called for. They now wait on the handlers-pointer to change. When \(T_1\) supplies the handlers after submitting work, it also uses \texttt{std::atomic<T>::notify\_one} to wake at least one thread waiting on value change of the handlers-pointer. Through this, the exclusion that was observable in the first implementation is already avoided. The handlers will be atomically set to a valid pointer and a thread may pass the wait. Following this, the handlers-pointer is atomically exchanged, invalidating it and assigning the previous valid value to a local variable. Each thread checks whether it has received a valid pointer to the handlers from the exchange. If it has then the atomic operation guarantees that is now in sole possession of the pointer and the designated master-thread. The master is tasked with waiting, while other threads will now regress and call \texttt{CacheData::WaitOnCompletion} again, leading to a wait on the master thread setting the cache to a valid value. \par
To resolve this, a more intricate implementation is required, for which Figure \ref{fig:impl-cachedata-waitoncompletion} shows pseudocode. When waiting, the threads now immediately check whether the cache pointer contains a valid value and return if it does. We will use the same example as before to illustrate the second part of the waiting procedure. Both \(T_2\) and \(T_3\) arrive in this latter section as the cache was invalid at the point in time when waiting was called for. They now wait on the handlers-pointer to change. When \(T_1\) supplies the handlers after submitting work, it also uses \texttt{std::atomic<T>::notify\_one} to wake at least one thread waiting on value change of the handlers-pointer. Through this, the exclusion that was observable in the first implementation is already avoided. The handlers will be atomically set to a valid pointer and a thread may pass the wait. Following this, the handlers-pointer is atomically exchanged, invalidating it and assigning the previous valid value to a local variable. Each thread checks whether it has received a valid pointer to the handlers from the exchange. If it has, then the atomic operation guarantees that it is now in sole possession of the pointer and is the designated master-thread. The master is tasked with waiting, while other threads will now regress and call \texttt{CacheData::WaitOnCompletion} again, leading to a wait on the master thread setting the cache to a valid value. \par
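The core of this master-election scheme can be condensed into the following sketch, which relies on the C++20 atomic wait and notify facilities. \texttt{Handler} stands in for the \gls{intel:dml} handler type, the member names are illustrative, and the additional wait on the publication of the handlers shown in Figure \ref{fig:impl-cachedata-waitoncompletion} is omitted for brevity. \par

\begin{verbatim}
// Condensed sketch of the master-election idea behind
// CacheData::WaitOnCompletion; simplified, not the thesis code.
#include <atomic>
#include <cstdint>
#include <vector>

template <typename Handler>
struct CacheDataSketch {
    uint8_t* destination_ = nullptr;        // region the DSA writes to
    std::atomic<uint8_t*> cache_{nullptr};  // published once data is valid
    std::atomic<std::vector<Handler>*> handlers_{nullptr};  // set after submission

    void WaitOnCompletion() {
        // fast path: the cached copy is already available
        if (cache_.load(std::memory_order_acquire) != nullptr) return;

        // at most one thread obtains the handlers and becomes the master
        std::vector<Handler>* local = handlers_.exchange(nullptr);

        if (local != nullptr) {
            // master: wait on all submitted tasks, then publish the
            // cache pointer and wake the waiting threads
            for (Handler& handler : *local) handler.get();
            cache_.store(destination_, std::memory_order_release);
            cache_.notify_all();
            delete local;  // assumes the handlers were heap-allocated
            return;
        }

        // non-master: sleep until the master publishes a valid pointer
        while (cache_.load(std::memory_order_acquire) == nullptr)
            cache_.wait(nullptr);
    }
};
\end{verbatim}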
\subsection{CacheData: Edge Cases and Deadlocks}
\label{subsec:impl:cachedata-deadlocks}

40
thesis/content/60_evaluation.tex

@ -15,9 +15,9 @@ In this chapter, we establish anticipated outcomes for incorporating the develop
\section{Benchmarked Task}
\label{sec:eval:bench}
The benchmark involves the execution of a simple query, as depicted in Figure \ref{fig:qdp-simple-query}. We will henceforth denote \(SCAN_a\) as the pipeline responsible for scanning and subsequently filtering column \texttt{a}, \(SCAN_b\) as the pipeline tasked with prefetching column \texttt{b} and \(AGGREGATE\) as the projection and final summation step. The column size utilized is set at \(4\ GiB\). The workload is distributed across multiple groups, with each group spawning threads for every pipeline step. To ensure equitable comparison, each tested configuration employs 64 threads for the initial stage (\(SCAN_a\) and \(SCAN_b\)) and 32 subsequently (\(AGGREGATE\)), while being constrained to execute on \gls{numa:node} 0 through pinning. For configurations without prefetching, \(SCAN_b\) is omitted. We measure total and per-pipeline duration and cache hit percentage for prefetching for 5 iterations with 5 previous warm-up runs, and form the average. \par
The benchmark involves the execution of a simple query, which, as depicted in the execution plan in Figure \ref{fig:qdp-simple-query}, is divided into multiple processing stages. We will use the following notations for the three tasks performed in executing the query: (1) \(SCAN_a\), responsible for scanning and subsequently filtering column \texttt{a}; (2) \(SCAN_b\), tasked with prefetching column \texttt{b}; (3) \(AGGREGATE\), the projection and final summation step. The column size utilized is set at \(4\ GiB\). The workload is distributed across multiple groups, with each group spawning threads for every step. To ensure equitable comparison, each tested configuration employs 64 threads for the initial stage (\(SCAN_a\) and \(SCAN_b\)) and 32 subsequently (\(AGGREGATE\)), while being constrained to execute on \gls{numa:node} 0 through pinning. For configurations without prefetching, \(SCAN_b\) is omitted. We measure total and per-task duration and cache hit ratio for prefetching over five iterations, forming the average. This is preceded by five warm-up runs, for which no timings are recorded. \par
The pipelines \(SCAN_a\) and \(SCAN_b\) execute concurrently, completing their tasks before signalling \(AGGREGATE\) for finalization. In a bid to enhance the cache hit rate, we opted to relax this constraint, allowing \(SCAN_b\) to operate independently, while only synchronizing \(SCAN_a\) with \(AGGREGATE\). Consequently, work is submitted to the \gls{dsa} as frequently as possible, aiming to complete caching operations for a chunk of \texttt{b} before \(SCAN_a\) finalizes processing the corresponding part of \texttt{a}. This burst-submission could cause the \gls{dsa}'s work queue to overrun, leading us to increase the size of the blocks for the benchmarks utilizing \gls{qdp} to avoid this. \par
The tasks \(SCAN_a\) and \(SCAN_b\) execute concurrently, each completing its work before signalling \(AGGREGATE\) for finalization. In a bid to enhance the cache hit rate, we opted to relax this constraint, allowing \(SCAN_b\) to operate independently, while only synchronizing \(SCAN_a\) with \(AGGREGATE\). Consequently, work is submitted to the \gls{dsa} as soon as a chunk of column \texttt{b} is ready for caching. This burst-submission could cause the \gls{dsa}'s work queue to overflow, leading us to increase the size of the processed chunks for the benchmarks utilizing \gls{qdp}, thereby reducing the number of blocks cached concurrently. \par
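To make the division of work more concrete, the sketch below outlines the per-chunk computation of \(SCAN_a\) and \(AGGREGATE\). The filter predicate, the summation as the final aggregate and all identifiers are assumptions chosen for illustration; \(SCAN_b\), which merely issues the caching of the corresponding chunk of column \texttt{b}, is omitted. \par

\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed query shape for illustration: SELECT SUM(b) FROM T WHERE a < threshold.

// SCAN_a: scan one chunk of column a and filter it into a bitmask
std::vector<bool> scan_a(const uint64_t* a, size_t n, uint64_t threshold) {
    std::vector<bool> mask(n);
    for (size_t i = 0; i < n; ++i) mask[i] = (a[i] < threshold);
    return mask;
}

// AGGREGATE: project the filtered positions of column b and sum them up
uint64_t aggregate(const uint64_t* b, const std::vector<bool>& mask, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (mask[i]) sum += b[i];
    }
    return sum;
}
\end{verbatim}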
\section{Expectations}
\label{sec:eval:expectations}
@ -27,23 +27,6 @@ The simple query presents a challenging scenario for the \texttt{Cache}. The exe
\section{Observations}
\label{sec:eval:observations}
In this section, we will present our findings from integrating the \texttt{Cache} developed in Chapters \ref{chap:design} and \ref{chap:implementation} into \gls{qdp}. We commence by presenting results obtained without prefetching, which serve as a reference for evaluating the effectiveness of our \texttt{Cache}. For all results presented, the number of threads per pipeline and the number of groups influence performance \cite{dimes-prefetching}, which, however, is out of scope for this work. Therefore, results shown are for the best configurations measured. \par
\subsection{Benchmarks without Prefetching}
We benchmarked two methods to establish a baseline and an upper limit as reference points. In the former, all columns are located in \glsentryshort{dram}. The latter method simulates perfect prefetching without delay and overhead by placing column \texttt{b} in \gls{hbm} during benchmark initialization. \par
\begin{table}[!t]
\centering
\input{tables/table-qdp-baseline.tex}
\caption{Table showing raw timing for \gls{qdp} on \glsentryshort{dram} and \gls{hbm}. Result for \glsentryshort{dram} serves as baseline while \glsentryshort{hbm} presents the upper boundary achievable with perfect prefetching.}
\label{table:qdp-baseline}
\end{table}
From Table \ref{table:qdp-baseline}, it is evident that accessing column \texttt{b} through \gls{hbm} results in an increase in processing speed. To gain a better understanding of how the increased bandwidth of \gls{hbm} accelerates the query, we will delve deeper into the time spent in the different pipeline stages. \par
The following plots are normalized so that the longest execution from Figures \ref{fig:timing-comparison} and \ref{fig:timing-results} fills the half-circle. As waiting times at the barriers, which can vary by workload, are not displayed here, the graphs do not fully represent the total execution time. Additionally, the total runtime also encompasses some overhead that the per-pipeline timings do not cover. Therefore, a discrepancy between the raw runtime values from the Tables and Figures may be observed. \par
\begin{figure}[!t]
\centering
\begin{subfigure}[!t]{0.45\textwidth}
@ -63,6 +46,23 @@ The following plots are normalized so that the longest execution from Figures \r
\label{fig:timing-comparison}
\end{figure}
\begin{table}[!t]
\centering
\input{tables/table-qdp-baseline.tex}
\caption{Table showing raw timing for \gls{qdp} on \glsentryshort{dram} and \gls{hbm}. Result for \glsentryshort{dram} serves as baseline. \glsentryshort{hbm} presents the upper boundary achievable, as it simulates prefetching without processing overhead and delay.}
\label{table:qdp-baseline}
\end{table}
In this section, we will present our findings from integrating the \texttt{Cache} developed in Chapters \ref{chap:design} and \ref{chap:implementation} into \gls{qdp}. We commence by presenting results obtained without prefetching, which serve as a reference for evaluating the effectiveness of our \texttt{Cache}. For all results presented, the number of threads per processing stage and the number of groups influence performance \cite{dimes-prefetching}, which, however, is out of scope for this work. Therefore, results shown are for the best configurations measured. \par
The plots for detailed timing are normalized so that the longest-running configuration fills the half-circle. As waiting times at the barriers, which can vary by workload, are not displayed here, the graphs do not fully represent the total execution time. Additionally, the total runtime also encompasses some overhead that the per-task timings do not cover. Therefore, a discrepancy between the raw runtime values from the Tables and the Figures may be observed. \par
\subsection{Benchmarks without Prefetching}
We benchmarked two methods to establish a baseline and an upper limit as reference points. In the former, all columns are located in \glsentryshort{dram}. The latter method simulates perfect prefetching without delay and overhead by placing column \texttt{b} in \gls{hbm} during benchmark initialization. \par
From Table \ref{table:qdp-baseline}, it is evident that accessing column \texttt{b} through \gls{hbm} results in an increase in processing speed. To gain a better understanding of how the increased bandwidth of \gls{hbm} accelerates the query, we will delve deeper into the time spent in the different stages of the query execution plan. \par
Due to the higher bandwidth provided by \gls{hbm} for \(AGGREGATE\), the CPU waits less for data from main memory, thereby improving processing times. This is evident in the overall shorter time taken for \(AGGREGATE\) in Figure \ref{fig:timing-comparison:upplimit} compared to the baseline depicted in Figure \ref{fig:timing-comparison:baseline}. Consequently, more threads can be assigned to \(SCAN_a\), with \(AGGREGATE\) requiring fewer resources. This explains why the \gls{hbm} results show faster processing times than \gls{dram} not only for \(AGGREGATE\) but also for \(SCAN_a\). \par
\subsection{Benchmarks using Prefetching}
@ -76,7 +76,7 @@ To address the challenges posed by sharing memory bandwidth between both \(SCAN\
\label{table:qdp-speedup}
\end{table}
We now examine Table \ref{table:qdp-speedup}, where a slowdown is shown for prefetching. This drop-off below our baseline when utilizing the \texttt{Cache} may be surprising at first glance. However, it becomes reasonable when we consider that in this scenario, the \gls{dsa}s executing the caching tasks compete for bandwidth with the \(SCAN_a\) pipeline threads, while also adding additional overhead from the \texttt{Cache} and work submission. The second measured configuration for \gls{qdp} is shown as \enquote{Prefetching, Distributed Columns} in Table \ref{table:qdp-speedup}. For this method, distributing the columns across different \gls{numa:node}s results in a noticeable performance increase compared to our baseline, although not reaching the upper boundary set by simulating perfect prefetching (called \enquote{HBM (Upper Limit)} in the same Table). This confirms our assumption that the \(SCAN_a\) pipeline itself is bandwidth-bound, as without this contention, we observe an increase in cache hit rate and decrease in processing time. We will now examine the performance in more detail with per-pipeline timings. \par
We now examine Table \ref{table:qdp-speedup}, where a slowdown is shown for prefetching. This drop-off below our baseline when utilizing the \texttt{Cache} may be surprising at first glance. However, it becomes reasonable when we consider that in this scenario, the \gls{dsa}s executing the caching tasks compete for bandwidth with the threads processing \(SCAN_a\), while also adding additional overhead from the \texttt{Cache} and work submission. The second measured configuration for \gls{qdp} is shown as \enquote{Prefetching, Distributed Columns} in Table \ref{table:qdp-speedup}. For this method, distributing the columns across different \gls{numa:node}s results in a noticeable performance increase compared to our baseline, although not reaching the upper boundary set by simulating perfect prefetching (called \enquote{HBM (Upper Limit)} in the same Table). This confirms our assumption that \(SCAN_a\) itself is bandwidth-bound, as without this contention, we observe an increase in cache hit rate and decrease in processing time. We will now examine the performance in more detail with per-task timings. \par
\begin{figure}[!t]
\centering

8
thesis/content/70_conclusion.tex

@ -18,15 +18,15 @@
In this work, our aim was to analyse the architecture and performance of the \glsentrylong{dsa} and integrate it into \glsentrylong{qdp}. We characterized the hardware and software architecture of the \gls{dsa} in Section \ref{sec:state:dsa} and provided an overview of the available programming interface, \glsentrylong{intel:dml}, in Section \ref{sec:state:dml}. Our benchmarks were tailored to the planned application and included evaluations such as copy performance from \glsentryshort{dram} to \gls{hbm} (Section \ref{subsec:perf:datacopy}), the cost of multithreaded work submission (Section \ref{subsec:perf:mtsubmit}), and an analysis of different submission methods and sizes (Section \ref{subsec:perf:submitmethod}). Notably, we observed an anomaly in inter-socket copy speeds and found that the scaling of throughput was distinctly below linear (see Figure \ref{fig:perf-dsa-analysis:scaling}). Although not all observations were explainable, the results provided important insights into the behaviour of the \gls{dsa} and its potential application in multi-socket systems and \gls{hbm}, complementing existing analyses \cite{intel:analysis}. \par
Upon applying the cache developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}, we encountered challenges related to available memory bandwidth and the lack of feature support in the \glsentryshort{api} used to interact with the \gls{dsa}. While the \texttt{Cache} represents a substantial contribution to the field, its applicability is constrained to data that is infrequently mutated. Although support exists for entry invalidation, it is rather rudimentary, requiring manual invalidation and the developer to keep track of cached blocks and ensure they are not overlapping (see Section \ref{sec:design:restrictions}). To address this, a custom container data type could be developed to automatically trigger invalidation through the cache upon modification and to add age tags to the data, which consumer threads can pass on. This tagging can then be used to verify that work was completed with the most current version. \par
Upon applying the cache developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}, we encountered challenges related to available memory bandwidth and the lack of feature support in the \glsentryshort{api} used to interact with the \gls{dsa}. While the \texttt{Cache} represents a substantial contribution to the field, its applicability is constrained to data that is infrequently mutated. Although support exists for entry invalidation, it is rather rudimentary, requiring manual invalidation and obliging the developer to keep track of cached blocks and ensure they do not overlap (see Section \ref{sec:design:restrictions}). To address this, a custom container data type could be developed to automatically trigger invalidation through the cache upon modification and to add age tags to the data, which consumer threads can pass on. This tagging can then be used to verify that a thread's work was performed on current data. \par
In Section \ref{sec:eval:observations}, we observed adverse effects when prefetching with the cache during the parallel execution of memory-bound operations. This necessitated data distribution across multiple \glsentrylong{numa:node}s to circumvent bandwidth competition caused by parallel caching operations. Despite this limitation, we do not consider it a major fault of the \texttt{Cache}, as existing applications designed for \gls{numa} systems are likely already optimized in this regard. \par
As highlighted in Sections \ref{sec:state:dml} and \ref{sec:impl:application}, the \gls{api} utilized to interact with the \gls{dsa} currently lacks support for interrupt-based completion waiting and the use of \glsentrylong{dsa:dwq}. Future development efforts may focus on direct \gls{dsa} access, bypassing the \glsentrylong{intel:dml}, to leverage the complete feature set. Particularly, interrupt-based waiting would significantly enhance the usability of the \texttt{Cache}, currently only supporting busy-waiting. This led us to extend the design by implementing weak-waiting in Section \ref{sec:impl:application}, favouring cache misses instead of wasting resources during the wait. Additionally, access through a \glsentrylong{dsa:dwq} has the potential to reduce submission cost and thereby increase the cache's effectiveness. \par
As highlighted in Sections \ref{sec:state:dml} and \ref{sec:impl:application}, the \gls{api} utilized to interact with the \gls{dsa} currently lacks support for interrupt-based completion waiting and the use of \glsentrylong{dsa:dwq}. Future development efforts may focus on direct \gls{dsa} access, bypassing the \glsentrylong{intel:dml}, to leverage features of the \gls{dsa} not implemented in the library. Particularly, interrupt-based waiting would significantly enhance the usability of the \texttt{Cache}, which currently only supports busy-waiting. This led us to extend the design by implementing weak-waiting in Section \ref{sec:impl:application}, favouring cache misses over wasting resources during the wait. Additionally, access through a \glsentrylong{dsa:dwq} has the potential to reduce submission cost and thereby increase the \texttt{Cache}'s effectiveness. \par
Although the preceding paragraphs and the results in Chapter \ref{chap:evaluation} might suggest that the \texttt{Cache} requires extensive refinement for production applications, we argue the opposite. Under favourable conditions, as assumed for \glsentryshort{numa}-aware applications, we observed significant speed-up using the \texttt{Cache} for prefetching to \glsentrylong{hbm}, accelerating database queries. Its utility is not limited to prefetching alone; it offers a solution for replicating data to \gls{nvram} and might prove applicable to different use cases. Additional benchmarks on more complex queries for \gls{qdp} and a comparison between prefetching to \gls{hbm} using knowledge of the coming queries and the data they access, and \enquote{HBM Cache Mode} (see Section \ref{sec:state:hbm}) could yield deeper insights into the cache's performance. \par
Although the preceding paragraphs and the results in Chapter \ref{chap:evaluation} might suggest that the \texttt{Cache} requires extensive refinement for production applications, we argue the opposite. Under favourable conditions, we observed significant speed-up using the \texttt{Cache} for prefetching to \glsentrylong{hbm}, accelerating database queries. Given that these conditions align with those typically found in \gls{numa}-optimized applications, such a prerequisite is not unrealistic to expect. The utility of the \texttt{Cache} is not limited to prefetching alone; it offers a solution for replicating data to or from \gls{nvram} and might prove applicable to other use cases. Additional benchmarks on more complex queries for \gls{qdp} and a comparison between prefetching to \gls{hbm} and \enquote{HBM Cache Mode} (see Section \ref{sec:state:hbm}) could yield deeper insights into the \texttt{Cache}'s performance. \par
In conclusion, the \texttt{Cache}, together with the Sections on the \gls{dsa}'s architecture (Section \ref{sec:state:dsa}) and performance characteristics (Section \ref{sec:perf:bench}), fulfils the stated goal of this work. We have achieved performance gains through the \gls{dsa} in \gls{qdp}, thereby demonstrating its potential to facilitate the exploitation of the properties offered by the various storage tiers in heterogeneous memory systems. \par
In conclusion, the developed library, together with our exploration of the architecture and performance of the \gls{dsa}, fulfils the stated goal of this work. We have achieved performance gains through offloading data movement for \gls{qdp}, thereby demonstrating the \gls{dsa}'s potential to facilitate the exploitation of the properties offered by the various storage tiers in heterogeneous memory systems. \par
%%% Local Variables:
%%% TeX-master: "diplom"
