first read-through iteration, adding a large amount of todos

10 months ago · f359a70960
9 changed files with 95 additions and 32 deletions
--- a/thesis/bachelor.pdf
+++ b/thesis/bachelor.pdf
--- a/thesis/content/02_abstract.tex
+++ b/thesis/content/02_abstract.tex
@@ -10,6 +10,8 @@

 This bachelor's thesis explores data locality in heterogeneous memory systems, characterized by advancements in main memory technologies such as Non-Volatile RAM (NVRAM), High Bandwidth Memory (HBM), and Remote Memory. Systems equipped with more than one type of main memory necessitate strategic decisions regarding data placement to take advantage of the properties of the different storage tiers. In response to this challenge, Intel has introduced the Data Streaming Accelerator (DSA), which offloads data operations, offering a potential avenue for enhancing efficiency in data-intensive applications. The primary objective of this thesis is to provide a comprehensive analysis and characterization of the architecture and performance of the DSA, along with its application to a domain-specific prefetching methodology aimed at accelerating database queries within heterogeneous memory systems. We contribute a versatile implementation of a cache, offloading copy operations to the DSA.

+\todo{double "offloading" maybe reformulate the last sentence}
+
 %%% Local Variables:
 %%% TeX-master: "diplom"
 %%% End:
--- a/thesis/content/10_introduction.tex
+++ b/thesis/content/10_introduction.tex
@@ -20,6 +20,8 @@ This work introduces significant contributions to the field. Notably, the design

 We begin the work by furnishing the reader with pertinent technical information necessary for understanding the subsequent sections of this work in Chapter \ref{chap:state}. Background is given for \gls{hbm} and \gls{qdp}, followed by a detailed account of the \gls{dsa}'s architecture along with an available programming interface. Additionally, guidance on system setup and configuration is provided. Subsequently, Chapter \ref{chap:perf} analyses the strengths and weaknesses of the \gls{dsa} through microbenchmarks. Methodologies are presented, each benchmark is elaborated upon in detail, and usage guidance is drawn from the results. Chapters \ref{chap:design} and \ref{chap:implementation} elucidate the practical aspects of the work, including the development of the interface and implementation of the cache, shedding light on specific design considerations and implementation challenges. We comprehensively assess the implemented solution by providing concrete data on gains for an exemplary database query in Chapter \ref{chap:evaluation}. Finally, Chapter \ref{chap:conclusion} reflects on the insights gained and presents a review of the contributions and results of the preceding chapters. \par

+\todo{shorten intro by two lines so that it fits on one page}
+
 %%% Local Variables:
 %%% TeX-master: "diplom"
 %%% End:
--- a/thesis/content/20_state.tex
+++ b/thesis/content/20_state.tex
@@ -29,7 +29,7 @@
 % Kapitel wird in der Regel zuerst geschrieben und ist das Einfachste
 % (oder das Schwerste weil erste).

-This chapter introduces the relevant technologies and concepts for this thesis. The goal of this thesis is to apply the \glsentrylong{dsa} to the concept of \glsentrylong{qdp}, therefore we will familiarize ourselves with both of these terms. We also give background on \glsentrylong{hbm}, which is a secondary memory technology to the \glsentryshort{dram} used in current computers.
+This chapter introduces the relevant technologies and concepts for this thesis. The goal of this thesis is to apply the \glsentrylong{dsa} to the concept of \glsentrylong{qdp}; therefore, we will familiarize ourselves with both. We also give background on \glsentrylong{hbm}, which is a secondary memory technology to the \glsentryshort{dram} used in current computers.

 \section{\glsentrylong{hbm}}
 \label{sec:state:hbm}
@@ -37,6 +37,7 @@ This chapter introduces the relevant technologies and concepts for this thesis.
 \glsentrylong{hbm} is an emerging memory technology that promises an increase in peak bandwidth. It consists of stacked \glsentryshort{dram} dies \cite[p. 1]{hbm-arch-paper} and is gradually being integrated into server processors, with the Intel® Xeon® Max Series \cite{intel:xeonmaxbrief} being one recent example. \gls{hbm} on these systems can be configured in different memory modes, most notably, HBM Flat Mode and HBM Cache Mode \cite{intel:xeonmaxbrief}. The former gives applications direct control, requiring code changes, while the latter utilizes the \gls{hbm} as a cache for the system's \glsentryshort{dram}-based main memory \cite{intel:xeonmaxbrief}. \par

 \todo{find a nice graphic for hbm}
+\todo{potential for futurework: analyze performance of using cache-mode}

 \section{\glsentrylong{qdp}}
 \label{sec:state:qdp}
@@ -50,7 +51,7 @@ This chapter introduces the relevant technologies and concepts for this thesis.

 \gls{qdp} introduces a targeted strategy for optimizing database performance by intelligently prefetching relevant data. To achieve this, \gls{qdp} analyses queries, splitting them into distinct sub-tasks, resulting in the so-called query execution plan. An example of a query and a corresponding plan is depicted in Figure \ref{fig:qdp-simple-query}. From this plan, \gls{qdp} determines columns in the database used in subsequent tasks. Once identified, the system proactively copies these columns into faster memory. For the example (Figure \ref{fig:qdp-simple-query}), column \texttt{b} is accessed in \(SCAN_b\) and \(G_{sum(b)}\) and column \texttt{a} is only accessed for \(SCAN_a\). Therefore, only column \texttt{b} will be chosen for prefetching in this scenario. \cite{dimes-prefetching} \par

-Applying pipelining, \gls{qdp} processes tasks in parallel and in chunks. Therefore, a high degree of concurrency may be observed, resulting in demand for CPU cycles and memory bandwidth. As prefetching takes place in parallel with query processing, it creates additional CPU load, potentially diminishing gains from the acceleration of subsequent steps through the cached data. For this reason, we intend to offload copy operations to the \gls{dsa} in this work, reducing the CPU impact and thereby increasing the performance gains offered by prefetching. \par
+Applying pipelining, \gls{qdp} processes tasks in parallel and in chunks. Therefore, a high degree of concurrency may be observed, resulting in demand for CPU cycles and memory bandwidth. As prefetching takes place in parallel with query processing, it creates additional CPU load, potentially diminishing gains from the acceleration of subsequent steps through the cached data. For this reason, we intend to offload copy operations to the \gls{dsa} in this work, reducing the CPU impact and thereby increasing the performance gains offered by prefetching. \cite{dimes-prefetching} \par

 \section{\glsentrylong{dsa}}
 \label{sec:state:dsa}
@@ -66,19 +67,21 @@ Introduced with the \(4^{th}\) generation of Intel Xeon Scalable Processors, the
    \label{fig:dsa-internal-block}
 \end{figure}

-The \gls{dsa} chip is directly integrated into the processor and attaches via the I/O fabric interface, serving as the conduit for all communication. Through this interface, the \gls{dsa} is accessible as a PCIe device. Consequently, configuration utilizes memory-mapped registers set in the devices \gls{bar}. Through these registers, the devices' layout is defined and memory pages for work submission set. In a system with multiple processing nodes, there may also be one \gls{dsa} per node, resulting in up to four DSA devices per socket in \(4^{th}\) generation Intel Xeon Processors \cite[Sec. 3.1.1]{intel:dsaguide}. To accommodate various use cases, the layout of the \gls{dsa} is software-defined. The structure comprises three components, which we will describe in detail. We also briefly explain how the \gls{dsa} resolves virtual addresses and signals operation completion. At last, we will detail operation execution ordering. \par
+The \gls{dsa} chip is directly integrated into the processor and attaches via the I/O fabric interface, which serves as the conduit for all communication. Through this interface, the \gls{dsa} is accessible as a PCIe device. Consequently, configuration utilizes memory-mapped registers set in the device's \gls{bar}. In a system with multiple processing nodes, there may also be one \gls{dsa} per node, resulting in up to four DSA devices per socket in \(4^{th}\) generation Intel Xeon Processors \cite[Sec. 3.1.1]{intel:dsaguide}. To accommodate various use cases, the layout of the \gls{dsa} is software-defined. The structure comprises three components, which we will describe in detail. We also briefly explain how the \gls{dsa} resolves virtual addresses and signals operation completion. Finally, we will detail operation execution ordering. \par

 \subsubsection{Architectural Components}

 \textsc{Component \rom{1}, \glsentrylong{dsa:wq}:} \glsentryshort{dsa:wq}s provide the means to submit tasks to the device and will be described in more detail shortly. They are marked yellow in Figure \ref{fig:dsa-internal-block}. A \gls{dsa:wq} is accessible through so-called portals, light blue in Figure \ref{fig:dsa-internal-block}, which are mapped memory regions. Submission of work is done by writing a descriptor to one of these. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue. There are two possible queue types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes and each group may only have one attached. A \gls{pcie-dmr}, which guarantees implicit synchronization, is generated via \gls{x86:enqcmd} and communicates with the device before writing \cite[Sec. 3.3.1]{intel:dsaspec}. This may result in higher submission cost, compared to the \gls{dsa:dwq} to which a descriptor is submitted via \gls{x86:movdir64b} \cite[Sec. 3.3.2]{intel:dsaspec}. \par

+\todo{potentially give less details on the instructions or reformulate this some other way}
+
 \textsc{Component \rom{2}, Engine:} An Engine is the processing block that connects to memory and performs the described task. To handle the different descriptors, each Engine has two internal execution paths, one for a task descriptor and the other for a batch descriptor. Processing a task descriptor is straightforward, as all information required to complete the operation is contained within it \cite[Sec. 3.2]{intel:dsaspec}. For a batch, the \gls{dsa} reads the batch descriptor, then fetches all task descriptors from memory and processes them \cite[Sec. 3.8]{intel:dsaspec}. If configured to do so, an Engine coordinates with the operating system when it encounters a page fault and waits for its resolution; otherwise, an error is generated in this scenario \cite[Sec. 2.2, Block on Fault]{intel:dsaspec}. \par

 \textsc{Component \rom{3}, Groups:} Groups tie Engines and \glsentrylong{dsa:wq}s together, indicated by the dotted blue line around the components of Group 0 in Figure \ref{fig:dsa-internal-block}. This means that tasks from one \gls{dsa:wq} may be processed by multiple Engines and vice versa, depending on the configuration. This flexibility is achieved through the Group Arbiter, represented by the orange block in Figure \ref{fig:dsa-internal-block}, which connects the two components according to the user-defined configuration. \par

 \subsubsection{Virtual Address Resolution}

-An important aspect of modern computer systems is the separation of address spaces through virtual memory. Therefore, the \gls{dsa} must handle address translation because a process submitting a task will not know the physical location in memory, causing the descriptor to contain virtual addresses. To resolve these to physical addresses, the Engine communicates with the \gls{iommu} to perform this operation, as visible in the outward connections at the top of Figure \ref{fig:dsa-internal-block}. Knowledge about the submitting processes is required for this resolution. Therefore, each task descriptor has a field for the \gls{x86:pasid} which is filled by the \gls{x86:enqcmd} instruction for a \gls{dsa:swq} \cite[Sec. 3.3.1]{intel:dsaspec} or set statically after a process is attached to a \gls{dsa:dwq} \cite[Sec. 3.3.2]{intel:dsaspec}. \par
+An important aspect of modern computer systems is the separation of address spaces through virtual memory \todo{maybe add cite and reformulate}. Therefore, the \gls{dsa} must handle address translation because a process submitting a task will not know the physical location in memory \todo{of what}, causing the descriptor to contain virtual addresses. To resolve these to physical addresses, the Engine communicates with the \gls{iommu} to perform this operation, as visible in the outward connections at the top of Figure \ref{fig:dsa-internal-block}. Knowledge about the submitting processes is required for this resolution. Therefore, each task descriptor has a field for the \gls{x86:pasid} which is filled by the \gls{x86:enqcmd} instruction for a \gls{dsa:swq} \cite[Sec. 3.3.1]{intel:dsaspec} or set statically after a process is attached to a \gls{dsa:dwq} \cite[Sec. 3.3.2]{intel:dsaspec}. \par

 \subsubsection{Completion Signalling}
 \label{subsubsec:state:completion-signal}
@@ -105,6 +108,8 @@ With the appropriate file permissions, a process could submit work to the \gls{d

 With some limitations, such as lacking support for \gls{dsa:dwq} submission, this library presents an interface that takes care of creating and submitting descriptors as well as error handling and reporting. Thanks to this high-level view, code may choose the execution path at runtime, so that memory operations are executed either in hardware on an accelerator or in software using equivalent instructions provided by the library. This makes code using this library automatically compatible with systems that do not provide hardware support. \cite[Sec. Introduction]{intel:dmldoc} \par

+\todo{reformulate section}
+
 \section{Programming Interface for \glsentrylong{dsa}}
 \label{sec:state:dml}

@@ -121,7 +126,7 @@ In the function header of Figure \ref{fig:dml-memcpy} two differences from stand

 The \texttt{path} parameter allows the selection of the executing device, which can be either the CPU or \gls{dsa}. The options include \texttt{dml::software} (CPU), \texttt{dml::hardware} (\gls{dsa}), and \texttt{dml::automatic}, where the latter dynamically selects the device at runtime, favoring \gls{dsa} over CPU execution \cite[Sec. Quick Start]{intel:dmldoc}. \par

-Choosing the engine which carries out the copy might be advantageous for performance, as we can see in Section \ref{subsec:perf:datacopy}. With the engine directly tied to the processing node, as observed in Section \ref{subsection:dsa-hwarch}, the node ID is equivalent to the ID of the \gls{dsa}. \par
+Choosing the engine which carries out the copy might be advantageous for performance, as we can see in Section \ref{subsec:perf:datacopy}. With the engine directly tied to the processing node, as observed in Section \ref{subsection:dsa-hwarch} \todo{not really observed there, cite maybe}, the node ID is equivalent to the ID of the \gls{dsa}. \par

 \gls{intel:dml} operates on data views, which we create from the given pointers to source and destination and size. This is done using \texttt{dml::make\_view(uint8\_t* ptr, size\_t size)}, visible in Figure \ref{fig:dml-memcpy}, where these views are labelled \texttt{src\_view} and \texttt{dst\_view}. \cite[Sec. High-level C++ API, Make view]{intel:dmldoc} \par

@@ -138,7 +143,7 @@ In this section we provide a step-by-step guide to replicate the configuration b
    \item Use a kernel with IDXD driver support, available in Linux 5.10 and later \cite[Sec. Installation]{intel:dmldoc}, and append the following to the kernel boot parameters in the grub config: \texttt{intel\_iommu=on,sm\_on} \cite[p. 5]{lenovo:dsa}
    \item Verify correct detection of \gls{dsa} devices using \texttt{dmesg | grep idxd}, which should list as many devices as NUMA nodes on the system \cite[p. 5]{lenovo:dsa}
    \item Install \texttt{accel-config} from the repository \cite{intel:libaccel-config-repo} or the system package manager and inspect the detection of \gls{dsa} devices through the driver using \texttt{accel-config list -i} \cite[p. 6]{lenovo:dsa}
-    \item Create \gls{dsa} configuration file for which we provide an example under \texttt{benchmarks/configuration-files/8n1d1e1w.conf} in the accompanying repository \cite{thesis-repo} that is used for most benchmarks available. Then apply the configuration using \texttt{accel-config load-config -c [filename] -e} \cite[Fig. 3-9]{intel:dsaguide}
+    \item Create a \gls{dsa} configuration file, for which we provide an example under \texttt{benchmarks/configuration-files/8n1d1e1w.conf} in the accompanying repository \cite{thesis-repo} that is also applied for the benchmarks. Then apply the configuration using \texttt{accel-config load-config -c [filename] -e} \cite[Fig. 3-9]{intel:dsaguide}
    \item Inspect the now configured \gls{dsa} devices using \texttt{accel-config list} \cite[p. 7]{lenovo:dsa}; the output should match the desired configuration set in the file used
 \end{enumerate}

--- a/thesis/content/30_performance.tex
+++ b/thesis/content/30_performance.tex
@@ -5,6 +5,8 @@ In this chapter, we measure the performance of the \gls{dsa}, with the goal to d

 The performance of \gls{dsa} has been evaluated in great detail by Reese Kuper et al. in \cite{intel:analysis}. Therefore, we will perform only a limited number of benchmarks, with the purpose of verifying the figures from \cite{intel:analysis} and analysing best practices and restrictions for applying the \gls{dsa} to \gls{qdp}. \par

+\todo{reformulate}
+
 \section{Benchmarking Methodology}
 \label{sec:perf:method}

@@ -17,12 +19,18 @@ The performance of \gls{dsa} has been evaluated in great detail by Reese Kuper e

 The benchmarks were conducted on a dual-socket server equipped with two Intel Xeon Max 9468 CPUs, each with 4 nodes that have access to 16 GiB of \gls{hbm} and 12 cores. This results in a total of 96 cores and 128 GiB of \gls{hbm}. The layout of the system is visualized in Figure \ref{fig:perf-xeonmaxnuma}. For configuring it, we follow Section \ref{sec:state:setup-and-config}. \par

+\todo{refer to intel ark as cite}
+
 As \gls{intel:dml} does not have support for \glsentryshort{dsa:dwq}s, we run benchmarks exclusively with access through \glsentryshort{dsa:swq}s. The application written for the benchmarks can be obtained in source form under the directory \texttt{benchmarks} in the thesis repository \cite{thesis-repo}. \par

+\todo{refer back to state 2.4 where this is mentioned}
+
 The benchmark performs node setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the nodes passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts. We therefore also do not use \enquote{.block\_on\_fault()}, as we did for the memcpy-example in Section \ref{sec:state:dml}. \par

 Timing in the outer loop may display lower throughput than actual. This is the case, should one of the \gls{dsa}s participating in a given task finish earlier than the others. We decided to measure the maximum time and therefore minimum throughput for these cases, as we want the benchmarks to represent the peak achievable for distributing one task over multiple engines and not executing multiple tasks of a disjoint set. As a task can only be considered complete when all subtasks are completed, the minimum throughput represents this scenario. This may give an advantage to the peak CPU throughput benchmark we will reference later on, as it does not have this restriction placed upon it. \par
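
 The measurement rule above can be restated compactly. This is only a sketch in notation introduced here for illustration (\(S\), \(n\), \(t_i\) and \(T\) are assumed symbols, not taken from the benchmark code): let \(S\) be the total amount of data copied by one task that is split over \(n\) participating \gls{dsa}s, and let \(t_i\) be the time after which the \(i\)-th \gls{dsa} has completed its subtask. The reported throughput is then

 \[T = \frac{S}{\max_{1 \le i \le n} t_i}\]

 so the slowest participant determines the result for the whole task. \par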

+\todo{mention configuration used, potential for futurework: evaluate multiengine per group performance}
+
 \begin{figure}[!t]
    \centering
    \includegraphics[width=0.35\textwidth]{images/nsd-benchmark.pdf}
@@ -32,6 +40,8 @@ Timing in the outer loop may display lower throughput than actual. This is the c

 To get accurate results, the benchmark is repeated 10 times. Each iteration is timed from beginning to end, marked in yellow in Figure \ref{fig:benchmark-function:outer}. For small task sizes, the iterations complete in a very short amount of time, which can have adverse effects on the results. Therefore, we repeat the code of the inner loop a configurable number of times, effectively extending the duration of a single iteration for these cases. \par

+\todo{use concrete amount instead of "configurable"}
+
 \begin{figure}[!t]
    \centering
    \includegraphics[width=1.0\textwidth]{images/nsd-benchmark-inner.pdf}
@@ -44,7 +54,9 @@ For all \gls{dsa}s used in the benchmark, a submission thread executing the inne
 \section{Benchmarks}
 \label{sec:perf:bench}

-In this section we will In this section, we will present three benchmarks, each accompanied by setup information and a preview. We will then provide plots displaying the results, followed by a detailed analysis. We will formulate expectations and compare them with the observations from our measurements. \par
+In this section, we will present three benchmarks, each accompanied by setup information and a preview. We will then provide plots displaying the results, followed by a detailed analysis. We will formulate expectations and compare them with the observations from our measurements. \par
+
+\todo{reformulate}

 \subsection{Submission Method}
 \label{subsec:perf:submitmethod}
@@ -62,6 +74,8 @@ We anticipate that single submissions will consistently yield poorer performance

 From Figure \ref{fig:perf-submitmethod} we conclude that with transfers of 1 MiB and upwards, the submission method makes no noticeable difference. For smaller transfers the performance varies greatly, with batch operations leading in throughput. Reese Kuper et al. observed that \enquote{SWQ observes lower throughput between 1-8 KB [transfer size]} \cite[p. 6]{intel:analysis}. We, however, observe a much higher point of equalization, pointing to additional delays introduced by programming the \gls{dsa} through \gls{intel:dml}. \par

+\todo{submission method still makes a difference for 1mib, therefore recommend even larger transfers}
+
 Another limitation may be observed in this result, namely the inherent throughput limit per \gls{dsa} chip of close to 30 GiB/s. This is apparently caused by I/O fabric limitations \cite[p. 5]{intel:analysis}. \par

 \subsection{Multithreaded Submission}
@@ -89,17 +103,17 @@ The naming scheme contains the data rate in Megatransfers per second. We calcula

 The data width of DDR5 is 64 bit. We calculate the number of bytes per transfer. \cite{kingston:ddr5-spec-overview} \par

-\[\frac{64b}{8b/B}\ /\ Transfer = 8B\ /\ Transfer\]
+\[\frac{64b}{8b/B}\ /\ T = 8B\ /\ T\]

 Using the results from the previous calculations, we are now able to calculate the theoretical peak throughput speed per Node on our test system. \par

 \[9600\ MT/s * 8B/T = 76800\ MB/s \approx 71.5\ GiB/s\]

-We conclude that to achieve peak speeds, a copy task has to be split across multiple \gls{dsa}s. Different methods of splitting will be evaluated. Given that our system consists of multiple sockets, communication crossing between sockets could introduce latency and bandwidth disadvantages \cite{bench:heterogeneous-communication}. \par
+We conclude that to achieve peak speeds, a copy task has to be split across multiple \gls{dsa}s \todo{mention why we conclude this}. Different methods of splitting will be evaluated. Given that our system consists of multiple sockets, communication crossing between sockets could introduce latency and bandwidth disadvantages \cite{bench:heterogeneous-communication}, which we will also evaluate. \par
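
 A rough estimate illustrates this conclusion. It assumes, as a simplification, that only the per-device limit of close to 30 GiB/s observed in Section \ref{subsec:perf:submitmethod} bounds the throughput of a single \gls{dsa}; the symbols \(n\), \(B_{node}\) and \(B_{DSA}\) are introduced here for illustration only. To saturate the theoretical peak \(B_{node}\) calculated above with devices limited to \(B_{DSA}\), the number \(n\) of participating \gls{dsa}s would need to satisfy

 \[n \geq \frac{B_{node}}{B_{DSA}}\]

 As \(B_{node}\) is more than twice \(B_{DSA}\), a single \gls{dsa} cannot reach this peak and, under this estimate, at least three would be required. \par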

 To determine the optimal number of \gls{dsa}s, we will measure throughput for one, two, four, and eight participating in the copy operations. We name the utilization of two \gls{dsa}s \enquote{Push-Pull}, as with two accelerators, we utilise the ones found on the data source and destination nodes. As eight \gls{dsa}s is the maximum available on our system, this configuration will be referred to as \enquote{Brute-Force}. \par

-For this benchmark, we transfer 1 Gibibyte of data from node 0 to the destination node. We present data for nodes 8, 11, 12, and 15. To understand the selection, refer to Figure \ref{fig:perf-xeonmaxnuma}, which illustrates the node IDs of the configured systems and the corresponding storage technology. Node 8 accesses the \gls{hbm} on node 0, making it the physically closest possible destination. Node 11 is located diagonally on the chip, representing the farthest intra-socket operation benchmarked. Nodes 12 and 15 lie diagonally on the second socket's CPU, making them representative of inter-socket transfer operations. \par
+For this benchmark, we transfer 1 Gibibyte of data from node 0 to the destination node. We present data for nodes 8, 11, 12, and 15. To understand the selection, see Figure \ref{fig:perf-xeonmaxnuma}, which illustrates the node IDs of the configured systems and the corresponding storage technology. Node 8 accesses the \gls{hbm} on node 0, making it the physically closest possible destination. Node 11 is located diagonally on the chip, representing the farthest intra-socket operation benchmarked. Nodes 12 and 15 lie diagonally on the second socket's CPU, making them representative of inter-socket transfer operations. \par

 \begin{figure}[!t]
    \centering
@@ -140,6 +154,8 @@ Unexpected throughput differences are evident for all configurations, except the

 For the results of the Brute-Force approach illustrated in Figure \ref{fig:perf-dsa:8}, we observe peak speeds when copying across sockets from \gls{numa:node} 0 to \gls{numa:node} 15. This contradicts our assumption that peak bandwidth would be limited by the interconnect. However, for intra-node copies, there is an observable penalty for using the off-socket \gls{dsa}s. We will analyse this behaviour by comparing the different benchmarked configurations and summarize our findings on scalability. \par

+\todo{ensure consistent usage of gls for node}
+
 \begin{figure}[!t]
    \centering
    \begin{subfigure}{0.225\textwidth}
@@ -159,7 +175,7 @@ For the results of the Brute-Force approach illustrated in Figure \ref{fig:perf-
    \label{fig:perf-dsa-analysis}
 \end{figure}

-When comparing the Brute-Force approach with Push-Pull in Figure \ref{fig:perf-dsa:2}, performance decreases by utilizing four times more resources over a longer duration. As shown in Figure \ref{fig:perf-dsa-analysis:scaling}, using Brute-Force still leads to a slight increase in overall throughput, although far from scaling linearly. Therefore, we conclude that, although data movement across the interconnect incurs additional cost, no hard bandwidth limit is observable. \par
+When comparing the Brute-Force approach with Push-Pull in Figure \ref{fig:perf-dsa:2}, performance decreases by utilizing four times more resources over a longer duration. As shown in Figure \ref{fig:perf-dsa-analysis:scaling}, using Brute-Force still leads to a slight increase in overall throughput, although far from scaling linearly. Therefore, we conclude that, although data movement across the interconnect incurs additional cost, no hard bandwidth limit is observable. \todo{we state that it decreases but then that it increases} \par

 From the average throughput and scaling factors in Figure \ref{fig:perf-dsa-analysis}, it becomes evident that splitting tasks over more than two \gls{dsa}s yields only marginal gains. This could be due to increased congestion of the overall interconnect; however, as no hard limit is encountered, this is not a definitive answer. \par

@@ -168,7 +184,7 @@ The choice of a load balancing method is not trivial. If peak throughput of one
 \subsection{Data Movement using CPU}
 \label{subsec:perf:cpu-datacopy}

-For evaluating CPU copy performance we use the benchmark code from the subsequent Section \ref{subsec:perf:datacopy}, selecting the software instead of hardware execution path (see Section \ref{subsec:state:dsa-software-view}). Colleagues performed extensive benchmarking of the peak throughput on CPU for the test system \cite{xeonmax-peakthroughput}, from which we will present results as well. We compare expectations and results from the previous Section with the measurements.\par
+For evaluating CPU copy performance we use the benchmark code from the previous Section (Section \ref{subsec:perf:datacopy}), selecting the software instead of hardware execution path (see Section \ref{subsec:state:dsa-software-view}). Colleagues performed extensive benchmarking of the peak throughput on CPU for the test system \cite{xeonmax-peakthroughput}, from which we will present results as well. We compare expectations and results from the previous Section with the measurements.\par

 \begin{figure}[!t]
    \centering
@@ -189,22 +205,24 @@ For evaluating CPU copy performance we use the benchmark code from the subsequen
    \label{fig:perf-cpu}
 \end{figure}

-As evident from Figure \ref{fig:perf-cpu:swpath}, the observed throughput of software path is less than half of the theoretical bandwidth. Therefore, software path is to be treated as a compatibility measure, and not for providing high performance data copy operations. As the sole benchmark, the software path however performs as expected for transfers to \gls{numa:node}s 12 and 15, with the latter performing worse. Taking the layout from Figure \ref{fig:perf-xeonmaxnuma}, back in the previous Section, we assumed that \gls{numa:node} 12 would outperform \gls{numa:node} 15 due to lower physical distance. This assumption was invalidated, making the result for CPU in this case unexpected. \par
+As evident from Figure \ref{fig:perf-cpu:swpath}, the observed throughput of the software path is less than half of the theoretical bandwidth. Therefore, the software path is to be treated as a compatibility measure, and not as a means of providing high-performance data copy operations. As the sole benchmark, the software path however performs as expected for transfers to \gls{numa:node}s 12 and 15, with the latter performing worse \todo{false statement}. Taking the layout from Figure \ref{fig:perf-xeonmaxnuma}, back in the previous Section, we assumed that \gls{numa:node} 12 would outperform \gls{numa:node} 15 due to lower physical distance. This assumption was invalidated, making the result for CPU in this case unexpected. \par
+
+In Figure \ref{fig:perf-cpu:andrepeak}, peak throughput is achieved for intra-node operation. This validates the assumption that there is a cost for communicating across sockets, which was not as directly observable with the \gls{dsa}. The same disadvantage for \gls{numa:node} 12, as seen in Section \ref{subsec:perf:datacopy} can be observed in Figure \ref{fig:perf-cpu:andrepeak}. As the results from software path do not exhibit this, the anomaly seems to only occur in bandwidth-saturating scenarios \todo{false they exhibit it, but not as pronounced}. \par

-In Figure \ref{fig:perf-cpu:andrepeak}, peak throughput is achieved for intra-node operation. This validates the assumption that there is a cost for communicating across sockets, which was not as directly observable with the \gls{dsa}. The same disadvantage for \gls{numa:node} 12, as seen in Section \ref{subsec:perf:datacopy} can be observed in Figure \ref{fig:perf-cpu:andrepeak}. As the results from software path do not exhibit this, the anomaly seems to only occur in bandwidth-saturating scenarios. \par
+\todo{ensure correctness}

 \section{Analysis}
 \label{sec:perf:analysis}

-In this section we summarize the conclusions drawn from the three benchmarks performed in the sections above and outline a utilization guideline. We also compare CPU and \gls{dsa} for the task of copying data from \gls{dram} to \gls{hbm}. \par
+In this section we summarize the conclusions drawn from the three benchmarks performed in the sections above and outline a utilization guideline \todo{we dont do this, either write it or leave this out}. We also compare CPU and \gls{dsa} for the task of copying data from \gls{dram} to \gls{hbm} \todo{weird wording}. \par

 \begin{itemize}
-    \item From \ref{subsec:perf:submitmethod} we conclude that small copies under 1 MiB in size require batching and still do not reach peak performance. Task size should therefore be at or above 1 MiB and otherwise use the CPU.
+    \item From \ref{subsec:perf:submitmethod} we conclude that small copies under 1 MiB in size require batching and still do not reach peak performance. Task size should therefore be at or above 1 MiB and otherwise use the CPU \todo{why otherwise cpu, no direct link given}.
    \item Section \ref{subsec:perf:mtsubmit} assures that access from multiple threads does not negatively affect the performance when using \glsentrylong{dsa:swq} for work submission. Due to the lack of \glsentrylong{dsa:dwq} support, we have no data to determine the cost of submission to the \gls{dsa:swq}.
    \item In \ref{subsec:perf:datacopy}, we found that using more than two \gls{dsa}s results in only marginal gains. The choice of a load balancer therefore is the Push-Pull configuration, as it achieves fair throughput with low utilization.
 \end{itemize}

-Once again, we refer to Figures \ref{fig:perf-dsa} and \ref{fig:perf-cpu}, both representing the maximum throughput achieved with the utilization of either \gls{dsa} for the former and CPU for the latter. Noticeably, the \gls{dsa} does not seem to suffer from inter-socket overhead like the CPU. In any case, \gls{dsa} performs similar to the CPU, demonstrating potential for faster data movement while simultaneously freeing up cycles for other tasks. \par
+Once again, we refer to Figures \ref{fig:perf-dsa} and \ref{fig:perf-cpu}, which show the maximum throughput achieved with the \gls{dsa} and the CPU, respectively. Notably, the \gls{dsa} does not seem to suffer from inter-socket overhead like the CPU. In any case, the \gls{dsa} performs similarly to the CPU, demonstrating potential for faster data movement while simultaneously freeing up cycles for other tasks. \todo{ensure correctness and reformulate} \par

 We discovered an anomaly for \gls{numa:node} 12 for which we did not find an explanation. As the behaviour is also exhibited by the CPU, discovering the root issue falls outside the scope of this work. \par

--- a/thesis/content/40_design.tex
+++ b/thesis/content/40_design.tex
@ -18,13 +18,13 @@
 % wohl mindestens 8 Seiten haben, mehr als 20 können ein Hinweis darauf
 % sein, daß das Abstraktionsniveau verfehlt wurde.

-In this chapter we design a class interface for use as a general purpose cache. We will present challenges and then detail the solutions employed to face them, finalizing the architecture with each step. Details on the implementation of this blueprint will be omitted, as we discuss a selection of relevant aspects in Chapter \ref{chap:implementation}. We also shortly touch the subject of \gls{dsa} usage. \par
+In this chapter we design a class interface for use as a general purpose cache. We will present challenges and then detail the solutions employed to face them, finalizing the architecture with each step. Details on the implementation of this blueprint will be omitted, as we discuss a selection of relevant aspects in Chapter \ref{chap:implementation}. \par

 \section{Cache Design} \label{sec:design:cache}

-The task of prefetching is somewhat aligned with that of a cache. As a cache is more generic and allows use beyond \gls{qdp}, the decision was made to address the prefetching in \gls{qdp} by implementing an offloading \texttt{Cache}. Henceforth, when referring to the provided implementation, we will use \texttt{Cache}. \par
+The task of prefetching is somewhat aligned with that of a cache. As a cache is more generic and allows use beyond \gls{qdp}, the decision was made to address the prefetching in \gls{qdp} by implementing an offloading \texttt{Cache}. Henceforth, when referring to the provided implementation, we will use \texttt{Cache}. \todo{reformulate} \par

-The interface of \texttt{Cache} must provide three basic functions: (1) requesting a memory block to be cached, (2) accessing a cached memory block and (3) synchronizing cache with the source memory. The latter operation comes in to play when the data that is cached may also be modified, necessitating an update either from the source or vice versa. Due various setups and use cases for this cache, the user should also be responsible for choosing cache placement and the copy method. As re-caching is resource intensive, data should remain in the cache for as long as possible. We only flush entries, when lack of free cache memory requires it. \par
+The interface of \texttt{Cache} must provide three basic functions: (1) requesting a memory block to be cached, (2) accessing a cached memory block and (3) synchronizing cache with the source memory. The latter operation comes into play when the data that is cached may also be modified, necessitating synchronization. Due to the various setups and use cases for this cache, the user should also be responsible for choosing cache placement and the copy method. As re-caching is resource intensive, data should remain in the cache for as long as possible. We only flush entries when a lack of free cache memory requires it. \par

 \begin{figure}[!t]
    \centering
@ -36,12 +36,13 @@ The interface of \texttt{Cache} must provide three basic functions: (1) requesti
 \subsection{Interface}

 \todo{mention custom malloc for outsourcing allocation to user}
+\todo{give better intro for this subsection}

-To facilitate rapid integration and alleviate developer workload, we opted for a simple interface. Given that this work primarily focuses on caching static data, we only provide cache invalidation and not synchronization. The \texttt{Cache::Invalidate} function, given a memory address, will remove all entries for it from the cache. The other two operations, caching and access, are provided in one single function, which we shall henceforth call \texttt{Cache::Access}. This function receives a data pointer and size as parameters and takes care of either submitting a caching operation if the pointer received is not yet cached or returning the cache entry if it is. The user retains control over cache placement and the assignment of tasks to accelerators through mechanisms outlined in \ref{sec:design:accel-usage}. This interface is represented on the right block of Figure \ref{fig:impl-design-interface} labelled \enquote{Cache} and includes some additional operations beyond the basic requirements. \par
+Given that this work primarily focuses on caching static data, we only provide cache invalidation and not synchronization. The \texttt{Cache::Invalidate} function, given a memory address, will remove all entries for it from the cache. The other two operations, caching and access, are provided in one single function, which we shall henceforth call \texttt{Cache::Access}. This function receives a data pointer and size as parameters and takes care of either submitting a caching operation if the pointer received is not yet cached or returning the cache entry if it is. The user retains control over cache placement and the assignment of tasks to accelerators through mechanisms outlined in \ref{sec:design:accel-usage}. This interface is represented on the right block of Figure \ref{fig:impl-design-interface} labelled \enquote{Cache} and includes some additional operations beyond the basic requirements. \par

 Given the asynchronous nature of caching operations, users may opt to await their completion. This proves particularly beneficial when parallel threads are actively processing, and the current thread strategically pauses until its data becomes available in faster memory, thereby optimizing access speeds for local computations. \par

-To facilitate this process, the \texttt{Cache::Access} method returns an instance of an object referred to as \texttt{CacheData}.  Figure \ref{fig:impl-design-interface} documents the public interface for \texttt{CacheData} on the left block labelled as such Invoking \texttt{CacheData::GetDataLocation} provides access to a pointer to the location of the cached data. Additionally, the \texttt{CacheData::WaitOnCompletion} method is available, designed to return only upon the completion of the caching operation. During this period, the current thread will sleep, allowing unimpeded progress for other threads. To ensure that only pointers to valid memory regions are returned, this function must be called in order to update the cache pointer. It queries the completion state of the operation, and, on success, updates the cache pointer to the then available memory region. \par
+To facilitate this process \todo{not only for this, cachedata is also needed to allow async operation}, the \texttt{Cache::Access} method returns an instance of an object referred to as \texttt{CacheData}.  Figure \ref{fig:impl-design-interface} documents the public interface for \texttt{CacheData} on the left block labelled as such. Invoking \texttt{CacheData::GetDataLocation} provides access to a pointer to the location of the cached data. Additionally, the \texttt{CacheData::WaitOnCompletion} method is available, designed to return only upon the completion of the caching operation. During this period, the current thread will sleep, allowing unimpeded progress for other threads. To ensure that only pointers to valid memory regions are returned, this function must be called in order to update the cache pointer. It queries the completion state of the operation, and, on success, updates the cache pointer to the then available memory region. \par
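The calling pattern described above can be sketched as follows. This is a minimal single-threaded mock and not the thesis implementation: the copy that would be offloaded to the \gls{dsa} is simply performed with \texttt{std::memcpy} inside \texttt{WaitOnCompletion}, and returning a \texttt{std::shared\_ptr} from \texttt{Cache::Access} is an assumption made only to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <unordered_map>
#include <vector>

// Mock of CacheData: the "asynchronous" copy is completed lazily on wait.
class CacheData {
public:
    CacheData(const std::uint8_t* src, std::size_t size)
        : src_(src), buffer_(size) {}

    // Stand-in for waiting on the accelerator: the copy is performed here,
    // and only afterwards is the cache pointer updated.
    void WaitOnCompletion() {
        if (location_ != nullptr) return;  // already valid, nothing to wait for
        std::memcpy(buffer_.data(), src_, buffer_.size());
        location_ = buffer_.data();
    }

    // Remains nullptr until WaitOnCompletion has updated the cache pointer.
    std::uint8_t* GetDataLocation() const { return location_; }

private:
    const std::uint8_t* src_;
    std::vector<std::uint8_t> buffer_;
    std::uint8_t* location_ = nullptr;
};

// Mock of Cache: caching and access are combined in one function.
class Cache {
public:
    std::shared_ptr<CacheData> Access(const std::uint8_t* data, std::size_t size) {
        assert(data != nullptr);
        auto it = entries_.find(data);
        if (it != entries_.end()) return it->second;  // already cached
        auto entry = std::make_shared<CacheData>(data, size);
        entries_.emplace(data, entry);                // "submit" a caching operation
        return entry;
    }

    // Removes the entry for the given source address from the cache state.
    void Invalidate(const std::uint8_t* data) { entries_.erase(data); }

private:
    std::unordered_map<const std::uint8_t*, std::shared_ptr<CacheData>> entries_;
};
```

A caller would request the entry early, continue with other work, and call \texttt{WaitOnCompletion} before dereferencing the pointer returned by \texttt{GetDataLocation}.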

 \subsection{Cache Entry Reuse} \label{subsec:design:cache-entry-reuse}

@ -63,6 +64,7 @@ Due to its reliance on libnuma for memory allocation, \texttt{Cache} is exclusiv

 \todo{remove the libnuma restriction}
 \todo{mention undefined behaviour when writing to cache source before having waited for caching operation completion}
+\todo{mention undefined behaviour through read-after-write pass according to ordering guarantees by dsa}

 \section{Accelerator Usage}
 \label{sec:design:accel-usage}
--- a/thesis/content/50_implementation.tex
+++ b/thesis/content/50_implementation.tex
@ -22,11 +22,13 @@

 In this chapter, we concentrate on specific implementation details, offering an in-depth view of how the design promises outlined in Chapter \ref{chap:design} are realized. Firstly, we delve into the usage of locking and atomics to achieve thread safety. Finally, we apply the cache to \glsentrylong{qdp}, detailing the policies mentioned in Section \ref{sec:design:accel-usage} and presenting solutions for the challenges encountered. \par

+\todo{potentially mention that most of what this chapter discusses only is internal and therefore might affect but not necessitate action by the user}
+
 \section{Synchronization for Cache and CacheData}

 The usage of locking and atomics to achieve safe concurrent access has proven to be challenging. Their use is performance-critical, and mistakes may lead to deadlock. Consequently, these aspects constitute the most interesting part of the implementation, which is why this chapter will extensively focus on the details of their implementation. \par

-Throughout the following sections we will use the term \enquote{handler}, which was coined by \gls{intel:dml}, referring to an object associated with an operation on the accelerator. Through it, the state of a task may be queried, making the handler our connection to the asynchronously executed task. As we may split up one single copy into multiple distinct tasks for submission to multiple \gls{dsa}s, \texttt{CacheData} internally contains a vector of multiple of these handlers. \par
+Throughout the following sections we will use the term \enquote{handler} \todo{add backref to state 2.4 where handler is also mentioned}, which was coined by \gls{intel:dml}, referring to an object associated with an operation on the accelerator. Through it, the state of a task may be queried, making the handler our connection to the asynchronously executed task. As we may split up one single copy into multiple distinct tasks for submission to multiple \gls{dsa}s, \texttt{CacheData} internally contains a vector of multiple of these handlers. \par

 \subsection{Cache: Locking for Access to State} \label{subsec:implementation:cache-state-lock}

@ -34,6 +36,8 @@ To keep track of the current cache state the \texttt{Cache} will hold a referenc

 A map data structure was chosen to represent the current cache state, with the memory address of the entry as key and the \texttt{CacheData} instance as value. As the caching policy is controlled by the user, one datum may be requested for caching in multiple locations. To accommodate this, one map is allocated for each available \glsentrylong{numa:node} of the system. This can be exploited to reduce lock contention by separately locking each \gls{numa:node}'s state instead of utilizing a global lock. This ensures that \texttt{Cache::Access} and the implicit \texttt{Cache::Flush} it may cause cannot hinder progress of caching operations on other \gls{numa:node}s. Both \texttt{Cache::Clear} and a complete \texttt{Cache::Flush} as callable by the user will now iteratively perform their respective task per \gls{numa:node} state, also allowing other \gls{numa:node}s to progress. \par
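The per-node locking scheme can be illustrated with a small sketch. The names \texttt{NodeState} and \texttt{NodeLockedState} are hypothetical, and the map value is reduced to the cached size as a placeholder for the \texttt{CacheData} instance; the point shown is only that each node owns its own map and its own lock.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

// State of one NUMA node: a map guarded by a lock of its own.
struct NodeState {
    std::mutex lock;
    // source address -> cached size (placeholder for the CacheData instance)
    std::unordered_map<const std::uint8_t*, std::size_t> entries;
};

class NodeLockedState {
public:
    explicit NodeLockedState(std::size_t node_count) : nodes_(node_count) {
        assert(node_count > 0);
    }

    // Inserts an entry for one node; only that node's lock is taken.
    // Returns true if the entry was newly inserted.
    bool Insert(std::size_t node, const std::uint8_t* src, std::size_t size) {
        NodeState& state = nodes_.at(node);
        std::lock_guard<std::mutex> guard(state.lock);
        return state.entries.emplace(src, size).second;
    }

    bool Contains(std::size_t node, const std::uint8_t* src) {
        NodeState& state = nodes_.at(node);
        std::lock_guard<std::mutex> guard(state.lock);
        return state.entries.count(src) != 0;
    }

    // Clears node after node, so that the other nodes can make progress
    // while one node's state is locked.
    void Clear() {
        for (NodeState& state : nodes_) {
            std::lock_guard<std::mutex> guard(state.lock);
            state.entries.clear();
        }
    }

private:
    std::vector<NodeState> nodes_;
};
```

Because the same source address may be cached on several nodes, the address is a key in each node's map independently.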

+\todo{this is somewhat obscurely worded}
+
 Even with this optimization, in scenarios where the \texttt{Cache} is frequently tasked with flushing and re-caching by multiple threads from the same node, lock contention will negatively impact performance by delaying cache access. Due to passive waiting, this impact might be less noticeable when other threads on the system are able to make progress during the wait. \par

 \subsection{CacheData: Fair Threadsafe Implementation for WaitOnCompletion}
@ -42,10 +46,15 @@ The choice made in \ref{subsec:design:cache-entry-reuse} necessitates thread-saf

 As we aim to minimize the time spent in a locked region, only the task is added to the \gls{numa:node}'s cache state when locked, with the submission taking place outside the locked region. We assume the handlers of \gls{intel:dml} to be unsafe for access from multiple threads. To achieve the safety for \texttt{CacheData::WaitOnCompletion}, outlined in \ref{subsec:design:cache-entry-reuse}, threads need to coordinate which one performs the actual waiting. To avoid queuing multiple copies of the same task, the task must be added to the cache state before submission. These two aspects necessitate modifying the handlers atomically. We therefore use an atomic pointer in \texttt{CacheData} to allow safe exchange and waiting on modification. \par

+\todo{this paragraph really does not flow from the previous}
+
 Using \texttt{std::shared\_ptr<T>} also introduces uncertainty, relying on the implementation to be performant. The standard does not specify whether a lock-free algorithm is to be used, and \cite{shared-ptr-perf} suggests abysmal performance for some implementations, although the full article is in Korean. No further research was found on this topic. \par

+\todo{use of "also" is unfitting as there is no previous antipoint to sharedptr}
+
 Therefore, the decision was made to implement atomic reference counting for \texttt{CacheData}. This involves providing a custom constructor and destructor wherein a shared atomic integer is incremented or decremented, respectively, using atomic fetch-add and fetch-sub operations to modify the reference count. If the count drops to zero, the destructor was called for the last reference and performs the actual destruction. \par
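A minimal sketch of such atomic reference counting is given below. The class \texttt{CountedEntry} is a hypothetical stand-in that models only the counting; the \texttt{destroyed\_flag} parameter exists purely to make the final destruction observable and is not part of the described design.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for CacheData that models only the reference counting.
class CountedEntry {
public:
    // destroyed_flag is set once the last reference performs the destruction.
    explicit CountedEntry(bool* destroyed_flag)
        : refs_(new std::atomic<std::int32_t>(1)), destroyed_flag_(destroyed_flag) {
        assert(destroyed_flag != nullptr);
    }

    // Copying shares the counter and atomically increments it.
    CountedEntry(const CountedEntry& other)
        : refs_(other.refs_), destroyed_flag_(other.destroyed_flag_) {
        refs_->fetch_add(1);
    }

    CountedEntry& operator=(const CountedEntry&) = delete;

    ~CountedEntry() {
        // fetch_sub returns the previous value: 1 means this was the last reference.
        if (refs_->fetch_sub(1) == 1) {
            *destroyed_flag_ = true;  // the actual destruction would happen here
            delete refs_;
        }
    }

    std::int32_t ReferenceCount() const { return refs_->load(); }

private:
    std::atomic<std::int32_t>* refs_;
    bool* destroyed_flag_;
};
```

Since \texttt{fetch\_sub} returns the value held before the decrement, exactly one destructor call observes the value 1 and is therefore the only one to release the shared resources.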

+
 \begin{figure}[!t]
    \centering
    \includegraphics[width=0.9\textwidth]{images/seq-blocking-wait.pdf}
@ -57,6 +66,8 @@ Due to the possibility of access by multiple threads, the implementation of \tex

 To illustrate this, an exemplary scenario is used, as seen in the sequence diagram in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}. Assume that three threads \(T_1\), \(T_2\) and \(T_3\) wish to access the same resource. \(T_1\) is the first to call \texttt{Cache::Access} and therefore adds it to the cache state and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted and \(T_2\) and \(T_3\) obtain access to the incomplete \texttt{CacheData} on which they wait. They see a \texttt{nullptr} for the handlers and an invalid cache pointer, leading to an atomic wait on the cache pointer (marked blue lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). \(T_1\) submits the work and sets the handlers (marked red lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) and \(T_3\) continue to wait. Consequently, only \(T_1\) can trigger the waiting and is capable of keeping \(T_2\) and \(T_3\) from progressing. This is undesirable, as it can lead to deadlock if for some reason \(T_1\) does not wait, and at the very least may lead to unnecessary delay for \(T_2\) and \(T_3\) if \(T_1\) does not wait immediately. \par

+\todo{both paragraphs above are complicated to read}
+
 \begin{figure}[!t]
    \centering
    \includegraphics[width=1.0\textwidth]{images/nsd-cachedata-waitoncompletion.pdf}
@ -66,6 +77,8 @@ To illustrate this, an exemplary scenario is used, as seen in the sequence diagr

 As a solution for this, a more intricate implementation is required. When waiting, the threads now immediately check whether the cache pointer contains a valid value and return if it does, as nothing has to be waited for in this case. We will use the same example as before to illustrate the second part of the waiting procedure. Both \(T_2\) and \(T_3\) arrive in this latter section as the cache pointer was invalid at the point in time when waiting was called for. They now atomically wait on the handlers-pointer to change, instead of waiting on the cache pointer as before. Now when \(T_1\) supplies the handlers, it also uses \texttt{std::atomic<T>::notify\_one} \cite{cppreference:atomic-notify-one} to wake at least one thread waiting on value change of the handlers-pointer, if there are any. Through this, the exclusion that was observable in the first implementation is already avoided. If nobody is waiting, then the handlers will be set to a valid pointer and a thread may pass the atomic wait instruction later on. Following this wait, the handlers-pointer is atomically exchanged \cite{cppreference:atomic-exchange} with \texttt{nullptr}, invalidating it. Each thread then checks whether it has received a valid local pointer to the handlers from the exchange. If it has, then the atomic operation guarantees that it is now in sole possession of the pointer. This owning thread is tasked with actually waiting on the handlers and subsequently updates the cache pointer. All other threads fall back and call \texttt{CacheData::WaitOnCompletion} again. \par

+\todo{complicated formulation too, write it with more references to the pseudocode}
+
 \subsection{CacheData: Edge Cases and Deadlocks}

 With the outlines of a fair implementation of \texttt{CacheData::WaitOnCompletion} drawn, we will now move our focus to the safety of \texttt{CacheData}. Specifically the following Sections will discuss possible deadlocks and their resolution. \par
@ -87,20 +100,29 @@ To circumvent this deadlock, the initial state of \texttt{CacheData} was modifie

 \texttt{CacheData::WaitOnCompletion} first checks for a valid cache pointer and then waits on the handlers becoming valid. To process the handlers, the global atomic pointer is read into a local copy and then set to \texttt{nullptr} using \texttt{std::atomic<T>::exchange}. During evaluation of the handlers' completion states, an unsuccessful operation may be found. In this case, the cache memory region remains invalid and may therefore not be used, leaving both the handlers and the cache pointer as \texttt{nullptr}. This results in an invalid state, like the one discussed in Section \ref{subsubsec:impl:cdatomicity:initial-invalid-state}. \par

+\todo{reference the figure above in here}
+\todo{double "in this case"}
+
 In this invalid state, progress is not guaranteed by the measures set forth to handle the initial invalidity. The cache is still \texttt{nullptr} and as the handlers have already been set and processed, they will also be \texttt{nullptr} without the chance of them ever becoming valid. As a solution, edge case handling is introduced and the cache pointer is set to the source address, providing validity. \par

+\todo{consider elaborating on the used edge case handling, we set "delete" to false basically}
+
 \subsubsection{Locally Invalid State due to Race Condition}

 The guarantee of \texttt{std::atomic<T>::wait} to only wake up when the value has changed \cite{cppreference:atomic-wait} was found to be stronger than the promise of waking up all waiting threads with \texttt{std::atomic<T>::notify\_all} \cite{cppreference:atomic-notify-all}. \par

 As visible in Figure \ref{fig:impl-cachedata-waitoncompletion}, we wait while the handlers-pointer is \texttt{nullptr}, if the cache pointer is invalid. To exemplify we use the following scenario. Both \(T_1\) and \(T_2\) call \texttt{CacheData::WaitOnCompletion}, with \(T_1\) preceding \(T_2\). \(T_1\) exchanges the global handlers-pointer with \texttt{nullptr}, invalidating it. Before \(T_1\) can check the status of the handlers and update the cache pointer, \(T_2\) sees an invalid cache pointer and then waits for the handlers becoming available. \par

+\todo{reference the figure better here, wording is quite complicated}
+
 This has again caused a similar state of invalidity as the previous two Sections handled. As the handlers will not become available again due to being cleared by \(T_1\), the second consumer, \(T_2\), will now wait indefinitely. This missed update is commonly referred to as \enquote{ABA-Problem} for which multiple solutions exist. \par

 One could use double-width atomic operations and introduce a counter which would allow resetting the pointer back to null while setting a flag indicating the exchange took place. The handlers-pointer would then be contained in a struct with this flag, allowing exchange with a composite of \texttt{nullptr} and flag-set. Other threads would then wait on the struct changing from \texttt{nullptr} and flag-unset, allowing them to pass if either the flag is set or the handlers have become non-null. As standard C++ does not yet support the required operations \cite{dwcas-cpp}, we chose to avoid the missed update differently. \par

 The chosen solution for this is to not exchange the handlers-pointer with \texttt{nullptr} but with a second invalid value. We must determine a secondary invalid pointer for use in the exchange. Therefore, we introduce a new attribute, of the same type as the one pointed to by the handlers-pointer, to \texttt{Cache}. The \texttt{Cache} then shares it with each instance of \texttt{CacheData}, where it is then used in \texttt{CacheData::WaitOnCompletion}. \par

+\todo{improve the explanation of secondary invalid, we have an empty handlers vector in cache and then pass a ref to it for use as a safe invalid pointer}
+
 This secondary value allows \(T_2\) to pass the wait, then perform the exchange of handlers itself. \(T_2\) then checks the local copy of the handlers-pointer for validity. The invalid state now includes both \texttt{nullptr} and the secondary invalid pointer chosen. With this, the deadlock is avoided and \(T_2\) will wait for \(T_1\) completing the processing of the handlers. \par

 \section{Application to \glsentrylong{qdp}}
@ -110,8 +132,12 @@ Applying the \texttt{Cache} to \gls{qdp} is a straightforward process. We adapte

 During the performance analysis of the developed \texttt{Cache}, we discovered that \gls{intel:dml} does not utilize interrupt-based completion signalling (Section \ref{subsubsec:state:completion-signal}), but instead employs busy-waiting on the completion descriptor being updated. Given that this busy waiting consumes CPU cycles, waiting on task completion is deemed impractical, necessitating code modifications. We extended \texttt{CacheData} and \texttt{Cache} to incorporate support for weak waiting. By introducing a flag configurable in \texttt{Cache}, all instances of \texttt{CacheData} created via \texttt{Cache::Access} will check only once whether the \gls{dsa} has completed processing the \texttt{Cache} operation, and if not, return without updating the cache pointer. Consequently, calls to \texttt{CacheData::GetDataLocation} may return \texttt{nullptr} even after waiting, placing the responsibility on the user to access the data through its source location. For applications prioritizing latency, \texttt{Cache::Access} offers the option for weak access. When activated, the function returns only existing instances of \texttt{CacheData}, thereby avoiding work submission to the \gls{dsa} if the address has not been previously cached or was flushed since the last access. Using these two options, we can avoid work submission and busy waiting where access latency is paramount. \par

+\todo{write our observations and then link to the to-be-added section describing these updates}
+
 Additionally, we observed inefficiencies stemming from page fault handling. Task execution time increases when page faults are handled by the \gls{dsa}, leading to cache misses. Consequently, our execution time becomes bound to that of \gls{dram}, as misses prompt a fallback to the data's source location. When page faults are handled by the CPU during allocation, these misses are avoided. However, the execution time of the first data access through the \texttt{Cache} significantly increases due to page fault handling. One potential solution entails bypassing the system's memory management by allocating a large memory block and implementing a custom memory management scheme. As memory allocation is a complex topic, we opted to delegate this responsibility to the user by mandating the provision of new- and free-like functions akin to the policy functions utilized for determining placement and task distribution. Consequently, the benchmark can pre-allocate the required memory blocks, trigger page mapping, and subsequently pass these regions to the \texttt{Cache}. With this we also remove the dependency on libnuma, mentioned as a restriction in Section \ref{subsec:design:restrictions}. \par
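How user-provided allocation functions might look is sketched below. The signatures \texttt{AllocFn} and \texttt{FreeFn} as well as the class \texttt{PrefaultedPool} are assumptions made for illustration and are not taken from the implementation; the sketch also ignores \gls{numa:node} placement, which a real allocator would have to honour.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Assumed shapes of the new- and free-like functions handed to the cache.
using AllocFn = std::function<std::uint8_t*(int node, std::size_t size)>;
using FreeFn = std::function<void(std::uint8_t* ptr, std::size_t size)>;

// Bump allocator over one block that is allocated and written up front.
class PrefaultedPool {
public:
    // The vector zero-initializes every byte, so all pages of the block are
    // touched here instead of on first access through the cache.
    explicit PrefaultedPool(std::size_t capacity) : block_(capacity) {
        assert(capacity > 0);
    }

    // Hands out the next free region, or nullptr if the pool is exhausted.
    // The node argument is ignored in this sketch.
    std::uint8_t* Allocate(int /*node*/, std::size_t size) {
        if (size > block_.size() - used_) return nullptr;
        std::uint8_t* region = block_.data() + used_;
        used_ += size;
        return region;
    }

    // Regions are released together with the pool; nothing to do per region.
    void Free(std::uint8_t* /*ptr*/, std::size_t /*size*/) {}

private:
    std::vector<std::uint8_t> block_;
    std::size_t used_ = 0;
};
```

A benchmark could construct such a pool during initialization and wrap \texttt{Allocate} and \texttt{Free} in lambdas matching \texttt{AllocFn} and \texttt{FreeFn}, so that no page faults are left to be handled during the measured accesses.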

+\todo{dont write about modifying the design but adapt the design and add forward references there to here to explain necessity}
+
 %%% Local Variables:
 %%% TeX-master: "diplom"
 %%% End:
--- a/thesis/content/60_evaluation.tex
+++ b/thesis/content/60_evaluation.tex
@ -10,25 +10,29 @@
 % bezüglich der Ergebnisse zu erläutern und anschließend eventuell
 % festgestellte Abweichungen zu erklären.

-In this chapter, we establish anticipated outcomes for incorporating the developed Cache into the context of \glsentrylong{qdp}, followed by a comprehensive assessment of the achieved results. The specifics of the benchmark are elaborated upon in Section \ref{sec:impl:application}. We conclude with a discussion of the choices made regarding the benchmarking methodology and their influence on our results. \par
+In this chapter, we establish anticipated outcomes for incorporating the developed \texttt{Cache} into \glsentrylong{qdp}, followed by a comprehensive assessment of the achieved results. The specifics of the benchmark are elaborated upon in Section \ref{sec:impl:application}. We conclude with a discussion of the choices made regarding the benchmarking methodology and their influence on our results. \todo{extend last sentence with what we actually do in 6.4} \par

 \section{Benchmarked Task}
 \label{sec:eval:bench}

 The benchmark involves the execution of a simple query, as depicted in Figure \ref{fig:qdp-simple-query}. We will henceforth denote \(SCAN_a\) as the pipeline responsible for scanning and subsequently filtering column \texttt{a}, \(SCAN_b\) as the pipeline tasked with prefetching column \texttt{b} and \(AGGREGATE\) as the projection and final summation step. The column size utilized is set at 4 GiB. The workload is distributed across multiple groups, with each group spawning threads for every pipeline step. To ensure equitable comparison, each tested configuration employs 64 threads for the initial stage (\(SCAN_a\) and \(SCAN_b\)) and 32 subsequently (\(AGGREGATE\)), while being constrained to execute on \gls{numa:node} 0 through pinning. For configurations without prefetching, \(SCAN_b\) is omitted. We measure total and per-pipeline duration and cache hit percentage for prefetching. \par

-The pipelines \(SCAN_a\) and \(SCAN_b\) execute concurrently, completing their tasks before signalling \(AGGREGATE\) for finalization. In a bid to enhance the cache hit rate, we opted to relax this constraint, allowing \(SCAN_b\) to operate independently, while only synchronizing \(SCAN_a\) with \(AGGREGATE\). Consequently, work is submitted to the \gls{dsa} as frequently as possible, aiming to complete caching operations for a chunk of \texttt{b} before \(SCAN_a\) finalizes processing the corresponding part of \texttt{a}. \par
+The pipelines \(SCAN_a\) and \(SCAN_b\) execute concurrently, completing their tasks before signalling \(AGGREGATE\) for finalization. In a bid to enhance the cache hit rate, we opted to relax this constraint, allowing \(SCAN_b\) to operate independently, while only synchronizing \(SCAN_a\) with \(AGGREGATE\). Consequently, work is submitted to the \gls{dsa} as frequently as possible, aiming to complete caching operations for a chunk of \texttt{b} before \(SCAN_a\) finalizes processing the corresponding part of \texttt{a}. \todo{mention that this has one of two side effects: queue overflow or requiring larger task sizes to prevent the overflow} \par
+
+\todo{possibly describe iteration count, warmup and averaging over iterations here}

 \section{Expectations}
 \label{sec:eval:expectations}

-The simple query presents a challenging scenario for the \texttt{Cache}. The execution time for the filter operation applied to column \texttt{a} is expected to be brief. Consequently, the \texttt{Cache} has limited time for prefetching, which may exacerbate delays caused by processing overhead in the \texttt{Cache} or during accelerator offload. Furthermore, it can be assumed that the \(SCAN_a\) is memory-bound by itself. Since the prefetching of \texttt{b} in \(SCAN_b\) and the loading and subsequent filtering of \texttt{a} occur concurrently, caching directly diminishes the memory bandwidth available to \(SCAN_a\) when both columns are located on the same \gls{numa:node}. \par
+The simple query presents a challenging scenario for the \texttt{Cache}. The execution time for the filter operation applied to column \texttt{a} is expected to be brief. Consequently, the \texttt{Cache} has limited time for prefetching, which may exacerbate delays caused by processing overhead in the \texttt{Cache} or during accelerator offload. Furthermore, it can be assumed that \(SCAN_a\) is memory-bound by itself. Since the prefetching of \texttt{b} in \(SCAN_b\) and the loading and subsequent filtering of \texttt{a} occur concurrently, caching directly reduces the memory bandwidth available to \(SCAN_a\) when both columns are located on the same \gls{numa:node}. \par

 \section{Observations}
 \label{sec:eval:observations}

 In this section, we will present our findings from integrating the \texttt{Cache} developed in Chapters \ref{chap:design} and \ref{chap:implementation} into \gls{qdp}. We commence by presenting results obtained without prefetching, which serve as a reference for evaluating the effectiveness of our \texttt{Cache}. For all results presented, the number of threads per pipeline and the number of groups influence performance \cite{dimes-prefetching}; analysing this influence is, however, outside the scope of this work. Therefore, we only present the best benchmarked results for each configuration. \par

+\todo{wording of last sentence is a bit odd}
+
 \subsection{Benchmarks without Prefetching}

 We benchmarked two methods to establish a baseline and an upper limit as reference points. In the former, all columns are located in \glsentryshort{dram}. The latter method simulates perfect prefetching without delay and overhead by placing column \texttt{b} in \gls{hbm} during benchmark initialization. \par
@ -78,6 +82,8 @@ To address the challenges posed by sharing memory bandwidth between both \(SCAN\

 The slowdown relative to our baseline when utilizing the \texttt{Cache} may be surprising at first glance. However, this result becomes reasonable when we consider that in this scenario, the \gls{dsa}s executing the caching tasks compete for bandwidth with the \(SCAN_a\) pipeline threads, and there is additional overhead from the \texttt{Cache}. Distributing the columns across different \gls{numa:node}s then results in a noticeable performance increase compared to our baseline, although it does not reach the upper boundary set by simulating perfect prefetching. This confirms our assumption that the \(SCAN_a\) pipeline itself is bandwidth-bound, as without this contention, we observe an increase in cache hit rate and a decrease in processing time. We will now examine the performance in more detail with per-pipeline timings. \par

+\todo{add reference to the table for this paragraph}
+
 \begin{figure}[!t]
    \centering
    \begin{subfigure}[!t]{0.45\textwidth}
@ -99,15 +105,17 @@ The slowdown below our baseline when utilizing the \texttt{Cache} may be surpris

 In Figure \ref{fig:timing-results:prefetch}, the competition for bandwidth between \(SCAN_a\) and \(SCAN_b\) is evident, with \(SCAN_a\) showing significantly longer execution times. This prolonged duration of execution leads to extended overlaps between groups still processing \(SCAN_a\) and those engaged in \(AGGREGATE\). Consequently, despite the relatively high cache hit rate, minimal speed-up is observed for \(AGGREGATE\) compared to the baseline depicted in Figure \ref{fig:timing-comparison:baseline}. The extended runtime can be attributed to the prolonged duration of \(SCAN_a\). \par

+\todo{possibly mention that scanb/scana parallel have the same length as scanb is not directly affected by the bandwidth contention - it only depicts the overhead of caching and work submission - as the dsa itself performs the copy}
+
 Regarding the benchmark depicted in Figure \ref{fig:timing-results:prefetch}, where we distributed columns \texttt{a} and \texttt{b} across two nodes, the parallel execution of prefetching tasks on the \gls{dsa} does not directly reduce the bandwidth available to \(SCAN_a\). However, there is a discernible overhead associated with cache utilization, as evident in the time spent in \(SCAN_b\). Consequently, both \(SCAN_a\) and \(AGGREGATE\) experience slightly longer execution times than the theoretical peak exhibited by our upper limit in Figure \ref{fig:timing-comparison:upplimit}. \par

 \section{Discussion}

 In Section \ref{sec:eval:expectations}, we anticipated that the simple query would pose a challenging case for prefetching. This expectation proved to be accurate, highlighting that improper data distribution can lead to adverse effects on performance when utilizing the \texttt{Cache}. Thus, we consider the chosen scenario to be well-suited, as it showcases both performance gains and losses, underscoring the importance of optimizing parameters and scenarios to achieve positive outcomes. \par

-The necessity to distribute data across \gls{numa:node}s is seen as practical, given that developers commonly apply this optimization to leverage the available memory bandwidth of \glsentrylong{numa}s. Consequently, the \texttt{Cache} has demonstrated its effectiveness by achieving a respectable speed-up positioned directly between the baseline and the theoretical upper limit (refer to Table \ref{table:qdp-speedup}). \par
+The necessity to distribute data across \gls{numa:node}s is seen as practical, given that developers commonly apply this optimization to leverage the available memory bandwidth of \glsentrylong{numa} systems. Consequently, the \texttt{Cache} has demonstrated its effectiveness by achieving a respectable speed-up positioned directly between the baseline and the theoretical upper limit (see Table \ref{table:qdp-speedup}). \par

-As stated in Section \ref{sec:design:cache}, the decision to design and implement a cache instead of focusing solely on prefetching was made to enhance the usefulness of this work's contribution. While our tests were conducted on a system with \gls{hbm}, other advancements in main memory technologies, such as Non-Volatile or Remote Memory, were not considered, as mentioned in Chapter \ref{chap:intro}. Despite the public functions of the \texttt{Cache} being named with cache usage in mind, its utility extends beyond this scope, providing flexibility through the policy functions, described in Section \ref{sec:design:accel-usage}. Potential applications include background copying of data from remote locations into faster local memory for computation or replication to non-volatile memory for data loss prevention. Therefore, we consider the increase in design complexity to be a worthwhile trade-off, providing a significant contribution to the field of heterogeneous memory systems. \par
+As stated in Section \ref{sec:design:cache}, the decision to design and implement a cache instead of focusing solely on prefetching was made to enhance the usefulness of this work's contribution. While our tests were conducted on a system with \gls{hbm}, other advancements in main memory technologies, such as Non-Volatile or Remote Memory, as mentioned in Chapter \ref{chap:intro}, were not considered. Despite the public functions of the \texttt{Cache} being named with cache usage in mind, its utility extends beyond this scope, providing flexibility through the policy functions, described in Section \ref{sec:design:accel-usage}. Potential applications include background copying of data from remote locations into faster local memory for computation or replication to non-volatile memory for data loss prevention. Therefore, we consider the increase in design complexity to be a worthwhile trade-off, providing a significant contribution to the field of heterogeneous memory systems. \par


 %%% Local Variables:
--- a/thesis/content/70_conclusion.tex
+++ b/thesis/content/70_conclusion.tex
@ -18,15 +18,15 @@

 In this work, our aim was to analyse the architecture and performance of the \glsentrylong{dsa} and integrate it into \glsentrylong{qdp}. We characterized the hardware and software architecture of the \gls{dsa} in Section \ref{sec:state:dsa} and provided an overview of the available programming interface, \glsentrylong{intel:dml}, in Section \ref{sec:state:dml}. Our benchmarks were tailored to the planned application and included evaluations such as copy performance from \glsentryshort{dram} to \gls{hbm} (Section \ref{subsec:perf:datacopy}), the cost of multithreaded work submission (Section \ref{subsec:perf:mtsubmit}), and an analysis of different submission methods and sizes (Section \ref{subsec:perf:submitmethod}). Notably, we observed an anomaly in inter-socket copy speeds and found that the scaling of throughput was distinctly below linear (see Figure \ref{fig:perf-dsa-analysis:scaling}). Although not all observations were explainable, the results provided important insights into the behaviour of the \gls{dsa} and its potential application in multi-socket systems and \gls{hbm}, complementing existing analyses \cite{intel:analysis}. \par

-Upon applying the cache developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}, we encountered challenges related to available memory bandwidth and the lack of feature support in the \glsentryshort{api} used to interact with the \gls{dsa}. While the \texttt{Cache} represents a substantial contribution to the field, we acknowledge its limitations and effectiveness. At present, its applicability is constrained to data that is infrequently mutated. Although support exists for entry invalidation, no ordering guarantees are provided, necessitating manual invalidation efforts by the programmer to ensure result validity (see Section \ref{subsec:design:restrictions}). To address this, a custom type could be developed to automatically trigger invalidation through the cache upon modification, utilizing timestamps or unique identifiers to verify that results are based on the current state. \par
+Upon applying the cache developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}, we encountered challenges related to available memory bandwidth and the lack of feature support in the \glsentryshort{api} used to interact with the \gls{dsa}. While the \texttt{Cache} represents a substantial contribution to the field, we acknowledge its limitations and effectiveness \todo{weird wording, one would acknowledge negatives but not positives too}. At present, its applicability is constrained to data that is infrequently mutated. Although support exists for entry invalidation, no ordering guarantees are provided, necessitating manual invalidation efforts by the programmer to ensure result validity \todo{efforts are not spent invalidating but ensuring validity, effort is required to remember everything in cache and then invalidate} (see Section \ref{subsec:design:restrictions}). To address this, a custom \todo{potentially call this a cache-accessor-type and elaborate slightly on how it works, holds the data pointer internally and only allows modification through its functions which then invalidate as last step} type could be developed to automatically trigger invalidation through the cache upon modification, utilizing timestamps or unique identifiers to verify that results are based on the current state \todo{spongy wording}. \par

 In Section \ref{sec:eval:observations}, we observed adverse effects when prefetching with the cache during the parallel execution of memory-bound operations. This necessitated data distribution across multiple \glsentrylong{numa:node}s to circumvent bandwidth competition caused by parallel caching operations. Despite this limitation, we do not consider it a major fault of the \texttt{Cache}, as existing applications designed for \gls{numa} systems are likely already optimized in this regard. \par

-As highlighted in Sections \ref{sec:state:dml} and \ref{sec:impl:application}, the \gls{api} utilized to interact with the \gls{dsa} currently lacks support for interrupt-based completion waiting and the use of \glsentrylong{dsa:dwq}. Future development efforts may focus on direct \gls{dsa} access, bypassing the \glsentrylong{intel:dml}, to leverage the complete feature set. Particularly, interrupt-based waiting would significantly enhance the usability of the \texttt{Cache}, which currently only supports busy-waiting. \par
+As highlighted in Sections \ref{sec:state:dml} and \ref{sec:impl:application}, the \gls{api} utilized to interact with the \gls{dsa} currently lacks support for interrupt-based completion waiting and the use of \glsentrylong{dsa:dwq}. Future development efforts may focus on direct \gls{dsa} access, bypassing the \glsentrylong{intel:dml}, to leverage the complete feature set. Particularly, interrupt-based waiting would significantly enhance the usability of the \texttt{Cache}, which currently only supports busy-waiting\todo{mention that busy waiting goes against the rationale of offloading data copy, only useful to reduce power consumption for sync-copy, cite dsaanalysis for this}. \todo{we could also mention dwq to reduce overhead of cache and therefore time spent in scanb} \par

-Although the preceding paragraphs and the results in Chapter \ref{chap:evaluation} might suggest that the \texttt{Cache} requires extensive refinement for production applications, we argue the opposite. Under favourable conditions, as assumed for \glsentryshort{numa}-aware applications, we observed significant speed-up using the \texttt{Cache} for prefetching to \glsentrylong{hbm}, accelerating database queries. Its utility is not limited to prefetching alone; it offers a solution for handling transfers from remote memory to faster local storage or replicating data to Non-Volatile RAM. Further evaluation of the caches effectiveness in these scenarios and additional benchmarks on more complex queries for \gls{qdp} may provide deeper insights into its performance in for these applications. \par
+Although the preceding paragraphs and the results in Chapter \ref{chap:evaluation} might suggest that the \texttt{Cache} requires extensive refinement for production applications, we argue the opposite. Under favourable conditions, as assumed for \glsentryshort{numa}-aware applications, we observed significant speed-up using the \texttt{Cache} for prefetching to \glsentrylong{hbm}, accelerating database queries. Its utility is not limited to prefetching alone; it offers a solution for handling transfers from remote memory to faster local storage or replicating data to Non-Volatile RAM. Further evaluation of the cache's effectiveness in these scenarios and additional benchmarks on more complex queries for \gls{qdp} may provide deeper insights into its applicability. \par

-In conclusion, the \texttt{Cache}, together with the sections on the \gls{dsa}'s architecture (\ref{sec:state:dsa}) and performance characteristics (\ref{sec:perf:bench}), fulfil the stated goal of this work. We have achieved performance gains through the \gls{dsa} in \gls{qdp}, thereby demonstrating its potential to facilitate the exploitation of the properties offered by the various storage tiers in heterogeneous memory systems. \par
+In conclusion, the \texttt{Cache}, together with the sections on the \gls{dsa}'s architecture (Section \ref{sec:state:dsa}) and performance characteristics (Section \ref{sec:perf:bench}), fulfils the stated goal of this work. We have achieved performance gains through the \gls{dsa} in \gls{qdp}, thereby demonstrating its potential to facilitate the exploitation of the properties offered by the various storage tiers in heterogeneous memory systems. \par

 %%% Local Variables:
 %%% TeX-master: "diplom"