6 Commits

  1. BIN
      thesis/bachelor.pdf
  2. 2
      thesis/content/02_abstract.tex
  3. 2
      thesis/content/10_introduction.tex
  4. 11
      thesis/content/20_state.tex
  5. 6
      thesis/content/40_design.tex
  6. 53
      thesis/content/50_implementation.tex
  7. 6
      thesis/content/60_evaluation.tex
  8. 2
      thesis/content/70_conclusion.tex
  9. 102
      thesis/images/image-source/design-classdiagram.xml
  10. BIN
      thesis/images/uml-cache-and-cachedata.pdf
  11. 51
      thesis/own.gls

BIN
thesis/bachelor.pdf

2
thesis/content/02_abstract.tex

@@ -8,7 +8,7 @@
% geben (für irgendetwas müssen die Betreuer ja auch noch da
% sein).
This bachelor's thesis explores data locality in heterogeneous memory systems, characterized by advancements in main memory technologies such as Non-Volatile RAM (NVRAM), High Bandwidth Memory (HBM), and Remote Memory. Systems equipped with more than one type of main memory necessitate strategic decisions regarding data placement to take advantage of the properties of the different storage tiers. In response to this challenge, Intel has introduced the Data Streaming Accelerator (DSA), which offloads data operations, offering a potential avenue for enhancing efficiency in data-intensive applications. The primary objective of this thesis is to provide a comprehensive analysis and characterization of the architecture and performance of the DSA, along with its application to a domain-specific prefetching methodology aimed at accelerating database queries within heterogeneous memory systems. We contribute a versatile library, capable of performing caching, data replication and prefetching asynchronously, accelerated by the \gls{dsa}.
This bachelor's thesis explores data locality in heterogeneous memory systems, characterized by advancements in main memory technologies such as Non-Volatile RAM (NVRAM) and High Bandwidth Memory (HBM). Systems equipped with more than one type of main memory or employing a Non-Uniform Memory Architecture (NUMA) necessitate strategic decisions regarding data placement to take advantage of the properties of the different storage tiers. In response to this challenge, Intel has introduced the Data Streaming Accelerator (DSA), which offloads data operations, offering a potential avenue for enhancing efficiency in data-intensive applications. The primary objective of this thesis is to provide a comprehensive analysis and characterization of the architecture and performance of the DSA, along with its application to a domain-specific prefetching methodology aimed at accelerating database queries within heterogeneous memory systems. We contribute a versatile library, capable of performing caching, data replication and prefetching asynchronously, accelerated by the \gls{dsa}.
%%% Local Variables:
%%% TeX-master: "diplom"

2
thesis/content/10_introduction.tex

@@ -12,7 +12,7 @@
% den Rest der Arbeit. Meist braucht man mindestens 4 Seiten dafür, mehr
% als 10 Seiten liest keiner.
The proliferation of various technologies, such as \gls{nvram}, \gls{hbm}, and \gls{remotemem}, has ushered in a diverse landscape of systems characterized by varying tiers of main memory. Within these systems, the movement of data across memory classes becomes imperative to leverage the distinct properties offered by the available technologies. The responsibility for maintaining optimal data placement falls upon the CPU, resulting in a reduction of available cycles for computational tasks. To mitigate this strain, certain current-generation Intel server processors feature the \glsentryfirst{dsa}, to which certain data operations may be offloaded \cite{intel:xeonbrief}. This thesis undertakes the challenge of optimizing data locality in heterogeneous memory architectures, utilizing the \gls{dsa}. \par
The proliferation of various technologies, such as \gls{nvram} and \gls{hbm}, has ushered in a diverse landscape of systems characterized by varying tiers of main memory. Extending traditional \glsentryfirst{numa}, these systems necessitate the movement of data across memory classes and locations to leverage the distinct properties offered by the available technologies. The responsibility for maintaining optimal data placement falls upon the CPU, resulting in a reduction of available cycles for computational tasks. To mitigate this strain, certain current-generation Intel server processors feature the \glsentryfirst{dsa}, to which certain data operations may be offloaded \cite{intel:xeonbrief}. This thesis undertakes the challenge of optimizing data locality in heterogeneous memory architectures, utilizing the \gls{dsa}. \par
The primary objectives of this thesis are twofold. Firstly, it involves a comprehensive analysis and characterization of the architecture of the Intel \gls{dsa}. Secondly, the focus extends to the application of \gls{dsa} in the domain-specific context of \glsentryfirst{qdp} to accelerate database queries \cite{dimes-prefetching}. \par

11
thesis/content/20_state.tex

@@ -72,14 +72,12 @@ Introduced with the \(4^{th}\) generation of Intel Xeon Scalable Processors, the
\label{fig:dsa-internal-block}
\end{figure}
The \gls{dsa} chip is directly integrated into the processor and attaches via the I/O fabric interface, serving as the conduit for all communication. Through this interface, the \gls{dsa} is accessible as a PCIe device. Consequently, configuration utilizes memory-mapped registers set in the devices \gls{bar}. In a system with multiple processing nodes, there may also be one \gls{dsa} per node, resulting in up to four DSA devices per socket in \(4^{th}\) generation Intel Xeon Processors \cite[Sec. 3.1.1]{intel:dsaguide}. To accommodate various use cases, the layout of the \gls{dsa} is software-defined. The structure comprises three components, which we will describe in detail. We also briefly explain how the \gls{dsa} resolves virtual addresses and signals operation completion. At last, we will detail operation execution ordering. \par
The \gls{dsa} chip is directly integrated into the processor and attaches via the I/O fabric interface, serving as the conduit for all communication. Through this interface, the \gls{dsa} is accessible and configurable as a PCIe device. In a system with multiple processing nodes, there may also be one \gls{dsa} per node, resulting in up to four DSA devices per socket in \(4^{th}\) generation Intel Xeon Processors \cite[Sec. 3.1.1]{intel:dsaguide}. To accommodate various use cases, the layout of the \gls{dsa} is software-defined. The structure comprises three components, which we will describe in detail. We also briefly explain how the \gls{dsa} resolves virtual addresses and signals operation completion. Finally, we will detail operation execution ordering. \par
\subsubsection{Architectural Components}
\label{subsec:state:dsa-arch-comp}
\textsc{Component \rom{1}, \glsentrylong{dsa:wq}:} \glsentryshort{dsa:wq}s provide the means to submit tasks to the device and will be described in more detail shortly. They are marked yellow in Figure \ref{fig:dsa-internal-block}. A \gls{dsa:wq} is accessible through so-called portals, light blue in Figure \ref{fig:dsa-internal-block}, which are mapped memory regions. Submission of work is done by writing a descriptor to one of these. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue. There are two possible queue types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes and each group may only have one attached. A \gls{pcie-dmr}, which guarantees implicit synchronization, is generated via \gls{x86:enqcmd} and communicates with the device before writing \cite[Sec. 3.3.1]{intel:dsaspec}. This may result in higher submission cost, compared to the \gls{dsa:dwq} to which a descriptor is submitted via \gls{x86:movdir64b} \cite[Sec. 3.3.2]{intel:dsaspec}. \par
\todo{potentially give less details on the instructions or reformulate this some other way}
\textsc{Component \rom{1}, \glsentrylong{dsa:wq}:} \glsentryshort{dsa:wq}s provide the means to submit tasks to the device and will be described in more detail shortly. They are marked yellow in Figure \ref{fig:dsa-internal-block}. A \gls{dsa:wq} is accessible through so-called portals, light blue in Figure \ref{fig:dsa-internal-block}, which are mapped memory regions. Submission of work is done by writing a descriptor to one of these. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue. There are two possible queue types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes and each group may only have one attached. The method used to achieve this guarantee may result in higher submission cost \cite[Sec. 3.3.1]{intel:dsaspec}, compared to the \gls{dsa:dwq} to which a descriptor is submitted via a regular write \cite[Sec. 3.3.2]{intel:dsaspec}. \par
\textsc{Component \rom{2}, Engine:} An Engine is the processing block that connects to memory and performs the described task. To handle the different descriptors, each Engine has two internal execution paths: one for a task and the other for a batch descriptor. Processing a task descriptor is straightforward, as all information required to complete the operation is contained within \cite[Sec. 3.2]{intel:dsaspec}. For a batch, the \gls{dsa} reads the batch descriptor, then fetches all task descriptors from memory and processes them \cite[Sec. 3.8]{intel:dsaspec}. If configured to do so, an Engine can coordinate with the operating system when it encounters a page fault, waiting on its resolution; otherwise, an error is generated in this scenario \cite[Sec. 2.2, Block on Fault]{intel:dsaspec}. \par
@@ -88,7 +86,7 @@ The \gls{dsa} chip is directly integrated into the processor and attaches via th
\subsubsection{Virtual Address Resolution}
\label{subsubsec:state:dsa-vaddr}
An important aspect of computer systems is the abstraction of physical memory addresses through virtual memory \cite{virtual-memory}. Therefore, the \gls{dsa} must handle address translation because a process submitting a task will not know the physical location in memory of its data, causing the descriptor to contain virtual addresses. To resolve these to physical addresses, the Engine communicates with the \gls{iommu} to perform this operation, as visible in the outward connections at the top of Figure \ref{fig:dsa-internal-block}. Knowledge about the submitting processes is required for this resolution. Therefore, each task descriptor has a field for the \gls{x86:pasid} which is filled by the \gls{x86:enqcmd} instruction for a \gls{dsa:swq} \cite[Sec. 3.3.1]{intel:dsaspec} or set statically after a process is attached to a \gls{dsa:dwq} \cite[Sec. 3.3.2]{intel:dsaspec}. \par
An important aspect of computer systems is the abstraction of physical memory addresses through virtual memory \cite{virtual-memory}. Therefore, the \gls{dsa} must handle address translation because a process submitting a task will not know the physical location in memory of its data, causing the descriptor to contain virtual addresses. To resolve these to physical addresses, the Engine communicates with the \gls{iommu} to perform this operation, as visible in the outward connections at the top of Figure \ref{fig:dsa-internal-block}. Knowledge about the submitting process is required for this resolution. Therefore, each task descriptor has a field for the \gls{x86:pasid}, which is filled by the instruction used for \gls{dsa:swq} submission \cite[Sec. 3.3.1]{intel:dsaspec} or set statically after a process is attached to a \gls{dsa:dwq} \cite[Sec. 3.3.2]{intel:dsaspec}. \par
\subsubsection{Completion Signalling}
\label{subsubsec:state:completion-signal}
@@ -112,11 +110,12 @@ Ordering of operations is only guaranteed for a configuration with one \gls{dsa:
Since the Linux Kernel version 5.10, a driver for the \gls{dsa} has been available, which currently lacks a counterpart on Windows Operating Systems \cite[Sec. Installation]{intel:dmldoc}. As a result, accessing the \gls{dsa} is only possible under Linux. To interact with the driver and perform configuration operations, Intel provides the accel-config user-space application \cite{intel:libaccel-config-repo}. This toolset offers a command-line interface and can read configuration files to configure the device, as mentioned in Section \ref{subsection:dsa-hwarch}. The interaction is illustrated in the upper block labelled \enquote{User space} in Figure \ref{fig:dsa-software-arch}, where it communicates with the kernel driver, depicted in light green and labelled \enquote{IDXD}. Once successfully configured, each \gls{dsa:wq} is exposed as a character device, through which the associated portal is accessible via \texttt{mmap} \cite[Sec. 3.3]{intel:analysis}. \par
While a process could theoretically submit work to the \gls{dsa} using either the \gls{x86:movdir64b} or \gls{x86:enqcmd} instructions, providing descriptors through manual configuration, this approach can be cumbersome. Hence, \gls{intel:dml} exists to streamline this process. Despite some limitations, such as the lack of support for \gls{dsa:dwq} submission, this library offers an interface that manages the creation and submission of descriptors, as well as error handling and reporting. The high-level abstraction offered, enables compatibility measures, allowing code developed for the \gls{dsa} to also execute on machines without the required hardware \cite[Sec. High-level C++ API, Advanced usage]{intel:dmldoc}. \par
While a process could theoretically submit work to the \gls{dsa} by manually preparing descriptors and submitting them via special instructions, this approach can be cumbersome. Hence, \gls{intel:dml} exists to streamline this process. Despite some limitations, such as the lack of support for \gls{dsa:dwq} submission, this library offers an interface that manages the creation and submission of descriptors, as well as error handling and reporting. The high-level abstraction offered enables compatibility measures, allowing code developed for the \gls{dsa} to also execute on machines without the required hardware \cite[Sec. High-level C++ API, Advanced usage]{intel:dmldoc}. \par
\section{Programming Interface for \glsentrylong{dsa}}
\label{sec:state:dml}
As mentioned in Section \ref{subsec:state:dsa-software-view}, \gls{intel:dml} offers a high-level interface for interacting with the hardware accelerator, specifically Intel \gls{dsa}. Opting for the C++ interface, we will now demonstrate its usage with the example of a simple memcpy implementation for the \gls{dsa}. \par
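For orientation, the following sketch shows what such an implementation can look like when written against the high-level C++ API documented in \cite{intel:dmldoc}; the version actually used in this work is the one presented in Figure \ref{fig:dml-memcpy}, and the wrapper function here is only illustrative.

\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <dml/dml.hpp>

// Submit an asynchronous copy to the DSA via DML's high-level C++ API.
// dml::hardware routes the operation to the accelerator, while
// dml::software would fall back to the CPU on machines without a DSA.
auto dsa_memcpy_async(uint8_t* dst, const uint8_t* src, size_t size) {
    return dml::submit<dml::hardware>(
        dml::mem_copy,
        dml::make_view(src, size),   // source region
        dml::make_view(dst, size));  // destination region
}

// Usage: the returned handler represents the task in flight.
//   auto handler = dsa_memcpy_async(dst, src, size);
//   auto result  = handler.get();  // blocks until completion
//   if (result.status != dml::status_code::ok) { /* handle failure */ }
\end{verbatim}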
\begin{figure}[!t]

6
thesis/content/40_design.tex

@@ -36,12 +36,12 @@ The interface of \texttt{Cache} must provide three basic functions: (1) requesti
Given that this work primarily focuses on caching static data, we only provide cache invalidation and not synchronization. The \texttt{Cache::Invalidate} function, given a memory address, will remove all entries for it from the cache. The other two operations, caching and access, are provided in one single function, which we shall henceforth call \texttt{Cache::Access}. This function receives a data pointer and size as parameters and takes care of either submitting a caching operation if the pointer received is not yet cached or returning the cache entry if it is. \par
Given the asynchronous nature of caching operations, users may opt to await their completion. This proves particularly beneficial when parallel threads are actively processing, and the current thread strategically pauses until its data becomes available in faster memory, thereby optimizing access speeds for local computations. To facilitate this process, the \texttt{Cache::Access} method returns an instance of an object referred to as \texttt{CacheData}. Figure \ref{fig:impl-design-interface} documents the public interface for \texttt{CacheData} on the left block labelled as such. Invoking \texttt{CacheData::GetDataLocation} provides access to a pointer to the location of the cached data. Additionally, the \texttt{CacheData::WaitOnCompletion} method is available, designed to return only upon the completion of the caching operation. During this period, the current thread will sleep, allowing unimpeded progress for other threads. To ensure that only pointers to valid memory regions are returned, this function must be called in order to update the cache pointer which otherwise has an undefined value. It queries the completion state of the operation, and, on success, updates the cache pointer to the then available memory region. \par
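To make the interaction concrete, consider the following usage sketch; it assumes an initialized \texttt{Cache} instance named \texttt{cache} and otherwise follows the interface shown in Figure \ref{fig:impl-design-interface}.

\begin{verbatim}
// data points to a region of size bytes residing in a slow memory tier.
CacheData entry = cache.Access(data, size);  // first access submits the copy

// ... perform unrelated work while the copy proceeds asynchronously ...

entry.WaitOnCompletion();                    // sleeps until the operation completed
uint8_t* cached = entry.GetDataLocation();   // now points to a valid memory region
\end{verbatim}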
\subsection{Policy Functions}
\label{subsec:design:policy-functions}
In the introduction of this chapter, we mentioned placing cache placement and the selection of copy-participating \gls{dsa}s in the user's responsibility. As we will find out in Section \ref{sec:impl:application}, allocating memory inside the cache is not feasible due to possible delays encountered. Therefore, the user is also required to provide functionality for dynamic memory management to the \texttt{Cache}. The former is realized by what we will call \enquote{Policy Functions}, which are function pointers passed on initialization, as visible in \texttt{Cache::Init} in Figure \ref{fig:impl-design-interface}. We use the same methodology for the latter, requiring function pointers performing dynamic memory management to be passed. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed for the implementation in Chapter \ref{chap:implementation}, while we detail their required behaviour here. \par
The policy functions receive parameters deemed sensible for determining placement and participation selection. Both are informed of the source \glsentryshort{node}, the \glsentryshort{node} requesting caching, and the data size. The cache placement policy then returns a \glsentryshort{node}-ID on which the data is to be cached, while the copy policy will provide the cache with a list of \glsentryshort{node}-IDs, detailing which \gls{dsa}s should participate in the operation. \par
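A sketch of plausible shapes for these types is given below; the typedef names \texttt{CachePolicy}, \texttt{CopyPolicy}, \texttt{MemAlloc} and \texttt{MemFree} appear in \texttt{Cache::Init} in Figure \ref{fig:impl-design-interface}, while the parameter order and the example policy are assumptions for illustration.

\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <vector>

// Placement: returns the node-ID on which the data is to be cached.
using CachePolicy = int(int src_node, int requesting_node, size_t size);

// Participation: returns the node-IDs whose DSAs perform the copy.
using CopyPolicy = std::vector<int>(int src_node, int dst_node, size_t size);

// Dynamic memory management, also provided by the user.
using MemAlloc = uint8_t*(size_t size, int node);
using MemFree = void(uint8_t* ptr, size_t size, int node);

// Example placement policy matching the evaluation setup of this work,
// where the HBM accessor of node n is reachable as node n + 8.
int HbmPlacementPolicy(int, int requesting_node, size_t) {
    return requesting_node + 8;
}
\end{verbatim}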
@@ -50,7 +50,7 @@ For memory management, two functions are required, providing allocation (\texttt
\subsection{Cache Entry Reuse}
\label{subsec:design:cache-entry-reuse}
When multiple consumers wish to access the same memory block through the \texttt{Cache}, we face a choice between providing each with their own entry or sharing one for all consumers. The first option may lead to high load on the accelerator due to multiple copy operations being submitted and also increases the memory footprint of the cache. The latter option, although more complex, was chosen to address these concerns. To implement this, the existing \texttt{CacheData} will be extended in scope to handle multiple consumers. Copies of it can be created, and they must synchronize with each other for \texttt{CacheData::WaitOnCompletion} and \texttt{CacheData::GetDataLocation}. This is illustrated by the green markings, indicating thread safety guarantees for access, in Figure \ref{fig:impl-design-interface}. The \texttt{Cache} must therefore also ensure that, on concurrent access to the same resource, only one Thread creates \texttt{CacheData} while the others are provided with a copy. \par
When multiple consumers wish to access the same memory block through the \texttt{Cache}, we face a choice between providing each with their own entry or sharing one for all consumers. The first option may lead to high load on the accelerator due to multiple copy operations being submitted and also increases the memory footprint of the cache. The latter option, although more complex, was chosen to address these concerns. To implement this, the existing \texttt{CacheData} will be extended in scope to handle multiple consumers. Copies of it can be created, and they must synchronize with each other for \texttt{CacheData::WaitOnCompletion} and \texttt{CacheData::GetDataLocation}. This is illustrated by the green markings, indicating thread safety guarantees for access, in Figure \ref{fig:impl-design-interface}. The \texttt{Cache} must therefore also ensure that, on concurrent access to the same resource, only one thread creates \texttt{CacheData} while the others are provided with a copy. \par
\subsection{Cache Entry Lifetime}
\label{subsec:design:cache-entry-lifetime}

53
thesis/content/50_implementation.tex

@@ -28,30 +28,27 @@ The usage of locking and atomics to achieve safe concurrent access has proven to
Throughout the following sections we will use the term \enquote{handler}, which was coined by \gls{intel:dml}, referring to an object associated with an operation on the accelerator. Through it, the state of a task may be queried, making the handler our connection to the asynchronously executed task. Use of a handler is also displayed in the \texttt{memcpy}-function for the \gls{dsa}, as shown in Figure \ref{fig:dml-memcpy}. As we may split one single copy into multiple distinct tasks for submission to multiple \gls{dsa}s, \texttt{CacheData} internally contains a vector of such handlers. \par
\subsection{Cache: Locking for Access to State} \label{subsec:implementation:cache-state-lock}
\subsection{Cache: Locking for Access to State}
\label{subsec:impl:cache-state-lock}
To keep track of the current cache state, the \texttt{Cache} will hold a reference to each currently existing \texttt{CacheData} instance. The reason for this is twofold: in Section \ref{sec:design:interface} we decided to keep elements in the cache until forced by \gls{mempress} to remove them, and in Section \ref{subsec:design:cache-entry-reuse} we decided to reuse one cache entry for multiple consumers. The second part requires access to the structure holding this reference to be thread-safe when accessing and modifying the cache state in \texttt{Cache::Access}, \texttt{Cache::Flush} and \texttt{Cache::Clear}. The latter two both require unique locking, preventing other calls to \texttt{Cache} from making progress while the operation is being processed. For \texttt{Cache::Access}, the use of locking depends upon the cache's state. At first, only a shared lock is acquired for checking whether the given address already resides in cache, allowing other \texttt{Cache::Access}-operations to also perform this check. If no entry for the region is present, a unique lock is required as well when adding the newly created entry to cache. \par
A map-datastructure was chosen to represent the current cache state with the key being the memory address of the entry and as value the \texttt{CacheData} instance. As the caching policy is controlled by the user, one datum may be requested for caching in multiple locations. To accommodate this, one map is allocated for each available \glsentrylong{numa:node} of the system. This can be exploited to reduce lock contention by separately locking each \gls{numa:node}'s state instead of utilizing a global lock. This ensures that \texttt{Cache::Access} and the implicit \texttt{Cache::Flush} it may cause can not hinder progress of caching operations on other \gls{numa:node}s. Both \texttt{Cache::Clear} and a complete \texttt{Cache::Flush} as callable by the user will now iteratively perform their respective task per \gls{numa:node} state, also allowing other \gls{numa:node} to progress.\par
\todo{this is somewhat obscurely worded}
A map was chosen as the data structure to represent the current cache state, with the memory address of the entry as key and the \texttt{CacheData} instance as value. As the caching policy is controlled by the user, one datum may be requested for caching in multiple locations. To accommodate this, one map is allocated for each available \glsentrylong{numa:node} of the system. This can be exploited to reduce lock contention by separately locking each \gls{numa:node}'s state instead of utilizing a global lock, ensuring that \texttt{Cache::Access} cannot hinder the progress of caching operations on other \gls{numa:node}s while also spreading the load over multiple locks. \par
Even with this optimization, in scenarios where the \texttt{Cache} is frequently tasked with flushing and re-caching by multiple threads from the same node, lock contention will negatively impact performance by delaying cache access. Due to passive waiting, this impact might be less noticeable when other threads on the system are able to make progress during the wait. \par
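The sketch below outlines this layout; the structure and member names are assumptions, but the locking discipline follows the description above.

\begin{verbatim}
#include <cstdint>
#include <memory>
#include <optional>
#include <shared_mutex>
#include <unordered_map>
#include <vector>

struct CacheData { /* reference-counted handle, detailed below */ };

struct NodeCacheState {
    std::shared_mutex mutex;  // shared for lookups, unique for modification
    std::unordered_map<uint8_t*, CacheData> entries;  // key: source address
};

// One state per NUMA node; filled once during Cache::Init and only read
// afterwards, so the vector itself requires no synchronization.
std::vector<std::unique_ptr<NodeCacheState>> state_;

// Lookup path of Cache::Access: a shared lock suffices for the existence
// check, so concurrent lookups on one node's state do not block each other.
std::optional<CacheData> Find(int node, uint8_t* addr) {
    std::shared_lock lock(state_[node]->mutex);
    auto it = state_[node]->entries.find(addr);
    if (it == state_[node]->entries.end()) return std::nullopt;
    return it->second;  // copying yields a new reference (see below)
}

// Adding a newly created entry instead takes the unique lock and re-checks,
// ensuring only one thread creates the CacheData for a given address.
\end{verbatim}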
\subsection{CacheData: Fair Threadsafe Implementation for WaitOnCompletion}
The choice made in \ref{subsec:design:cache-entry-reuse} necessitates thread-safe shared access to the same resource. The C++ standard library provides \texttt{std::shared\_ptr<T>}, a reference-counted pointer that is thread-safe for the required operations \cite{cppreference:shared-ptr}, making it a suitable candidate for this task. Although an implementation using it was explored, it presented its own set of challenges. \par
As we aim to minimize the time spent in a locked region, only the task is added to the \gls{numa:node}'s cache state when locked, with the submission taking place outside the locked region. We assume the handlers of \gls{intel:dml} to be unsafe for access from multiple threads. To achieve the safety for \texttt{CacheData::WaitOnCompletion}, outlined in \ref{subsec:design:cache-entry-reuse}, threads need to coordinate which one performs the actual waiting. To avoid queuing multiple copies of the same task, the task must be added to the cache state before submission. These two aspects necessitate modifying the handlers atomically. We therefore use an atomic pointer in \texttt{CacheData} to allow safe exchange and waiting on modification. \par
\subsection{CacheData: Shared Reference}
\label{subsec:impl:cachedata-ref}
\todo{this paragraph really does not flow from the previous}
The choice made in \ref{subsec:design:cache-entry-reuse} necessitates thread-safe shared access to the same resource. The C++ standard library provides \texttt{std::shared\_ptr<T>}, a reference-counted pointer that is thread-safe for the required operations \cite{cppreference:shared-ptr}, making it a suitable candidate for this task. Although an implementation using it was explored, it presented its own set of challenges. Using \texttt{std::shared\_ptr<T>} additionally introduces uncertainty, relying on the implementation to be performant. The standard does not specify whether a lock-free algorithm is to be used, and \cite{shared-ptr-perf} suggests abysmal performance for some implementations. Therefore, the decision was made to implement atomic reference counting for \texttt{CacheData}. This involves providing a custom copy constructor and destructor wherein a shared atomic integer is incremented or decremented using atomic fetch-add and fetch-sub operations. If the decrement reduces the count to zero, the destructor was called on the last reference, which then performs the actual destruction. \par
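A condensed sketch of this scheme is shown below; the member names are assumptions, while the constructor signature matches Figure \ref{fig:impl-design-interface}.

\begin{verbatim}
#include <atomic>
#include <cstddef>
#include <cstdint>

class CacheData {
    std::atomic<int32_t>* active_;  // reference count shared by all copies
    uint8_t* src_;                  // source address, shared and immutable

public:
    CacheData(uint8_t* data, size_t /*size*/)
        : active_(new std::atomic<int32_t>(1)), src_(data) {}

    CacheData(const CacheData& other)
        : active_(other.active_), src_(other.src_) {
        active_->fetch_add(1, std::memory_order_relaxed);
    }

    ~CacheData() {
        // fetch_sub returns the previous value: a result of 1 means this
        // was the last reference, which performs the actual destruction.
        if (active_->fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete active_;
            // ... release the cached memory region and shared state ...
        }
    }
};
\end{verbatim}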
Using \texttt{std::shared\_ptr<T>} also introduces uncertainty, relying on the implementation to be performant. The standard does not specify whether a lock-free algorithm is to be used, and \cite{shared-ptr-perf} suggests abysmal performance for some implementations, although the full article is in Korean. No further research was found on this topic. \par
\subsection{CacheData: Fair and Threadsafe WaitOnCompletion}
\todo{use of "also" is unfitting as there is no previous antipoint to sharedptr}
As Section \ref{subsec:design:cache-entry-reuse} details, we intend to share a cache entry between multiple threads, necessitating synchronization between multiple instances of \texttt{CacheData} for waiting on operation completion and determining the cache location. While the latter is easily achieved by making the cache pointer a shared atomic value, the former proved to be challenging. Therefore, we will iteratively develop our implementation for \texttt{CacheData::WaitOnCompletion} in this Section. We present the challenges encountered and describe how these were solved to achieve the fairness and thread safety we desire. \par
Therefore, the decision was made to implement atomic reference counting for \texttt{CacheData}. This involves providing a custom constructor and destructor wherein a shared atomic integer is either incremented or decremented using atomic fetch sub and add operations to modify the reference count. In the case of a decrease to zero, the destructor was called for the last reference and then performs the actual destruction. \par
We assume the handlers of \gls{intel:dml} to be unsafe for access from multiple threads, as no guarantees were found in the documentation. To achieve the safety for \texttt{CacheData::WaitOnCompletion}, outlined in Section \ref{subsec:design:cache-entry-reuse}, threads need to coordinate on a master thread which performs the actual waiting, while the others wait on the master. \par
Upon call to \texttt{Cache::Access}, coordination must again take place to only add one instance of \texttt{CacheData} to the cache state and, most importantly, submit to the \gls{dsa} only once. This behaviour was chosen in Section \ref{subsec:design:cache-entry-reuse} to reduce the load placed on the accelerator by preventing duplicate submission. To solve this, \texttt{Cache::Access} will add the instance to the cache state under unique lock (see Section \ref{subsec:impl:cache-state-lock} above), and only then, when the current thread is guaranteed to be the holder of the unique instance, will submission take place. Thereby, the cache state may contain \texttt{CacheData} without a valid cache pointer and without handlers available to wait on. As the following paragraph and sections demonstrate, this resulted in implementation challenges. \par
\begin{figure}[!t]
\centering
@@ -60,11 +57,9 @@ Therefore, the decision was made to implement atomic reference counting for \tex
\label{fig:impl-cachedata-threadseq-waitoncompletion}
\end{figure}
Due to the possibility of access by multiple threads, the implementation of \texttt{CacheData::WaitOnCompletion} proved to be challenging. In the first implementation, a thread would check if the handlers are available and atomically wait on a value change from \texttt{nullptr}, if they are not. As the handlers are only available after submission, a situation could arise where only one copy of \texttt{CacheData} is capable of actually waiting on them. \par
In the first implementation, a thread would check if the handlers are available and atomically wait on a value change from \texttt{nullptr} if they are not. As the handlers only become available after submission, a situation could arise where only one copy of \texttt{CacheData} is capable of actually waiting on them. To illustrate this, an exemplary scenario is used, as seen in the sequence diagram in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}. Assume that three threads \(T_1\), \(T_2\) and \(T_3\) wish to access the same resource. \(T_1\) is the first to call \texttt{Cache::Access}, therefore adding the entry to the cache state, and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted, and \(T_2\) and \(T_3\) obtain access to the incomplete \texttt{CacheData}, seeing \texttt{nullptr} for the handlers and an invalid cache pointer, which leads them to wait atomically on the cache pointer (blue lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). \(T_1\) submits the work and sets the handlers (red lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) and \(T_3\) continue to wait. Therefore, only \(T_1\) can perform the actual wait on the handlers, making it capable of keeping \(T_2\) and \(T_3\) from progressing. This is undesirable, as it can lead to deadlock if for some reason \(T_1\) never waits, and at the very least causes unnecessary delay for \(T_2\) and \(T_3\) if \(T_1\) does not wait immediately. \par
To illustrate this, an exemplary scenario is used, as seen in the sequence diagram Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}. Assume that three threads \(T_1\), \(T_2\) and \(T_3\) wish to access the same resource. \(T_1\) is the first to call \texttt{CacheData::Access} and therefore adds it to the cache state and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted and \(T_2\) and \(T_3\) obtain access to the incomplete \texttt{CacheData} on which they wait, causing them to see a \texttt{nullptr} for the handlers but invalid cache pointer, leading to atomic wait on the cache pointer (marked blue lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). \(T_1\) submits the work and sets the handlers (marked red lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) and \(T_3\) continue to wait. Therefore, only \(T_1\) can trigger the waiting and is therefore capable of keeping \(T_2\) and \(T_3\) from progressing. This is undesirable as it can lead to deadlocking if by some reason \(T_1\) does not wait and at the very least may lead to unnecessary delay for \(T_2\) and \(T_3\) if \(T_1\) does not wait immediately. \par
\todo{both paragraphs above are complicated to read}
\todo{paragraph above complicated to read}
\begin{figure}[!t]
\centering
@@ -78,11 +73,12 @@ As a solution for this, a more intricate implementation is required. When waitin
\todo{complicated formulation too, write it with more references to the pseudocode}
\subsection{CacheData: Edge Cases and Deadlocks}
\label{subsec:impl:cachedata-deadlocks}
With the outlines of a fair implementation of \texttt{CacheData::WaitOnCompletion} drawn, we will now move our focus to the safety of \texttt{CacheData}. Specifically, the following sections discuss possible deadlocks and their resolution. \par
\subsubsection{Initial Invalid State}
\label{subsubsec:impl:cdatomicity:initial-invalid-state}
\label{subsubsec:impl:cachedata:initial-invalid-state}
We previously mentioned the possibly problematic situation where both the cache pointer and the handlers are not yet available for an instance of \texttt{CacheData}. The implementation avoids this situation explicitly by waiting on the handlers, which are atomically updated from \texttt{nullptr} to a valid pointer. As the handlers will be set by the thread that first called \texttt{Cache::Access}, progress is guaranteed. \par
@@ -96,22 +92,13 @@ To circumvent this deadlock, the initial state of \texttt{CacheData} was modifie
\subsubsection{Invalid State on Operation Failure}
\texttt{CacheData::WaitOnCompletion} first checks for a valid cache pointer and then waits on the handlers becoming valid. To process the handlers, the global atomic pointer is read into a local copy and then set to \texttt{nullptr} using \texttt{std::atomic<T>::exchange}. During evaluation of the handlers completion states, an unsuccessful operation may be found. In this case, the cache memory region remains invalid and may therefore not be used. In this case, both the handlers and the cache pointer will be \texttt{nullptr}. This results in an invalid state, like the one discussed in Section \ref{subsubsec:impl:cdatomicity:initial-invalid-state}. \par
\todo{reference the figure above in here}
\todo{double "in this case"}
In this invalid state, progress is not guaranteed by the measures set forth to handle the initial invalidity. The cache is still \texttt{nullptr} and as the handlers have already been set and processed, they will also be \texttt{nullptr} without the chance of them ever becoming valid. As a solution, edge case handling is introduced and the cache pointer is set to the source address, providing validity. \par
\todo{consider elaborating on the used edge case handling, we set "delete" to false basically}
\texttt{CacheData::WaitOnCompletion} first checks for a valid cache pointer and then waits on the handlers becoming valid. To process the handlers, the global atomic pointer is read into a local copy and then set to \texttt{nullptr} using \texttt{std::atomic<T>::exchange}. During evaluation of the handlers' completion states, an unsuccessful operation may be encountered, in which case the cached memory region remains invalid and may not be used, leaving both the handlers and the cache pointer as \texttt{nullptr}. This results in an invalid state, like the one discussed in Section \ref{subsubsec:impl:cachedata:initial-invalid-state}, in which progress is not guaranteed by the measures set forth to handle the initial invalidity: the cache pointer is still \texttt{nullptr}, and, as the handlers have already been exchanged and processed, they will never become valid again. The employed solution can be seen in Figure \ref{fig:impl-cachedata-waitoncompletion}: after processing the handlers, we check their results for errors and, upon encountering one, set the cache pointer to the data source. Other threads may have accumulated waiting for the cache to become valid, as the handlers were already invalidated before; by setting the cache pointer to a valid value, these threads are released, and any subsequent calls to \texttt{CacheData::WaitOnCompletion} will return immediately. Consequently, the \texttt{Cache} does not guarantee that data will actually be cached; however, as Chapter \ref{chap:design} makes no such promise, we consider this edge case handling acceptable. \par
\subsubsection{Locally Invalid State due to Race Condition}
The guarantee of \texttt{std::atomic<T>::wait} to only wake up when the value has changed \cite{cppreference:atomic-wait} was found to be stronger than the promise of waking up all waiting threads with \texttt{std::atomic<T>::notify\_all} \cite{cppreference:atomic-notify-all}. \par
As visible in Figure \ref{fig:impl-cachedata-waitoncompletion}, we wait while the handlers-pointer is \texttt{nullptr}, if the cache pointer is invalid. To exemplify we use the following scenario. Both \(T_1\) and \(T_2\) call \texttt{CacheData::WaitOnCompletion}, with \(T_1\) preceding \(T_2\). \(T_1\) exchanges the global handlers-pointer with \texttt{nullptr}, invalidating it. Before \(T_1\) can check the status of the handlers and update the cache pointer, \(T_2\) sees an invalid cache pointer and then waits for the handlers becoming available. \par
\todo{wording for subsection is complicated}
\todo{reference the figure better here, wording is quite complicated}
The guarantee of \texttt{std::atomic<T>::wait} to only wake up when the value has changed \cite{cppreference:atomic-wait} was found to be stronger than the promise of waking up all waiting threads with \texttt{std::atomic<T>::notify\_all} \cite{cppreference:atomic-notify-all}. As visible in Figure \ref{fig:impl-cachedata-waitoncompletion}, if the cache pointer is invalid, we wait while the handlers-pointer is \texttt{nullptr}. To exemplify, consider the following scenario: both \(T_1\) and \(T_2\) call \texttt{CacheData::WaitOnCompletion}, with \(T_1\) preceding \(T_2\). \(T_1\) exchanges the global handlers-pointer with \texttt{nullptr}, invalidating it. Before \(T_1\) can check the status of the handlers and update the cache pointer, \(T_2\) sees an invalid cache pointer and begins waiting for the handlers to become available. \par
This again causes a state of invalidity similar to those handled in the previous two sections. As the handlers will not become available again, having been cleared by \(T_1\), the second consumer, \(T_2\), will wait indefinitely. Such a missed update is commonly referred to as the \enquote{ABA problem}, for which multiple solutions exist. \par
@@ -119,16 +106,12 @@ One could use double-width atomic operations and introduce a counter which would
The chosen solution is to exchange the handlers-pointer not with \texttt{nullptr} but with a secondary invalid value. For this purpose, the \texttt{Cache} holds an empty handler vector as a new attribute, of the same type as the one pointed to by the handlers-pointer, and shares a pointer to it with each instance of \texttt{CacheData}, where it serves as a safe invalid value in \texttt{CacheData::WaitOnCompletion}. \par
This secondary value allows \(T_2\) to pass the wait and then perform the exchange of the handlers itself. \(T_2\) then checks the local copy of the handlers-pointer for validity, where the invalid state now includes both \texttt{nullptr} and the secondary invalid pointer. With this, the deadlock is avoided, and \(T_2\) will wait for \(T_1\) to complete the processing of the handlers. \par
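Bringing these pieces together, one possible shape of the final wait loop is sketched below. The member names, the \texttt{Handlers} alias and parts of the control flow are assumptions; the authoritative pseudocode is given in Figure \ref{fig:impl-cachedata-waitoncompletion}.

\begin{verbatim}
#include <atomic>
#include <cstdint>
#include <memory>
#include <vector>
#include <dml/dml.hpp>

using Handlers = std::vector<
    dml::handler<dml::mem_copy_operation, std::allocator<uint8_t>>>;

class CacheData {
    uint8_t* src_;                      // data source, always valid
    std::atomic<uint8_t*>* cache_;      // shared among all copies
    std::atomic<Handlers*>* handlers_;  // shared among all copies
    uint8_t* incomplete_cache_;         // destination chosen at submission
    Handlers* invalid_handlers_;        // empty vector owned by the Cache

public:
    void WaitOnCompletion();
};

void CacheData::WaitOnCompletion() {
    if (cache_->load() != nullptr) return;  // fast path: already completed

    // Block while the handlers-pointer is still nullptr; the submitting
    // thread publishes it via store followed by notify_all.
    handlers_->wait(nullptr);

    // Claim the handlers, exchanging with the secondary invalid value
    // instead of nullptr to avoid the ABA problem described above.
    Handlers* local = handlers_->exchange(invalid_handlers_);

    if (local == nullptr || local == invalid_handlers_) {
        // Another thread won the exchange; wait for it to publish.
        cache_->wait(nullptr);
        return;
    }

    // This thread performs the actual waiting on the DSA.
    uint8_t* result = incomplete_cache_;
    for (auto& handler : *local) {
        if (handler.get().status != dml::status_code::ok) {
            result = src_;  // edge case: failed operation, use the source
        }
    }
    delete local;

    cache_->store(result);
    cache_->notify_all();  // release all threads waiting on the cache pointer
}
\end{verbatim}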
\section{Application to \glsentrylong{qdp}}
\label{sec:impl:application}
Applying the \texttt{Cache} to \gls{qdp} is a straightforward process. We adapted the benchmarking code developed by Anna Bartuschka and André Berthold \cite{dimes-prefetching}, invoking Cache::Access for both prefetching and cache access. Due to the high amount of smaller submissions, we decided to forego splitting of tasks unto multiple \gls{dsa} and instead distribute the copy tasks per thread in round-robin fashion. This causes less delay due to submission cost which, as shown in Section \ref{subsec:perf:submitmethod}, rises with smaller tasks. The cache location is fixed to \gls{numa:node} 8, the \gls{hbm} accessor of \gls{numa:node} 0 to which the application will be bound and therefore exclusively run on. \par
\todo{not instead here but due to analysis}
Applying the \texttt{Cache} to \gls{qdp} is a straightforward process. We adapted the benchmarking code developed by Anna Bartuschka and André Berthold \cite{dimes-prefetching}, invoking \texttt{Cache::Access} for both prefetching and cache access. Due to the high amount of smaller submissions, we decided to forgo splitting tasks across multiple \gls{dsa}s and instead distribute the copy tasks per thread in round-robin fashion. This causes less delay from submission cost which, as shown in Section \ref{subsec:perf:submitmethod}, rises with smaller tasks. The cache location is fixed to \gls{numa:node} 8, the \gls{hbm} accessor of \gls{numa:node} 0, to which the application will be bound and therefore exclusively run on. \par
During the performance analysis of the developed \texttt{Cache}, we discovered that \gls{intel:dml} does not utilize interrupt-based completion signalling (Section \ref{subsubsec:state:completion-signal}), but instead employs busy-waiting on the completion descriptor being updated. Given that this busy waiting incurs CPU cycles, waiting on task completion is deemed impractical, necessitating code modifications. We extended \texttt{CacheData} and \texttt{Cache} to incorporate support for weak waiting. By introducing a flag configurable in \texttt{Cache}, all instances of \texttt{CacheData} created via \texttt{Cache::Access} will check only once whether the \gls{dsa} has completed processing the \texttt{Cache} operation, and if not, return without updating the cache-pointer. Consequently, calls to \texttt{CacheData::GetDataLocation} may return \texttt{nullptr} even after waiting, placing the responsibility on the user to access the data through its source location. For applications prioritizing latency, \texttt{Cache::Access} offers the option for weak access. When activated, the function returns only existing instances of \texttt{CacheData}, thereby avoiding work submission to the \gls{dsa} if the address has not been previously cached or was flushed since the last access. Using these two options, we can avoid work submission and busy waiting where access latency is paramount. \par

6
thesis/content/60_evaluation.tex

@@ -36,7 +36,7 @@ We benchmarked two methods to establish a baseline and an upper limit as referen
\begin{table}[!t]
\centering
\input{tables/table-qdp-baseline.tex}
\caption{Table showing raw timing for \gls{qdp} on \glsentryshort{dram} and \gls{hbm}. Result for \glsentryshort{dram} serves as baseline while \glsentryshort{hbm} presents the upper boundary achievable with perfect prefetching. Raw Time is averaged over 5 iterations with previous warm up.}
\caption{Table showing raw timing for \gls{qdp} on \glsentryshort{dram} and \gls{hbm}. Result for \glsentryshort{dram} serves as baseline while \glsentryshort{hbm} presents the upper boundary achievable with perfect prefetching.}
\label{table:qdp-baseline}
\end{table}
@@ -72,7 +72,7 @@ To address the challenges posed by sharing memory bandwidth between both \(SCAN\
\begin{table}[!t]
\centering
\input{tables/table-qdp-speedup.tex}
\caption{Table showing Speedup for different \glsentryshort{qdp} Configurations over \glsentryshort{dram}. Result for \glsentryshort{dram} serves as baseline while \glsentryshort{hbm} presents the upper boundary achievable with perfect prefetching. Prefetching was performed with the same parameters and data locations as \gls{dram}, caching on Node 8 (\glsentryshort{hbm} accessor for the executing Node 0). Prefetching with Distributed Columns had columns \texttt{a} and \texttt{b} located on different Nodes. Raw Time is averaged over 5 iterations with previous warm up.}
\caption{Table showing Speedup for different \glsentryshort{qdp} Configurations over \glsentryshort{dram}. Result for \glsentryshort{dram} serves as baseline while \glsentryshort{hbm} presents the upper boundary achievable with perfect prefetching. Prefetching was performed with the same parameters and data locations as \gls{dram}, caching on Node 8 (\glsentryshort{hbm} accessor for the executing Node 0). Prefetching with Distributed Columns had columns \texttt{a} and \texttt{b} located on different Nodes.}
\label{table:qdp-speedup}
\end{table}
@@ -107,7 +107,7 @@ In Section \ref{sec:eval:expectations}, we anticipated that the simple query wou
The necessity to distribute data across \gls{numa:node}s is seen as practical, given that developers commonly apply this optimization to leverage the available memory bandwidth of \glsentrylong{numa}s. Consequently, the \texttt{Cache} has demonstrated its effectiveness by achieving a respectable speed-up positioned directly between the baseline and the theoretical upper limit (see Table \ref{table:qdp-speedup}). \par
As stated in Section \ref{sec:design:cache}, the decision to design and implement a cache instead of focusing solely on prefetching was made to enhance the usefulness of this work's contribution. While our tests were conducted on a system with \gls{hbm}, other advancements in main memory technologies, such as \gls{nvram} or \gls{remotemem}, as mentioned in Chapter \ref{chap:intro}, were not considered. Despite the public functions of the \texttt{Cache} being named with cache usage in mind, its utility extends beyond this scope, providing flexibility through the policy functions, described in Section \ref{sec:design:accel-usage}. Potential applications include background copying of data from remote locations into faster local memory for computation or replication to \gls{nvram} for data loss prevention. Therefore, we consider the increase in design complexity to be a worthwhile trade-off, providing a significant contribution to the field of heterogeneous memory systems. \par
As stated in Section \ref{sec:design:cache}, the decision to design and implement a cache instead of focusing solely on prefetching was made to enhance the usefulness of this work's contribution. While our tests were conducted on a system with \gls{hbm}, other advancements in main memory technologies, such as \gls{nvram}, were not considered. Despite the public functions of the \texttt{Cache} being named with cache usage in mind, its utility extends beyond this scope, providing flexibility through the policy functions, described in Section \ref{sec:design:accel-usage}. Potential applications include replication to \gls{nvram} for data loss prevention. Therefore, we consider the increase in design complexity to be a worthwhile trade-off, providing a significant contribution to the field of heterogeneous memory systems. \par
%%% Local Variables:

2
thesis/content/70_conclusion.tex

@@ -24,7 +24,7 @@ In Section \ref{sec:eval:observations}, we observed adverse effects when prefetc
As highlighted in Sections \ref{sec:state:dml} and \ref{sec:impl:application}, the \gls{api} utilized to interact with the \gls{dsa} currently lacks support for interrupt-based completion waiting and the use of \glsentrylong{dsa:dwq}. Future development efforts may focus on direct \gls{dsa} access, bypassing the \glsentrylong{intel:dml}, to leverage the complete feature set. Particularly, interrupt-based waiting would significantly enhance the usability of the \texttt{Cache}, which currently only supports busy-waiting\todo{mention that busy waiting goes against the rationale of offloading data copy, only usefull to reduce power consumption for sync-copy, cite dsaanalysis for this}. \todo{we could also mention dwq to reduce overhead of cache and therefore time spent in scanb} \par
Although the preceding paragraphs and the results in Chapter \ref{chap:evaluation} might suggest that the \texttt{Cache} requires extensive refinement for production applications, we argue the opposite. Under favourable conditions, as assumed for \glsentryshort{numa}-aware applications, we observed significant speed-up using the \texttt{Cache} for prefetching to \glsentrylong{hbm}, accelerating database queries. Its utility is not limited to prefetching alone; it offers a solution for handling transfers from \gls{remotemem} to faster local storage or replicating data to \gls{nvram}. Further evaluation of the caches effectiveness in these scenarios and additional benchmarks on more complex queries for \gls{qdp} may provide deeper insights into its applicability. A performance comparison between prefetching to \gls{hbm} using knowledge of the coming queries and the data they access, and \enquote{HBM Cache Mode} (see Section \ref{sec:state:hbm}) could also yield interesting insights. \par
Although the preceding paragraphs and the results in Chapter \ref{chap:evaluation} might suggest that the \texttt{Cache} requires extensive refinement for production applications, we argue the opposite. Under favourable conditions, as assumed for \glsentryshort{numa}-aware applications, we observed significant speed-up using the \texttt{Cache} for prefetching to \glsentrylong{hbm}, accelerating database queries. Its utility is not limited to prefetching alone; it offers a solution for replicating data to \gls{nvram} and might prove applicable to different use cases. Additional benchmarks on more complex queries for \gls{qdp} and a comparison between prefetching to \gls{hbm} using knowledge of the coming queries and the data they access, and \enquote{HBM Cache Mode} (see Section \ref{sec:state:hbm}) could yield deeper insights into the cache's performance. \par
In conclusion, the \texttt{Cache}, together with the Sections on the \gls{dsa}'s architecture (Section \ref{sec:state:dsa}) and performance characteristics (Section \ref{sec:perf:bench}), fulfils the stated goal of this work. We have achieved performance gains through the \gls{dsa} in \gls{qdp}, thereby demonstrating its potential to facilitate the exploitation of the properties offered by the various storage tiers in heterogeneous memory systems. \par

102
thesis/images/image-source/design-classdiagram.xml

@@ -1,80 +1,86 @@
<mxfile host="app.diagrams.net" modified="2024-02-15T12:40:37.148Z" agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36" etag="bDsGiENRX6gaS5lY3tf3" version="22.1.21" type="device">
<mxfile host="app.diagrams.net" modified="2024-02-15T18:57:06.815Z" agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36" etag="3ZOmdd_hiFPwvof3FZmx" version="23.1.4" type="device">
<diagram name="Page-1" id="xBjmK5o3fU9FCVY3KYoo">
<mxGraphModel dx="276" dy="154" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="827" pageHeight="369" math="0" shadow="0">
<mxGraphModel dx="989" dy="554" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="520" pageHeight="310" math="0" shadow="0">
<root>
<mxCell id="0" />
<mxCell id="1" parent="0" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-23" value="" style="group;fillColor=#d5e8d4;strokeColor=none;strokeWidth=9;container=0;" parent="1" vertex="1" connectable="0">
<mxGeometry x="130" y="40" width="230" height="190" as="geometry" />
<mxCell id="War1bj2KuZjwGKFROEHa-1" value="" style="group" vertex="1" connectable="0" parent="1">
<mxGeometry x="10" y="10" width="230" height="190" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-26" value="" style="group;fillColor=#fff2cc;strokeColor=#d6b656;container=0;" parent="1" vertex="1" connectable="0">
<mxGeometry x="440" y="40" width="230" height="270" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-23" value="" style="group;fillColor=#d5e8d4;strokeColor=none;strokeWidth=9;container=0;" parent="War1bj2KuZjwGKFROEHa-1" vertex="1" connectable="0">
<mxGeometry width="230" height="190" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-14" value="" style="rounded=0;whiteSpace=wrap;html=1;" parent="1" vertex="1">
<mxGeometry x="440" y="40" width="230" height="190" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-1" value="" style="rounded=0;whiteSpace=wrap;html=1;" parent="War1bj2KuZjwGKFROEHa-1" vertex="1">
<mxGeometry width="230" height="190" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-15" value="" style="rounded=0;whiteSpace=wrap;html=1;strokeWidth=3;fillColor=none;" parent="1" vertex="1">
<mxGeometry x="440" y="40" width="230" height="290" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-2" value="" style="rounded=0;whiteSpace=wrap;html=1;strokeWidth=3;fillColor=none;" parent="War1bj2KuZjwGKFROEHa-1" vertex="1">
<mxGeometry width="230" height="190" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-16" value="&lt;b&gt;Cache&lt;br&gt;&lt;/b&gt;" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;fillColor=default;strokeColor=default;strokeWidth=1;" parent="1" vertex="1">
<mxGeometry x="440" y="40" width="230" height="40" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-3" value="&lt;b&gt;CacheData&lt;/b&gt;" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;fillColor=default;strokeColor=default;strokeWidth=1;" parent="War1bj2KuZjwGKFROEHa-1" vertex="1">
<mxGeometry width="230" height="40" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-17" value="Cache() = default" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;fillColor=#f5f5f5;fontColor=#333333;strokeColor=#666666;" parent="1" vertex="1">
<mxGeometry x="440" y="80" width="230" height="30" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-5" value="CacheData(uint8_t* data, size_t size)" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;fillColor=#f5f5f5;fontColor=#333333;strokeColor=#666666;" parent="War1bj2KuZjwGKFROEHa-1" vertex="1">
<mxGeometry y="40" width="230" height="30" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-18" value="void Flush(int node = -1)" style="text;html=1;strokeColor=#d6b656;fillColor=#fff2cc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="440" y="245" width="230" height="30" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-6" value="CacheData(const CacheData&amp;amp; other)" style="text;html=1;strokeColor=#82b366;fillColor=#d5e8d4;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="War1bj2KuZjwGKFROEHa-1" vertex="1">
<mxGeometry y="70" width="230" height="30" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-19" value="~Cache()" style="text;html=1;strokeColor=#b85450;fillColor=#f8cecc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="440" y="110" width="230" height="30" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-7" value="~CacheData()" style="text;html=1;strokeColor=#82b366;fillColor=#d5e8d4;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="War1bj2KuZjwGKFROEHa-1" vertex="1">
<mxGeometry y="100" width="230" height="30" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-20" value="void Init(CachePolicy*, CopyPolicy*, MemFree*, MemAlloc*)" style="text;html=1;strokeColor=#b85450;fillColor=#f8cecc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="440" y="140" width="230" height="60" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-8" value="void WaitOnCompletion()" style="text;html=1;strokeColor=#82b366;fillColor=#d5e8d4;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="War1bj2KuZjwGKFROEHa-1" vertex="1">
<mxGeometry y="130" width="230" height="30" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-21" value="" style="endArrow=none;html=1;rounded=0;exitX=1;exitY=0;exitDx=0;exitDy=0;entryX=0;entryY=0;entryDx=0;entryDy=0;strokeWidth=2;strokeColor=none;" parent="1" source="tB5LUjmhD6zCi5oJ_og_-20" target="tB5LUjmhD6zCi5oJ_og_-20" edge="1">
<mxCell id="tB5LUjmhD6zCi5oJ_og_-9" value="" style="endArrow=none;html=1;rounded=0;exitX=1;exitY=0;exitDx=0;exitDy=0;entryX=0;entryY=0;entryDx=0;entryDy=0;strokeWidth=2;strokeColor=none;" parent="War1bj2KuZjwGKFROEHa-1" source="tB5LUjmhD6zCi5oJ_og_-8" target="tB5LUjmhD6zCi5oJ_og_-8" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="970" y="410" as="sourcePoint" />
<mxPoint x="1020" y="360" as="targetPoint" />
<mxPoint x="530" y="370" as="sourcePoint" />
<mxPoint x="580" y="320" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-22" value="std::unique_ptr&amp;lt;CacheData&amp;gt; Access(uint8_t* data, size_t size)" style="text;html=1;strokeColor=#d6b656;fillColor=#fff2cc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="440" y="195" width="230" height="50" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-11" value="uint8_t* GetDataLocation() const" style="text;html=1;strokeColor=#82b366;fillColor=#d5e8d4;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="War1bj2KuZjwGKFROEHa-1" vertex="1">
<mxGeometry y="160" width="230" height="30" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-24" value="void Clear()" style="text;html=1;strokeColor=#d6b656;fillColor=#fff2cc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="440" y="275" width="230" height="30" as="geometry" />
<mxCell id="War1bj2KuZjwGKFROEHa-2" value="" style="group" vertex="1" connectable="0" parent="1">
<mxGeometry x="280" y="10" width="230" height="290" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-25" value="void Invalidate()" style="text;html=1;strokeColor=#d6b656;fillColor=#fff2cc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="440" y="305" width="230" height="25" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-26" value="" style="group;fillColor=#fff2cc;strokeColor=#d6b656;container=0;" parent="War1bj2KuZjwGKFROEHa-2" vertex="1" connectable="0">
<mxGeometry width="230" height="270" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-1" value="" style="rounded=0;whiteSpace=wrap;html=1;" parent="1" vertex="1">
<mxGeometry x="130" y="40" width="230" height="190" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-14" value="" style="rounded=0;whiteSpace=wrap;html=1;" parent="War1bj2KuZjwGKFROEHa-2" vertex="1">
<mxGeometry width="230" height="190" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-2" value="" style="rounded=0;whiteSpace=wrap;html=1;strokeWidth=3;fillColor=none;" parent="1" vertex="1">
<mxGeometry x="130" y="40" width="230" height="190" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-15" value="" style="rounded=0;whiteSpace=wrap;html=1;strokeWidth=3;fillColor=none;" parent="War1bj2KuZjwGKFROEHa-2" vertex="1">
<mxGeometry width="230" height="290" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-3" value="&lt;b&gt;CacheData&lt;/b&gt;" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;fillColor=default;strokeColor=default;strokeWidth=1;" parent="1" vertex="1">
<mxGeometry x="130" y="40" width="230" height="40" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-16" value="&lt;b&gt;Cache&lt;br&gt;&lt;/b&gt;" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;fillColor=default;strokeColor=default;strokeWidth=1;" parent="War1bj2KuZjwGKFROEHa-2" vertex="1">
<mxGeometry width="230" height="40" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-5" value="CacheData(uint8_t* data, size_t size)" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;fillColor=#f5f5f5;fontColor=#333333;strokeColor=#666666;" parent="1" vertex="1">
<mxGeometry x="130" y="80" width="230" height="30" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-17" value="Cache() = default" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;fillColor=#f5f5f5;fontColor=#333333;strokeColor=#666666;" parent="War1bj2KuZjwGKFROEHa-2" vertex="1">
<mxGeometry y="40" width="230" height="30" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-6" value="CacheData(const CacheData&amp;amp; other)" style="text;html=1;strokeColor=#82b366;fillColor=#d5e8d4;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="130" y="110" width="230" height="30" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-18" value="void Flush(int node = -1)" style="text;html=1;strokeColor=#d6b656;fillColor=#fff2cc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="War1bj2KuZjwGKFROEHa-2" vertex="1">
<mxGeometry y="205" width="230" height="30" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-7" value="~CacheData()" style="text;html=1;strokeColor=#82b366;fillColor=#d5e8d4;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="130" y="140" width="230" height="30" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-19" value="~Cache()" style="text;html=1;strokeColor=#b85450;fillColor=#f8cecc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="War1bj2KuZjwGKFROEHa-2" vertex="1">
<mxGeometry y="70" width="230" height="30" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-8" value="void WaitOnCompletion()" style="text;html=1;strokeColor=#82b366;fillColor=#d5e8d4;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="130" y="170" width="230" height="30" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-20" value="void Init(CachePolicy*, CopyPolicy*, MemFree*, MemAlloc*)" style="text;html=1;strokeColor=#b85450;fillColor=#f8cecc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="War1bj2KuZjwGKFROEHa-2" vertex="1">
<mxGeometry y="100" width="230" height="60" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-9" value="" style="endArrow=none;html=1;rounded=0;exitX=1;exitY=0;exitDx=0;exitDy=0;entryX=0;entryY=0;entryDx=0;entryDy=0;strokeWidth=2;strokeColor=none;" parent="1" source="tB5LUjmhD6zCi5oJ_og_-8" target="tB5LUjmhD6zCi5oJ_og_-8" edge="1">
<mxCell id="tB5LUjmhD6zCi5oJ_og_-21" value="" style="endArrow=none;html=1;rounded=0;exitX=1;exitY=0;exitDx=0;exitDy=0;entryX=0;entryY=0;entryDx=0;entryDy=0;strokeWidth=2;strokeColor=none;" parent="War1bj2KuZjwGKFROEHa-2" source="tB5LUjmhD6zCi5oJ_og_-20" target="tB5LUjmhD6zCi5oJ_og_-20" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="660" y="410" as="sourcePoint" />
<mxPoint x="710" y="360" as="targetPoint" />
<mxPoint x="530" y="370" as="sourcePoint" />
<mxPoint x="580" y="320" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-11" value="uint8_t* GetDataLocation() const" style="text;html=1;strokeColor=#82b366;fillColor=#d5e8d4;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="130" y="200" width="230" height="30" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-22" value="std::unique_ptr&amp;lt;CacheData&amp;gt; Access(uint8_t* data, size_t size)" style="text;html=1;strokeColor=#d6b656;fillColor=#fff2cc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="War1bj2KuZjwGKFROEHa-2" vertex="1">
<mxGeometry y="155" width="230" height="50" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-24" value="void Clear()" style="text;html=1;strokeColor=#d6b656;fillColor=#fff2cc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="War1bj2KuZjwGKFROEHa-2" vertex="1">
<mxGeometry y="235" width="230" height="30" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-25" value="void Invalidate()" style="text;html=1;strokeColor=#d6b656;fillColor=#fff2cc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="War1bj2KuZjwGKFROEHa-2" vertex="1">
<mxGeometry y="265" width="230" height="25" as="geometry" />
</mxCell>
</root>
</mxGraphModel>

BIN
thesis/images/uml-cache-and-cachedata.pdf

51
thesis/own.gls

@ -6,22 +6,6 @@
description={\textsc{\glsentrylong{iommu}:} \todo{write iommu description}}
}
\newglossaryentry{atc}{
short={ATC},
name={ATC},
long={Address Translation Cache},
first={Address Translation Cache (ATC)},
description={\textsc{\glsentrylong{atc}:} \todo{write atc description}}
}
\newglossaryentry{bar}{
short={BAR},
name={BAR},
long={Base Address Register},
first={Base Address Register (BAR)},
description={\textsc{\glsentrylong{bar}:} \todo{write bar description}}
}
\newglossaryentry{dsa}{
short={DSA},
name={DSA},
@ -35,7 +19,7 @@
name={WQ},
long={Work Queue},
first={Work Queue (WQ)},
description={\textsc{\glsentrylong{dsa:wq}:} Architectural component of the \gls{dsa} to which data is submitted by the user. See Section \ref{subsec:state:dsa-arch-comp} for more detail on its function.}
description={\textsc{\glsentrylong{dsa:wq}:} Architectural component of the \gls{dsa} to which data is submitted by the user. See Section \ref{subsec:state:dsa-arch-comp} for more detail on its function.}
}
\newglossaryentry{dsa:swq}{
@ -54,30 +38,6 @@
description={\textsc{\glsentrylong{dsa:dwq}:} A type of Work Queue only usable by one process, and therefore with potentially lower submission overhead. See Section \ref{subsec:state:dsa-arch-comp} for more detail.}
}
\newglossaryentry{pcie-dmr}{
short={DMR},
name={DMR},
long={PCIe Deferrable Memory Write Request},
first={PCIe Deferrable Memory Write Request (DMR)},
description={\textsc{\glsentrylong{pcie-dmr}:} \todo{write pcie-dmr description}}
}
\newglossaryentry{x86:enqcmd}{
short={ENQCMD},
name={ENQCMD},
long={x86 Instruction ENQCMD},
first={x86 Instruction ENQCMD},
description={\textsc{\glsentrylong{x86:enqcmd}:} \todo{write enqcmd description}}
}
\newglossaryentry{x86:movdir64b}{
short={MOVDIR64B},
name={MOVDIR64B},
long={x86 Instruction MOVDIR64B},
first={x86 Instruction MOVDIR64B},
description={\textsc{\glsentrylong{x86:movdir64b}:} \todo{write movdir64b description}}
}
\newglossaryentry{x86:pasid}{
short={PASID},
name={PASID},
@ -146,14 +106,7 @@
name={API},
long={Application Programming Interface},
first={Application Programming Interface (API)},
description={\textsc{\glsentrylong{api}:} Public functions exposed by a library, through which programs utilizing this library can interact with it.}
}
\newglossaryentry{remotemem}{
short={Remote Memory},
name={Remote Memory},
long={Remote Memory},
description={\textsc{\glsentrylong{remotemem}:} ... desc ... \todo{write remotemem description}}
description={\textsc{\glsentrylong{api}:} Definition of the interface provided by an application, enabling interaction between software components or systems.}
}
\newglossaryentry{nvram}{
