
review of everything written today, some rewording and mostly adding todos to anything that sticks out as less than optimal

master
Constantin Fürst 11 months ago
parent
commit
56805a6ad3
  1. thesis/bachelor.pdf (BIN)
  2. thesis/content/20_state.tex (20 lines changed)
  3. thesis/content/30_performance.tex (12 lines changed)
  4. thesis/content/50_implementation.tex (8 lines changed)

BIN
thesis/bachelor.pdf

20
thesis/content/20_state.tex

@@ -34,7 +34,7 @@
\section{High Bandwidth Memory}
\label{sec:state:hbm}
-\glsentrylong{hbm} is a novel memory technology promising an increase in peak bandwidth. It is composed of stacked DRAM dies \cite[p. 1]{hbm-arch-paper} and is slowly being integrated into server processors, notably the Intel® Xeon® Max Series \cite{intel:xeonmaxbrief}. \gls{hbm} on these systems may be configured in different memory modes, most notably, HBM Flat Mode and HBM Cache Mode \cite{intel:xeonmaxbrief}. The former gives applications direct control, requiring code changes while the latter utilizes the \gls{hbm} as cache for the systems DDR based main memory \cite{intel:xeonmaxbrief}. \par
+\glsentrylong{hbm} is a novel memory technology promising an increase in peak bandwidth. It is composed of stacked DRAM dies \cite[p. 1]{hbm-arch-paper} and is slowly being integrated into server processors, the Intel® Xeon® Max Series \cite{intel:xeonmaxbrief} being one recent example. \gls{hbm} on these systems may be configured in different memory modes, most notably HBM Flat Mode and HBM Cache Mode \cite{intel:xeonmaxbrief}. The former gives applications direct control and requires code changes, while the latter utilizes the \gls{hbm} as cache for the system's DDR-based main memory \cite{intel:xeonmaxbrief}. \par
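In HBM Flat Mode, the stacked memory typically appears to the operating system as additional memory-only NUMA nodes, so existing NUMA APIs can target it directly. The following is a minimal sketch of such explicit placement using libnuma; the node ID is an assumption and would have to be read from the actual topology (e.g. via numactl -H), and linking requires -lnuma.

// Minimal sketch, assuming HBM Flat Mode exposes the HBM as a
// memory-only NUMA node; the node ID below is an assumption.
#include <cstddef>
#include <numa.h>

int main() {
    if (numa_available() < 0) return 1;                 // no NUMA support
    const int hbm_node = 8;                             // assumed HBM node ID
    const size_t size = 1ul << 30;                      // 1 GiB
    void* buffer = numa_alloc_onnode(size, hbm_node);   // backed by HBM
    // ... operate on buffer ...
    numa_free(buffer, size);
    return 0;
}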
\section{Query Driven Prefetching}
@@ -55,13 +55,13 @@ Introduced with the \(4^{th}\) generation of Intel Xeon Scalable Processors, the
\label{fig:dsa-internal-block}
\end{figure}
-The \gls{dsa} chip is directly integrated into the processor and attaches via the I/O fabric interface over which all communication is conducted. Through this interface, it is accessible as a PCIe device. Therefore, configuration utilizes memory-mapped registers set in the devices \gls{bar}. Through these, the devices' layout is defined and memory pages for work submission set. In a system with multiple processing nodes, there may also be one \gls{dsa} per node, resulting in 4 being present on the previously mentioned Xeon Max CPU. \par
+The \gls{dsa} chip is directly integrated into the processor and attaches via the I/O fabric interface over which all communication is conducted. Through this interface, it is accessible as a PCIe device. Therefore, configuration utilizes memory-mapped registers set in the device's \gls{bar}. Through these, the device's layout is defined and memory pages for work submission are set. In a system with multiple processing nodes, there may also be one \gls{dsa} per node, resulting in 4 being present on the previously mentioned Xeon Max CPU. \todo{add citations to this section} \par
-To satisfy different use cases, the layout of the \gls{dsa} may be software-defined. The structure is made up of three components, namely \gls{dsa:wq}, \gls{dsa:engine} and \gls{dsa:group}. \gls{dsa:wq}s provide the means to submit tasks to the device and will be described in more detail shortly. They are marked yellow in Figure \ref{fig:dsa-internal-block}. An \gls{dsa:engine} is the processing-block that connects to memory and performs the described task. The grey block of Figure \ref{fig:dsa-internal-block} shows the subcomponents that make up an engine and the different internal paths for a batch or task descriptor. Using \gls{dsa:group}s, \gls{dsa:engine}s and \gls{dsa:wq}s are tied together, indicated by the dotted blue line around the components of Group 0 in Figure \ref{fig:dsa-internal-block}. This means, that tasks from one \gls{dsa:wq} may be processed from multiple \gls{dsa:engine}s and vice-versa, depending on the configuration. This flexibility is achieved through the Group Arbiter, represented by the orange block in Figure \ref{fig:dsa-internal-block}, which connects the two components according to the user-defined configuration. \par
+To satisfy different use cases, the layout of the \gls{dsa} may be software-defined. The structure is made up of three components, namely \gls{dsa:wq}, \gls{dsa:engine} and \gls{dsa:group}. \gls{dsa:wq}s provide the means to submit tasks to the device and will be described in more detail shortly. They are marked yellow in Figure \ref{fig:dsa-internal-block}. An \gls{dsa:engine} is the processing block that connects to memory and performs the described task. The grey block of Figure \ref{fig:dsa-internal-block} shows the subcomponents that make up an engine and the different internal paths for a batch or task descriptor \todo{too much detail for this being the first overview paragraph}. Using \gls{dsa:group}s, \gls{dsa:engine}s and \gls{dsa:wq}s are tied together, indicated by the dotted blue line around the components of Group 0 in Figure \ref{fig:dsa-internal-block}. This means that tasks from one \gls{dsa:wq} may be processed from multiple \gls{dsa:engine}s and vice-versa, depending on the configuration. This flexibility is achieved through the Group Arbiter, represented by the orange block in Figure \ref{fig:dsa-internal-block}, which connects the two components according to the user-defined configuration. \par
A \gls{dsa:wq} is accessible through so-called portals, light blue in Figure \ref{fig:dsa-internal-block}, which are mapped memory regions. Submission of work is done by writing a descriptor to one of these. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue. There are two possible queue types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes and each group may only have one attached. A \gls{pcie-dmr}, which guarantees implicit synchronization, is generated via \gls{x86:enqcmd} and communicates with the device before writing \cite[Sec. 3.3.1]{intel:dsaspec}. This may result in higher submission cost, compared to the \gls{dsa:dwq} to which a descriptor is submitted via \gls{x86:movdir64b} \cite[Sec. 3.3.2]{intel:dsaspec}. \par
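Both submission instructions are exposed as compiler intrinsics, which is how code without a helper library would reach a portal. The sketch below is ours, not taken from the thesis: it assumes a portal already mapped into the process and a fully prepared 64-byte descriptor, and needs GCC or Clang with -mmovdir64b and -menqcmd.

#include <immintrin.h>

// Submit to a dedicated WQ: MOVDIR64B is a posted 64-byte write,
// so no acceptance status is returned to the caller.
void submit_dwq(void* portal, const void* descriptor) {
    _movdir64b(portal, descriptor);
}

// Submit to a shared WQ: ENQCMD is a non-posted write that reports
// whether the descriptor was accepted (0) or rejected (1).
bool submit_swq(void* portal, const void* descriptor) {
    return _enqcmd(portal, descriptor) == 0;
}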
-To handle the different descriptors, each \gls{dsa:engine} has two internal execution paths. One for a task and the other for a batch descriptor. Processing a task descriptor is straightforward, as all information required to complete the operation are contained within. For a batch, the \gls{dsa} reads the batch descriptor, then fetches all task descriptors for the batch from memory and processes them \cite[Sec. 3.8]{intel:dsaspec}. An \gls{dsa:engine} can coordinate with the operating system in case it encounters a page fault, waiting on its resolution, if configured to do so, while otherwise, an error will be generated in this scenario \cite[Sec. 2.2, Block on Fault]{intel:dsaspec}. \par
+To handle the different descriptors, each \gls{dsa:engine} has two internal execution paths, one for a task and the other for a batch descriptor. Processing a task descriptor is straightforward, as all information required to complete the operation is contained within \todo{cite this}. For a batch, the \gls{dsa} reads the batch descriptor, then fetches all task descriptors from memory and processes them \cite[Sec. 3.8]{intel:dsaspec}. An \gls{dsa:engine} can coordinate with the operating system in case it encounters a page fault, waiting on its resolution, if configured to do so, while otherwise, an error will be generated in this scenario \cite[Sec. 2.2, Block on Fault]{intel:dsaspec}. \par
Ordering of operations is only guaranteed for a configuration with one \gls{dsa:wq} and one \gls{dsa:engine} in a \gls{dsa:group} when submitting exclusively batch or task descriptors but no mixture. Even then, only write-ordering is guaranteed, meaning that \enquote{reads by a subsequent descriptor can pass writes from a previous descriptor}. A different issue arises when an operation fails, as the \gls{dsa} will continue to process the following descriptors from the queue. Care must therefore be taken with read-after-write scenarios, either by waiting for a successful completion before submitting the dependant, inserting a drain descriptor for tasks or setting the fence flag for a batch. The latter two methods tell the processing engine that all writes must be committed and, in case of the fence in a batch, abort on previous error. \cite[Sec. 3.9]{intel:dsaspec} \par
@@ -79,11 +79,11 @@ The completion of a descriptor may be signalled through a completion record and
\label{fig:dsa-software-arch}
\end{figure}
-Since Linux Kernel 5.10, there exists a driver for the \gls{dsa} which has no counterpart in the Windows OS-Family \cite[Sec. Installation]{intel:dmldoc}, meaning that accessing the \gls{dsa} is not possible to user space applications. To interface with the driver and perform configuration operations, Intel's accel-config \cite{intel:libaccel-config-repo} user space toolset may be used which provides a command-line interface and can read configuration files to set up the device as described previously, this can be seen in the upper block titled \enquote{User space} in Figure \ref{fig:dsa-software-arch}. It interacts with the kernel driver, light green and labled \enquote{IDXD} in Figure \ref{fig:dsa-software-arch}, to achieve this. After successful configuration, each \gls{dsa:wq} is exposed as a character device by \texttt{mmap} of the associated portal \cite[Sec. 3.3]{intel:analysis}. \par
+Since Linux Kernel 5.10, there exists a driver for the \gls{dsa} which has no counterpart in the Windows OS-Family \cite[Sec. Installation]{intel:dmldoc}, meaning that accessing the \gls{dsa} is only possible under Linux. To interface with the driver and perform configuration operations, Intel's accel-config \cite{intel:libaccel-config-repo} user space toolset may be used, which provides a command-line interface and can read configuration files to set up the device as described previously. This can be seen in the upper block titled \enquote{User space} in Figure \ref{fig:dsa-software-arch}. It interacts with the kernel driver, light green and labelled \enquote{IDXD} in Figure \ref{fig:dsa-software-arch}, to achieve this. After successful configuration, each \gls{dsa:wq} is exposed as a character device by \texttt{mmap} of the associated portal \cite[Sec. 3.3]{intel:analysis}. \par
Given the file permissions, it would now be possible for a process to submit work to the \gls{dsa} via either \gls{x86:movdir64b} or \gls{x86:enqcmd} instructions, providing the descriptors by manually configuring them. This, however, is quite cumbersome, which is why \gls{intel:dml} exists. \par
-With some limitations, like lacking support for \gls{dsa:dwq} submission, this library presents a high-level interface that takes care of creation and submission of descriptors, some error handling and reporting. Thanks to the high-level-view the code may choose a different execution path at runtime which allows the memory operations to either be executed in hardware or software. The former on an accelerator or the latter using equivalent instructions provided by the library. This makes code based upon it automatically compatible with systems that do not provide hardware support. \cite[Sec. Introduction]{intel:dmldoc} \par
+With some limitations, like lacking support for \gls{dsa:dwq} submission, this library presents an interface that takes care of descriptor creation and submission, as well as error handling and reporting. Thanks to this high-level view, the code may choose a different execution path at runtime, allowing the memory operations to be executed either in hardware on an accelerator or in software using equivalent instructions provided by the library. This makes code using this library automatically compatible with systems that do not provide hardware support. \cite[Sec. Introduction]{intel:dmldoc} \par
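In the high-level C++ API, this path choice surfaces as a template parameter. A minimal sketch, following the documented dml::execute interface; the buffer contents and sizes are arbitrary placeholders.

#include <cstdint>
#include <dml/dml.hpp>
#include <vector>

int main() {
    std::vector<uint8_t> src(4096, 42), dst(4096, 0);
    // dml::hardware targets the DSA, dml::software falls back to CPU
    // instructions; dml::automatic lets the library decide at runtime.
    auto result = dml::execute<dml::automatic>(
        dml::mem_move,
        dml::make_view(src.data(), src.size()),
        dml::make_view(dst.data(), dst.size()));
    return result.status == dml::status_code::ok ? 0 : 1;
}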
\section{Programming Interface}
\label{sec:state:dml}
@@ -93,7 +93,7 @@ As mentioned in Subsection \ref{subsec:state:dsa-software-view}, \gls{intel:dml}
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth]{images/structo-dmlmemcpy.png}
-\caption{DML Memcpy Implementation}
+\caption{DML Memcpy Implementation Pseudocode}
\label{fig:dml-memcpy}
\end{figure}
@@ -101,11 +101,11 @@ In the function header of Figure \ref{fig:dml-memcpy} we notice two differences,
Choosing the engine which carries out the copy might be advantageous for performance, as we can see in Subsection \ref{subsec:perf:datacopy}. With the engine directly tied to the CPU node, as observed in Subsection \ref{subsection:dsa-hwarch}, the CPU Node ID is equivalent to the ID of the \gls{dsa}. As the library has limited NUMA support and therefore only utilizes the \gls{dsa} device on the node to which the current thread is assigned, we must assign the currently running thread to the node on which the desired \gls{dsa} resides. This is the reason for adding the parameter \texttt{int node}, which is used in the first step of Figure \ref{fig:dml-memcpy}, where we manually set the node assignment accordingly using \texttt{numa\_run\_on\_node(node)}; more information may be obtained from the respective manpage of libnuma \cite{man-libnuma}. \par
-\gls{intel:dml} operates on so-called data views which we must create from the given pointers and size in order to indicate data locations to the library. This is done using \texttt{dml::make\_view(uint8\_t* ptr, size\_t size)} with which we create views for both source and destination, labled \texttt{src\_view} and \texttt{dst\_view} in Figure \ref{fig:dml-memcpy}. \cite[Sec. High-level C++ API, Make view]{intel:dmldoc} \par
+\gls{intel:dml} operates on so-called data views which we must create from the given pointers and size in order to provide locations to the library. This is done using \texttt{dml::make\_view(uint8\_t* ptr, size\_t size)} with which we create views for both source and destination, labelled \texttt{src\_view} and \texttt{dst\_view} in Figure \ref{fig:dml-memcpy}. \cite[Sec. High-level C++ API, Make view]{intel:dmldoc} \par
-For submission, we chose to use the asynchronous style of a single descriptor in Figure \ref{fig:dml-memcpy}. This uses the function \texttt{dml::submit<path>}, which takes an operation and operation specific parameters in and returns a handler to the submitted task which can later be queried for completion of the operation. Passing the source and destination views, together with the operation \texttt{dml::mem\_copy}, we again notice one thing sticking out of the call. This is the addition to the operation specifier \texttt{.block\_on\_fault()} which submits the operation, so that it will handle a page fault by coordinating with the operating system. This only works if the device is configured to accept this flag. \cite[Sec. High-level C++ API, How to Use the Library]{intel:dmldoc} \cite[Sec. High-level C++ API, Page Fault handling]{intel:dmldoc} \par
+We submit a single descriptor using the asynchronous operation from \gls{intel:dml} in Figure \ref{fig:dml-memcpy}. This uses the function \texttt{dml::submit<path>}, which takes an operation type and the parameters specific to that type, and returns a handler to the submitted task which can later be queried for completion of the operation. Passing the source and destination views, together with the operation \texttt{dml::mem\_copy}, we again notice one element sticking out of the call. This is the addition of \texttt{.block\_on\_fault()}, which lets the \gls{dsa} handle a page fault by coordinating with the operating system. This only works if the device is configured to accept this flag. \cite[Sec. High-level C++ API, How to Use the Library]{intel:dmldoc} \cite[Sec. High-level C++ API, Page Fault handling]{intel:dmldoc} \par
-After submission, we poll for the task completion with \texttt{handler.get()} in Figure \ref{fig:dml-memcpy} and check whether the operation completed successfully. \par
+After submission, we poll for the task completion with \texttt{handler.get()} and check whether the operation completed successfully. \par
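Assembled into one piece, the steps described above yield roughly the following sketch. It mirrors the flow of Figure \ref{fig:dml-memcpy} as described in the text; the exact naming is ours, and linking requires the DML library and -lnuma.

#include <cstddef>
#include <cstdint>
#include <dml/dml.hpp>
#include <numa.h>

bool dml_memcpy(uint8_t* dst, uint8_t* src, size_t size, int node) {
    numa_run_on_node(node);  // pin the thread so DML uses the DSA on "node"

    auto src_view = dml::make_view(src, size);
    auto dst_view = dml::make_view(dst, size);

    // asynchronous submission; block_on_fault lets the DSA resolve page
    // faults with the OS (the device must be configured to allow this)
    auto handler = dml::submit<dml::hardware>(
        dml::mem_copy.block_on_fault(), src_view, dst_view);

    auto result = handler.get();  // poll until the operation completes
    return result.status == dml::status_code::ok;
}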
\section{System Setup and Configuration} \label{sec:state:setup-and-config}

12
thesis/content/30_performance.tex

@@ -12,9 +12,9 @@ The performance of \gls{dsa} has been evaluated in great detail by Reese Kuper e
\label{fig:benchmark-function}
\end{figure}
-Benchmarks were conducted on an Intel Xeon Max CPU, system configuration following Section \ref{sec:state:setup-and-config} with exclusive access to the system. As Intel's \gls{intel:dml} does not have support for \gls{dsa:dwq}, we ran benchmarks exclusively with access through \gls{dsa:swq}. The application written for the benchmarks can be obtained in source form under \texttt{benchmarks} in the thesis repository \cite{thesis-repo}. With the full source code available we only briefly describe a section of pseudocode, as seen in Figure \ref{fig:benchmark-function}, in the following paragraph. \par
+Benchmarks were conducted on an Intel Xeon Max CPU, system configuration following Section \ref{sec:state:setup-and-config} with exclusive access to the system. As \gls{intel:dml} does not have support for \gls{dsa:dwq}, we ran benchmarks exclusively with access through \gls{dsa:swq}. The application written for the benchmarks can be obtained in source form under \texttt{benchmarks} in the thesis repository \cite{thesis-repo}. With the full source code available, we only briefly describe a section of pseudocode, as seen in Figure \ref{fig:benchmark-function}, in the following paragraph. \par
-The benchmark performs node setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the nodes passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts. These timings are marked with yellow background in Figure \ref{fig:benchmark-function}, which shows that we measure three regions during our benchmark with one being the total time and the other two being time for submission and time for completion. To get accurate results, the benchmark is repeated multiple times and for small durations we only evaluate the total time, as the single measurements seemed to contain to many irregularities. At the beginning of each repetition, all threads running will synchronize by use of a barrier. The behaviour then differs depending on the submission method selected which can be a single submission or a batch of given size. This can be seen in Figure \ref{fig:benchmark-function} at the switch statement for \enquote{mode}. Single submission follows the example given in Section \ref{sec:state:dml}, and we therefore do not go into detail explaining it here. Batch submission works unlike the former. A sequence with specified size is created which tasks are then added to. This sequence is then submitted to the engine similar to the submission of a single descriptor. Further steps then follow the example from Section \ref{sec:state:dml} again. \par
+The benchmark performs node setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the nodes passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts. These timings are marked with yellow background in Figure \ref{fig:benchmark-function}. To get accurate results, the benchmark is repeated multiple times. At the beginning of each repetition, all threads running will synchronize by use of a barrier. The behaviour then differs depending on the submission method selected, which can be either a single submission or a batch of given size. This can be seen in Figure \ref{fig:benchmark-function} at the switch statement for \enquote{mode}. Single submission follows the example given in Section \ref{sec:state:dml}, and we therefore do not go into detail explaining it here. Batch submission differs from the former: a sequence of the specified size is created, to which the tasks are then added. This sequence is then submitted to the engine similarly to the submission of a single descriptor. Further steps then follow the example from Section \ref{sec:state:dml} again. \par
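A hedged sketch of the batch path, following the batch interface from the \gls{intel:dml} documentation; the chunking scheme and names are ours, and the region size is assumed to divide evenly by the task count.

#include <cstddef>
#include <cstdint>
#include <dml/dml.hpp>
#include <memory>

bool copy_batched(uint8_t* dst, uint8_t* src, size_t size, uint32_t count) {
    // create a sequence of the desired length, then add one task per chunk
    auto sequence = dml::sequence(count, std::allocator<std::uint8_t>());
    const size_t chunk = size / count;  // assumes count divides size
    for (uint32_t i = 0; i < count; ++i) {
        sequence.add(dml::mem_copy,
                     dml::make_view(src + i * chunk, chunk),
                     dml::make_view(dst + i * chunk, chunk));
    }
    // the batch descriptor is submitted just like a single descriptor
    auto handler = dml::submit<dml::hardware>(dml::batch, sequence);
    return handler.get().status == dml::status_code::ok;
}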
\section{Benchmarks}
@@ -23,7 +23,7 @@ In this section we will describe three benchmarks. Each complete with setup info
\subsection{Submission Method}
\label{subsec:perf:submitmethod}
-With each submission, descriptors must be prepared and sent off to the underlying hardware. This is expected to come with a cost, affecting throughput sizes and submission methods differently. By submitting different sizes and comparing batching and single submission, we will evaluate at which data size which submission method makes sense. We expect single submission to perform worse consistently, with a pronounced effect on smaller transfer sizes. This is assumed, as the overhead of a single submission with the \gls{dsa:swq} is incurred for every iteration, while the batch only sees this overhead once for multiple copies. \par
+With each submission, descriptors must be prepared and sent off to the underlying hardware. This is expected to come with a cost, affecting throughput differently across transfer sizes and submission methods. By submitting different sizes and comparing batching and single submission, we will evaluate which submission method performs best for each tested data size. We expect single submission to perform worse consistently, with a pronounced effect on smaller transfer sizes. This is assumed, as the overhead of a single submission with the \gls{dsa:swq} is incurred for every iteration, while the batch only sees this overhead once for multiple copies. \par
\begin{figure}[h]
\centering
@@ -39,7 +39,7 @@ Another limitation may be observed in this result, namely the inherent throughpu
\subsection{Multithreaded Submission}
\label{subsec:perf:mtsubmit}
-As we might encounter access to one \gls{dsa} from multiple threads through the \glsentrylong{dsa:swq} associated with it, determining the effect this type of access has is important. We benchmark multithreaded submission for one, two and twelve threads. Each configuration gets the same 120 copy tasks split across the available threads, all submitting to one \gls{dsa}. We perform this with sizes of 1 MiB and 1 GiB to see, if the behaviour changes with submission size. As for smaller sizes, the completion time may be faster than submission time. Therefore, smaller task sizes may see different effects of threading due to the fact that multiple threads can work to fill the queue and ensure there is never a stall due it being empty. On the other hand, we might experience lower-than-peak throughput, caused by the synchronization inherent with \gls{dsa:swq}. \par
+As we might encounter access to one \gls{dsa} from multiple threads through the \glsentrylong{dsa:swq} associated with it, determining the effect this type of access has is important. We benchmark multithreaded submission for one, two and twelve threads, with twelve representing the core count of one node on the test system. Each configuration gets the same 120 copy tasks split across the available threads, all submitting to one \gls{dsa}. We perform this with sizes of 1 MiB and 1 GiB to see if the behaviour changes with submission size. For smaller sizes, the completion time may be shorter than the submission time. Therefore, smaller task sizes may see different effects of threading, as multiple threads can work to fill the queue and ensure there is never a stall due to it being empty. On the other hand, we might experience lower-than-peak throughput with rising thread count, caused by the synchronization inherent with \gls{dsa:swq}. \par
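The synchronization pattern described here can be sketched with C++20 primitives; this is an illustration of the barrier-synchronized start, not the benchmark code itself, and the submission body is left as a placeholder.

#include <barrier>
#include <thread>
#include <vector>

void run_repetition(int thread_count, int tasks_total) {
    std::barrier sync(thread_count);          // must outlive the workers
    std::vector<std::jthread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&] {
            sync.arrive_and_wait();           // all threads start together
            const int my_tasks = tasks_total / thread_count;  // remainder ignored
            for (int i = 0; i < my_tasks; ++i) {
                // submit one copy task to the shared SWQ here
            }
        });
    }
}   // jthread destructors join the workers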
\begin{figure}[h]
\centering
@@ -55,7 +55,7 @@ In Figure \ref{fig:perf-mtsubmit} we see that threading has no negative impact.
Moving data from DDR to \gls{hbm} is most relevant to the rest of this work, as it is the target application. As we discovered in Section \ref{subsec:perf:submitmethod}, one \gls{dsa} has a peak bandwidth limit of 30 GiB/s. We write to \gls{hbm} with its theoretical peak of 256 GB/s \cite[Table I]{hbm-arch-paper}. Our top speed is therefore limited by the slower main memory. For each node, the test system is configured with two DIMMs of DDR5-4800. We calculate the theoretical throughput as follows: \(2\ \text{DIMMs} \times 4800\ \frac{\text{MT}}{\text{s} \cdot \text{DIMM}} \times \frac{64\ \text{bit}}{8\ \text{bit/byte}} = 76800\ \text{MB/s} \approx 71.5\ \text{GiB/s}\)\todo{how to verify this calculation with a citation? is it even correct because we see close to 100 GiB/s throughput?}. We conclude that to achieve closer-to-peak speeds, a copy task has to be split across multiple \gls{dsa}. \par
-Two methods of splitting will be evaluated. The first being a brute force approach, utilizing all available for any transfer direction. The seconds' behaviour depends on the data source and destination locations. As our system has multiple sockets, communication crossing the socket could experience a latency and bandwidth disadvantage. We argue that for intra-socket transfers, use of the \gls{dsa} from the second socket will have only marginal effect. For transfers crossing the socket, every \gls{dsa} chip will perform badly, and we choose to only use the ones with the fastest possible access, namely the ones on the destination and source node. This might result in lower performance but also uses one fourth of the engines of the brute force approach for inter-socket and half for intra-socket. This gives other threads the chance to utilize these now-free chips. \par
+Two methods of splitting will be evaluated. The first is a brute force approach, utilizing all available \gls{dsa} chips for any transfer direction. The second's behaviour depends on the data source and destination locations. As our system has multiple sockets, communication crossing the socket could experience a latency and bandwidth disadvantage \todo{cite this}. We argue that for intra-socket transfers, use of the \gls{dsa} from the second socket will have only marginal effect. For transfers crossing the socket, every \gls{dsa} chip will perform badly, and we choose to only use the ones with the fastest possible access, namely the ones on the source and destination nodes. This might result in lower performance, but it also uses only one fourth of the engines the brute force approach uses for inter-socket transfers, and half for intra-socket ones. This gives other threads the chance to utilize these now-free chips. \par
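As an illustration of the second method, the selection logic can be reduced to a few lines. Two assumptions are ours, not taken from the thesis code: the DSA ID equals the ID of the CPU node it is attached to, and HBM node n+8 pairs with CPU node n, matching the node numbering of the test system.

#include <set>

// map a memory node to the CPU node whose DSA has the fastest access,
// assuming HBM node n+8 pairs with CPU node n (test-system numbering)
int cpu_node_of(int node) { return node < 8 ? node : node - 8; }

std::set<int> select_dsa_nodes(int src_node, int dst_node) {
    // use only the DSAs on the source and destination nodes;
    // std::set de-duplicates the intra-node case
    return { cpu_node_of(src_node), cpu_node_of(dst_node) };
}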
\begin{figure}[h]
\centering
@@ -80,7 +80,7 @@ For this benchmark, we copy 1 Gibibyte of data from node 0 to the destination no
\label{fig:perf-peak-smart}
\end{figure}
-From the brute force approach in Figure \ref{fig:perf-peak-brute}, we observe peak speeds of 96 GiB/s when copying across the socket from Node 0 to Node 15. This invalidates our assumption, that peak bandwidth would be achieved in the intra-socket scenario \todo{find out why maybe?} and goes against the calculated peak throughput of the memory on our system \todo{calculation wrong? other factors?}. While using significantly more resources, the brute force copy shown in Figure \ref{fig:perf-peak-brute} outperforms the smart approach from Figure \ref{fig:perf-peak-smart}. We observe an increase in transfer speed by utilizing all available \gls{dsa} of 2 GiB/s for copy to Node 8, 18 GiB/s for Node 11 and 12 and 30 GiB/s for Node 15. From this we conclude that the smart copy assignment is worth to use. Even though it affects the peak performance of one copy, the possibility of completing a second intra-socket copy at the same speed in parallel makes up for this loss in our eyes. \par
+From the brute force approach in Figure \ref{fig:perf-peak-brute}, we observe peak speeds of 96 GiB/s when copying across the socket from Node 0 to Node 15. This invalidates our assumption that peak bandwidth would be achieved in the intra-socket scenario \todo{find out why maybe?} and goes against the calculated peak throughput of the memory on our system \todo{calculation wrong? other factors?}. While using significantly more resources, the brute force copy shown in Figure \ref{fig:perf-peak-brute} outperforms the smart approach from Figure \ref{fig:perf-peak-smart}. By utilizing all available \gls{dsa} chips, we observe an increase in transfer speed of 2 GiB/s for the copy to Node 8, 18 GiB/s for Nodes 11 and 12, and 30 GiB/s for Node 15. From this we conclude that the smart copy assignment is still worth using: although it reduces the peak performance of a single copy, it frees enough resources to complete a second intra-socket copy in parallel at the same speed, which in our eyes makes up for the loss. \par
\subsection{Data Movement using CPU}

8
thesis/content/50_implementation.tex

@@ -40,19 +40,19 @@ The choice made in \ref{subsec:design:cache-entry-reuse} requires thread safe sh
It was therefore decided to implement atomic reference counting for \texttt{CacheData}. This means providing a custom constructor and destructor wherein an atomic integer, shared across copies through a standard pointer, is incremented and decremented using atomic fetch add and sub operations \cite{cppreference:atomic-operations}. When the decrement in the destructor signals that the last reference is being destroyed, the actual destruction is performed. The otherwise achievable invalid state of \texttt{CacheData} is also avoided. To achieve this, the waiting algorithm requires the handlers to be contained in an atomic pointer, and the pointer to the cache memory must be atomic too. Through this we may use the atomic wait operation, which is guaranteed by the standard to be more efficient than simply spinning on Compare-And-Swap \cite{cppreference:atomic-wait}. Some standard implementations achieve this by yielding after a short spin cycle \cite{atomic-wait-details}. \par
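A condensed sketch of the counting scheme just described; the real \texttt{CacheData} has further members, and copy assignment is omitted for brevity.

#include <atomic>
#include <cstdint>

class CacheData {
    std::atomic<int32_t>* reference_count_;  // shared by all copies

public:
    CacheData() : reference_count_(new std::atomic<int32_t>(1)) {}

    CacheData(const CacheData& other)
        : reference_count_(other.reference_count_) {
        reference_count_->fetch_add(1, std::memory_order_relaxed);
    }

    ~CacheData() {
        // fetch_sub returns the previous value: 1 means this was the
        // last reference, so perform the actual destruction
        if (reference_count_->fetch_sub(1, std::memory_order_acq_rel) == 1) {
            // ... release cache memory and handlers here ...
            delete reference_count_;
        }
    }
};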
-\begin{figure}[H]
+\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth]{images/sequenzdiagramm-waitoncompletion.png}
-\caption{Sequence diagram for threading scenario in \texttt{CacheData::WaitOnCompletion}}
+\caption{Sequence for Blocking Scenario}
\label{fig:impl-cachedata-threadseq-waitoncompletion}
\end{figure}
Designing the wait to work from any thread was complicated. In the first implementation, a thread would check if the handlers are available and, if not, atomically wait \cite{cppreference:atomic-wait} on a value change from nullptr. As the handlers are only available after submission, a situation could arise where only one copy of \texttt{CacheData} is capable of actually waiting on them. Let's assume that three threads \(T_1\), \(T_2\) and \(T_3\) wish to access the same resource. \(T_1\) is the first to call \texttt{CacheData::Access}, therefore adds it to the cache state and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted, and \(T_2\) and \(T_3\) obtain access to the incomplete \texttt{CacheData} on which they wait, causing them to see a nullptr for the handlers and an invalid cache pointer, leading them to wait atomically on the cache pointer (blue lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). Now \(T_1\) submits the work and sets the handlers (red lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) and \(T_3\) continue to wait. Now only \(T_1\) can end the waiting and is therefore capable of keeping \(T_2\) and \(T_3\) from progressing. This is undesirable as it can lead to deadlocking if for some reason \(T_1\) does not wait, and at the very least may lead to unnecessary delay for \(T_2\) and \(T_3\) if \(T_1\) does not wait immediately. \par
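The C++20 primitive underlying this scenario is shown below as a minimal sketch: std::atomic::wait blocks until the value changes from the one passed in, and notify_all wakes the blocked waiters. The function names are illustrative placeholders.

#include <atomic>
#include <cstdint>

std::atomic<uint8_t*> cache_ptr{nullptr};

void waiter() {
    cache_ptr.wait(nullptr);   // returns once cache_ptr != nullptr
    // ... safe to read the cached data now ...
}

void completer(uint8_t* data) {
    cache_ptr.store(data);
    cache_ptr.notify_all();    // wake all threads blocked in wait()
}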
-\begin{figure}[H]
+\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth]{images/structo-cachedata-waitoncompletion.png}
-\caption{Code Flow Diagram for \texttt{CacheData::WaitOnCompletion}}
+\caption{\texttt{CacheData::WaitOnCompletion} Pseudocode}
\label{fig:impl-cachedata-waitoncompletion}
\end{figure}
