4 Commits

  1. BIN
      benchmarks/benchmark-plots/plot-1dsa-throughput.pdf
  2. BIN
      benchmarks/benchmark-plots/plot-2dsa-throughput.pdf
  3. BIN
      benchmarks/benchmark-plots/plot-4dsa-throughput.pdf
  4. BIN
      benchmarks/benchmark-plots/plot-8cpu-throughput.pdf
  5. BIN
      benchmarks/benchmark-plots/plot-8dsa-throughput.pdf
  6. BIN
      benchmarks/benchmark-plots/plot-andrepeak-throughput.pdf
  7. BIN
      benchmarks/benchmark-plots/plot-average-throughput.pdf
  8. BIN
      benchmarks/benchmark-plots/plot-dsa-throughput-scaling.pdf
  9. 14
      benchmarks/benchmark-plotters/plot-perf-peakthroughput-bar.py
  10. BIN
      thesis/bachelor.pdf
  11. 4
      thesis/content/02_abstract.tex
  12. 8
      thesis/content/10_introduction.tex
  13. 24
      thesis/content/20_state.tex
  14. 76
      thesis/content/30_performance.tex
  15. 4
      thesis/content/60_evaluation.tex
  16. 2
      thesis/content/70_conclusion.tex
  17. BIN
      thesis/images/hbm-design-layout.png
  18. BIN
      thesis/images/plot-1dsa-throughput.pdf
  19. BIN
      thesis/images/plot-2dsa-throughput.pdf
  20. BIN
      thesis/images/plot-4dsa-throughput.pdf
  21. BIN
      thesis/images/plot-8cpu-throughput.pdf
  22. BIN
      thesis/images/plot-8dsa-throughput.pdf
  23. BIN
      thesis/images/plot-andrepeak-throughput.pdf
  24. BIN
      thesis/images/plot-average-throughput.pdf
  25. BIN
      thesis/images/plot-dsa-throughput-scaling.pdf
  26. 27
      thesis/own.bib
  27. 17
      thesis/own.gls

BIN
benchmarks/benchmark-plots/plot-1dsa-throughput.pdf

BIN
benchmarks/benchmark-plots/plot-2dsa-throughput.pdf

BIN
benchmarks/benchmark-plots/plot-4dsa-throughput.pdf

BIN
benchmarks/benchmark-plots/plot-8cpu-throughput.pdf

BIN
benchmarks/benchmark-plots/plot-8dsa-throughput.pdf

BIN
benchmarks/benchmark-plots/plot-andrepeak-throughput.pdf

BIN
benchmarks/benchmark-plots/plot-average-throughput.pdf

BIN
benchmarks/benchmark-plots/plot-dsa-throughput-scaling.pdf

14
benchmarks/benchmark-plotters/plot-perf-peakthroughput-bar.py

@@ -68,12 +68,13 @@ def plot_bar(table,node_config,display_x,display_y):
sns.barplot(x=x_label, y=y_label, data=table, palette="mako", errorbar="sd")
plt.ylim(0, 75)
plt.ylim(0, 70)
plt.yticks([15,30,45,60,65])
plt.xlabel(display_x)
plt.ylabel(display_y)
plt.savefig(os.path.join(output_path, f"plot-{node_config}-throughput.pdf"), bbox_inches='tight')
plt.show()
def PlotAndrePeakResults():
data_peakbench_andre = [
@@ -147,10 +148,11 @@ if __name__ == "__main__":
]
scaling_df = pd.DataFrame(data_scaling)
sns.lineplot(x=x_dsacount, y=y_scaling, data=scaling_df, marker='o', linestyle='-', color='b', markersize=8)
plt.figure(figsize=(2, 3))
fig = sns.lineplot(x=x_dsacount, y=y_scaling, data=scaling_df, marker='o', linestyle='-', color='b', markersize=8)
plt.xticks([1,2,4,8])
plt.yticks([0.25,0.5,0.75,1.0])
plt.xlim(0,10)
plt.ylim(0,1.25)
plt.ylim(0.2,1.05)
plt.savefig(os.path.join(output_path, f"plot-dsa-throughput-scaling.pdf"), bbox_inches='tight')
plt.show()

BIN
thesis/bachelor.pdf

4
thesis/content/02_abstract.tex

@@ -8,9 +8,7 @@
% geben (für irgendetwas müssen die Betreuer ja auch noch da
% sein).
This bachelor's thesis explores data locality in heterogeneous memory systems, characterized by advancements in main memory technologies such as Non-Volatile RAM (NVRAM), High Bandwidth Memory (HBM), and Remote Memory. Systems equipped with more than one type of main memory necessitate strategic decisions regarding data placement to take advantage of the properties of the different storage tiers. In response to this challenge, Intel has introduced the Data Streaming Accelerator (DSA), which offloads data operations, offering a potential avenue for enhancing efficiency in data-intensive applications. The primary objective of this thesis is to provide a comprehensive analysis and characterization of the architecture and performance of the DSA, along with its application to a domain-specific prefetching methodology aimed at accelerating database queries within heterogeneous memory systems. We contribute a versatile implementation of a cache, offloading copy operations to the DSA.
\todo{double "offloading" maybe reformulate the last sentence}
This bachelor's thesis explores data locality in heterogeneous memory systems, characterized by advancements in main memory technologies such as Non-Volatile RAM (NVRAM), High Bandwidth Memory (HBM), and Remote Memory. Systems equipped with more than one type of main memory necessitate strategic decisions regarding data placement to take advantage of the properties of the different storage tiers. In response to this challenge, Intel has introduced the Data Streaming Accelerator (DSA), which offloads data operations, offering a potential avenue for enhancing efficiency in data-intensive applications. The primary objective of this thesis is to provide a comprehensive analysis and characterization of the architecture and performance of the DSA, along with its application to a domain-specific prefetching methodology aimed at accelerating database queries within heterogeneous memory systems. We contribute a versatile library, capable of performing caching, data replication and prefetching asynchronously, accelerated by the \gls{dsa}.
%%% Local Variables:
%%% TeX-master: "diplom"

8
thesis/content/10_introduction.tex

@@ -12,15 +12,13 @@
% den Rest der Arbeit. Meist braucht man mindestens 4 Seiten dafür, mehr
% als 10 Seiten liest keiner.
The proliferation of various technologies, such as Non-Volatile RAM (NVRAM), High Bandwidth Memory (HBM), and Remote Memory, has ushered in a diverse landscape of systems characterized by varying tiers of main memory. Within these systems, the movement of data across memory classes becomes imperative to leverage the distinct properties offered by the available technologies. The responsibility for maintaining optimal data placement falls upon the CPU, resulting in a reduction of available cycles for computational tasks. To mitigate this strain, certain current-generation Intel server processors feature the \glsentryfirst{dsa}, to which certain data operations may be offloaded \cite{intel:xeonbrief}. This thesis undertakes the challenge of optimizing data locality in heterogeneous memory architectures, utilizing the \gls{dsa}. \par
The proliferation of various technologies, such as \gls{nvram}, \gls{hbm}, and \gls{remotemem}, has ushered in a diverse landscape of systems characterized by varying tiers of main memory. Within these systems, the movement of data across memory classes becomes imperative to leverage the distinct properties offered by the available technologies. The responsibility for maintaining optimal data placement falls upon the CPU, resulting in a reduction of available cycles for computational tasks. To mitigate this strain, certain current-generation Intel server processors feature the \glsentryfirst{dsa}, to which certain data operations may be offloaded \cite{intel:xeonbrief}. This thesis undertakes the challenge of optimizing data locality in heterogeneous memory architectures, utilizing the \gls{dsa}. \par
The primary objectives of this thesis are twofold. Firstly, it involves a comprehensive analysis and characterization of the architecture of the Intel \gls{dsa}. Secondly, the focus extends to the application of \gls{dsa} in the domain-specific context of \glsentryfirst{qdp} to accelerate database queries \cite{dimes-prefetching}. \par
This work introduces significant contributions to the field. Notably, the design and implementation of an offloading cache represent a key highlight, providing an interface for leveraging the strengths of tiered storage with minimal integration efforts. Its design and implementation make up a large part of this work. This resulted in an architecture applicable to any use case requiring \glsentryshort{numa}-aware data movement with offloading support to the \gls{dsa}, while giving thread safety guarantees. Additionally, the thesis includes a detailed examination and analysis of the strengths and weaknesses of the \gls{dsa} through microbenchmarks. These benchmarks serve as practical guidelines, offering insights for the optimal application of \gls{dsa} in various scenarios. As of the time of writing, this thesis stands as the first scientific work to extensively evaluate the \gls{dsa} in a multi-socket system and provide benchmarks for programming through the \glsentryfirst{intel:dml}. Furthermore, performance for data movement from \glsentryshort{dram} to \glsentryfirst{hbm} using \gls{dsa} has, to our knowledge, not yet been evaluated by the scientific community. \par
This work introduces significant contributions to the field. Notably, the design and implementation of an offloading cache represent a key highlight, providing an interface for leveraging the strengths of tiered storage with minimal integration efforts. Its design and implementation make up a large part of this work. This resulted in an architecture applicable to any use case requiring \glsentryshort{numa}-aware data movement with offloading support to the \gls{dsa}. Additionally, the thesis includes a detailed examination and analysis of the strengths and weaknesses of the \gls{dsa} through microbenchmarks. These benchmarks serve as practical guidelines, offering insights for the optimal application of \gls{dsa} in various scenarios. To our knowledge, this thesis stands as the first scientific work to extensively evaluate the \gls{dsa} in a multi-socket system, provide benchmarks for programming through the \glsentryfirst{intel:dml} and evaluate performance for data movement from \glsentryshort{dram} to \glsentryfirst{hbm}. \par
We begin the work by furnishing the reader with pertinent technical information necessary for understanding the subsequent sections of this work in Chapter \ref{chap:state}. Background is given for \gls{hbm} and \gls{qdp}, followed by a detailed account of the \gls{dsa}s architecture along with an available programming interface. Additionally, guidance on system setup and configuration is provided. Subsequently, Chapter \ref{chap:perf} analyses the strengths and weaknesses of the \gls{dsa} through microbenchmarks. Methodologies are presented, each benchmark is elaborated upon in detail, and usage guidance is drawn from the results. Chapters \ref{chap:design} and \ref{chap:implementation} elucidate the practical aspects of the work, including the development of the interface and implementation of the cache, shedding light on specific design considerations and implementation challenges. We comprehensively assess the implemented solution by providing concrete data on gains for an exemplary database query in Chapter \ref{chap:evaluation}. Finally, Chapter \ref{chap:conclusion} reflects insights gained, and presents a review of the contributions and results of the preceding chapters. \par
\todo{shorten intro by two lines so that it fits on one page}
We begin the work by furnishing the reader with pertinent technical information necessary for understanding the subsequent sections of this work in Chapter \ref{chap:state}. Background is given for \gls{hbm} and \gls{qdp}, followed by a detailed account of the \gls{dsa}'s architecture along with an available programming interface. Additionally, guidance on system setup and configuration is provided. Subsequently, Chapter \ref{chap:perf} analyses the strengths and weaknesses of the \gls{dsa} through microbenchmarks. Each benchmark is elaborated upon in detail, and usage guidance is drawn from the results. Chapters \ref{chap:design} and \ref{chap:implementation} elucidate the practical aspects of the work, including the development of the interface and implementation of the cache, shedding light on specific design considerations and implementation challenges. We comprehensively assess the implemented solution by providing concrete data on gains for an exemplary database query in Chapter \ref{chap:evaluation}. Finally, Chapter \ref{chap:conclusion} reflects on the insights gained and presents a review of the contributions and results of the preceding chapters. \par
%%% Local Variables:
%%% TeX-master: "diplom"

24
thesis/content/20_state.tex

@@ -34,10 +34,14 @@ This chapter introduces the relevant technologies and concepts for this thesis.
\section{\glsentrylong{hbm}}
\label{sec:state:hbm}
\glsentrylong{hbm} is an emerging memory technology that promises an increase in peak bandwidth. It consists of stacked \glsentryshort{dram} dies \cite[p. 1]{hbm-arch-paper} and is gradually being integrated into server processors, with the Intel® Xeon® Max Series \cite{intel:xeonmaxbrief} being one recent example. \gls{hbm} on these systems can be configured in different memory modes, most notably, HBM Flat Mode and HBM Cache Mode \cite{intel:xeonmaxbrief}. The former gives applications direct control, requiring code changes, while the latter utilizes the \gls{hbm} as a cache for the system's \glsentryshort{dram}-based main memory \cite{intel:xeonmaxbrief}. \par
\begin{figure}[!t]
\centering
\includegraphics[width=0.9\textwidth]{images/hbm-design-layout.png}
\caption{\glsentrylong{hbm} Design Layout. Shows that an \glsentryshort{hbm}-module consists of stacked DRAM and a logic die. \cite{amd:hbmoverview}}
\label{fig:hbm-layout}
\end{figure}
\todo{find a nice graphic for hbm}
\todo{potential for futurework: analyze performance of using cache-mode}
\glsentrylong{hbm} is an emerging memory technology that promises an increase in peak bandwidth. As visible in Figure \ref{fig:hbm-layout}, it consists of stacked \glsentryshort{dram} dies \cite[p. 1]{hbm-arch-paper} and is gradually being integrated into server processors, with the Intel® Xeon® Max Series \cite{intel:xeonmaxbrief} being one recent example. \gls{hbm} on these systems can be configured in different memory modes, most notably, \enquote{HBM Flat Mode} and \enquote{HBM Cache Mode} \cite{intel:xeonmaxbrief}. The former gives applications direct control, requiring code changes, while the latter utilizes the \gls{hbm} as a cache for the system's \glsentryshort{dram}-based main memory \cite{intel:xeonmaxbrief}. \par
\section{\glsentrylong{qdp}}
\label{sec:state:qdp}
@@ -81,7 +85,7 @@ The \gls{dsa} chip is directly integrated into the processor and attaches via th
\subsubsection{Virtual Address Resolution}
An important aspect of modern computer systems is the separation of address spaces through virtual memory \todo{maybe add cite and reformulate}. Therefore, the \gls{dsa} must handle address translation because a process submitting a task will not know the physical location in memory \todo{of what}, causing the descriptor to contain virtual addresses. To resolve these to physical addresses, the Engine communicates with the \gls{iommu} to perform this operation, as visible in the outward connections at the top of Figure \ref{fig:dsa-internal-block}. Knowledge about the submitting processes is required for this resolution. Therefore, each task descriptor has a field for the \gls{x86:pasid} which is filled by the \gls{x86:enqcmd} instruction for a \gls{dsa:swq} \cite[Sec. 3.3.1]{intel:dsaspec} or set statically after a process is attached to a \gls{dsa:dwq} \cite[Sec. 3.3.2]{intel:dsaspec}. \par
An important aspect of computer systems is the abstraction of physical memory addresses through virtual memory \cite{virtual-memory}. Therefore, the \gls{dsa} must handle address translation because a process submitting a task will not know the physical location in memory of its data, causing the descriptor to contain virtual addresses. To resolve these to physical addresses, the Engine communicates with the \gls{iommu} to perform this operation, as visible in the outward connections at the top of Figure \ref{fig:dsa-internal-block}. Knowledge about the submitting processes is required for this resolution. Therefore, each task descriptor has a field for the \gls{x86:pasid} which is filled by the \gls{x86:enqcmd} instruction for a \gls{dsa:swq} \cite[Sec. 3.3.1]{intel:dsaspec} or set statically after a process is attached to a \gls{dsa:dwq} \cite[Sec. 3.3.2]{intel:dsaspec}. \par
\subsubsection{Completion Signalling}
\label{subsubsec:state:completion-signal}
@@ -102,13 +106,9 @@ Ordering of operations is only guaranteed for a configuration with one \gls{dsa:
\label{fig:dsa-software-arch}
\end{figure}
Since Linux Kernel 5.10, there exists a driver for the \gls{dsa} which has no counterpart in the Windows OS-Family \cite[Sec. Installation]{intel:dmldoc} and other operating systems. Therefore, accessing the \gls{dsa} is only possible under Linux. To interact with the driver and perform configuration operations, Intel's accel-config \cite{intel:libaccel-config-repo} user-space toolset can be utilized. This application provides a command-line interface and can read configuration files to set up the device. The interaction is depicted in the upper block titled \enquote{User space} in Figure \ref{fig:dsa-software-arch}. It interacts with the kernel driver, visible in light green and labelled \enquote{IDXD} in Figure \ref{fig:dsa-software-arch}. After successful configuration, each \gls{dsa:wq} is exposed as a character device through \texttt{mmap} of the associated portal \cite[Sec. 3.3]{intel:analysis}. \par
With the appropriate file permissions, a process could submit work to the \gls{dsa} using either the \gls{x86:movdir64b} or \gls{x86:enqcmd} instructions, providing the descriptors by manual configuration. However, this process can be cumbersome, which is why \gls{intel:dml} exists. \par
With some limitations, like lacking support for \gls{dsa:dwq} submission, this library presents an interface that takes care of creation and submission of descriptors, and error handling and reporting. Thanks to the high-level-view the code may choose a different execution path at runtime which allows the memory operations to either be executed in hardware or software. The former on an accelerator or the latter using equivalent instructions provided by the library. This makes code using this library automatically compatible with systems that do not provide hardware support. \cite[Sec. Introduction]{intel:dmldoc} \par
Since the Linux Kernel version 5.10, a driver for the \gls{dsa} has been available, which currently lacks a counterpart on Windows Operating Systems \cite[Sec. Installation]{intel:dmldoc}. As a result, accessing the \gls{dsa} is only possible under Linux. To interact with the driver and perform configuration operations, Intel provides the accel-config user-space application \cite{intel:libaccel-config-repo}. This toolset offers a command-line interface and can read configuration files to configure the device, as mentioned in Section \ref{subsection:dsa-hwarch}. The interaction is illustrated in the upper block labelled \enquote{User space} in Figure \ref{fig:dsa-software-arch}, where it communicates with the kernel driver, depicted in light green and labelled \enquote{IDXD} in Figure \ref{fig:dsa-software-arch}. Once successfully configured, each \gls{dsa:wq} is exposed as a character device through \texttt{mmap} of the associated portal \cite[Sec. 3.3]{intel:analysis}. \par
\todo{reformulate section}
While a process could theoretically submit work to the \gls{dsa} using either the \gls{x86:movdir64b} or \gls{x86:enqcmd} instructions, providing descriptors through manual configuration, this approach can be cumbersome. Hence, \gls{intel:dml} exists to streamline this process. Despite some limitations, such as the lack of support for \gls{dsa:dwq} submission, this library offers an interface that manages the creation and submission of descriptors, as well as error handling and reporting. The high-level abstraction offered enables compatibility measures, allowing code developed for the \gls{dsa} to also execute on machines without the required hardware \cite[Sec. High-level C++ API, Advanced usage]{intel:dmldoc}. \par
\section{Programming Interface for \glsentrylong{dsa}}
\label{sec:state:dml}
@@ -126,13 +126,13 @@ In the function header of Figure \ref{fig:dml-memcpy} two differences from stand
The \texttt{path} parameter allows the selection of the executing device, which can be either the CPU or \gls{dsa}. The options include \texttt{dml::software} (CPU), \texttt{dml::hardware} (\gls{dsa}), and \texttt{dml::automatic}, where the latter dynamically selects the device at runtime, favoring \gls{dsa} over CPU execution \cite[Sec. Quick Start]{intel:dmldoc}. \par
Choosing the engine which carries out the copy might be advantageous for performance, as we can see in Section \ref{subsec:perf:datacopy}. With the engine directly tied to the processing node, as observed in Section \ref{subsection:dsa-hwarch} \todo{not really observed there, cite maybe}, the node ID is equivalent to the ID of the \gls{dsa}. \par
Choosing the engine which carries out the copy might be advantageous for performance, as we can see in Section \ref{subsec:perf:datacopy}. This can be achieved either by pinning the current thread to the \gls{numa:node} that the device is located on, or by using optional parameters of \texttt{dml::submit} \cite[Sec. High-level C++ API, NUMA support]{intel:dmldoc}. As evident from Figure \ref{fig:dml-memcpy}, we chose the former option for this example, using \texttt{numa\_run\_on\_node} to restrict the current thread to run on the given node. Since this is only an example, potential side effects of modifying the \glsentryshort{numa}-assignment through this pseudocode are not relevant. \par
\gls{intel:dml} operates on data views, which we create from the given pointers to source and destination and size. This is done using \texttt{dml::make\_view(uint8\_t* ptr, size\_t size)}, visible in Figure \ref{fig:dml-memcpy}, where these views are labelled \texttt{src\_view} and \texttt{dst\_view}. \cite[Sec. High-level C++ API, Make view]{intel:dmldoc} \par
In Figure \ref{fig:dml-memcpy}, we submit a single descriptor using the asynchronous operation from \gls{intel:dml}. This uses the function \texttt{dml::submit<path>}, which takes an operation type and parameters specific to the selected type and returns a handler to the submitted task. For the copy operation, we pass the two views created previously. The provided handler can later be queried for the completion of the operation. After submission, we poll for the task completion with \texttt{handler.get()} and check whether the operation completed successfully. \par
A noteworthy addition to the submission-call is the use of \texttt{.block\_on\_fault()}, enabling the \gls{dsa} to manage a page fault by coordinating with the operating system. It's essential to highlight that this functionality only operates if the device is configured to accept this flag. \cite[Sec. High-level C++ API, How to Use the Library]{intel:dmldoc} \cite[Sec. High-level C++ API, Page Fault handling]{intel:dmldoc}
A noteworthy addition to the submission-call is the use of \texttt{.block\_on\_fault()}, enabling the \gls{dsa} to manage a page fault by coordinating with the operating system. It's essential to highlight that this functionality only operates if the device is configured to accept this flag. \cite[Sec. High-level C++ API, How to Use the Library]{intel:dmldoc} \cite[Sec. High-level C++ API, Page Fault handling]{intel:dmldoc}. \par
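To make the walk-through above concrete, here is a minimal C++ sketch in the spirit of the memcpy example referenced as Figure fig:dml-memcpy: it pins the thread with numa_run_on_node, builds views with dml::make_view, submits asynchronously on the hardware path with .block_on_fault() chained to the operation, and polls the returned handler. The header name, the operation dml::mem_copy and the status check via dml::status_code::ok are assumptions based on the DML documentation, not text quoted above.

```cpp
// Minimal sketch of an offloaded copy with Intel DML's high-level C++ API.
// Assumptions: header name, dml::mem_copy and dml::status_code::ok follow the
// DML documentation; they are not quoted from the thesis text above.
#include <dml/dml.hpp>
#include <numa.h>

#include <cstddef>
#include <cstdint>

bool offloaded_copy(uint8_t* src, uint8_t* dst, size_t size, int node) {
    // Pin the calling thread to the node housing the desired DSA
    // (the first of the two options described above).
    numa_run_on_node(node);

    // DML operates on views over the raw pointers.
    auto src_view = dml::make_view(src, size);
    auto dst_view = dml::make_view(dst, size);

    // Asynchronous submission on the hardware path; block_on_fault() lets the
    // DSA resolve page faults with the OS, if the device is configured for it.
    auto handler = dml::submit<dml::hardware>(
        dml::mem_copy.block_on_fault(), src_view, dst_view);

    // Poll for completion and check the result.
    auto result = handler.get();
    return result.status == dml::status_code::ok;
}
```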
\section{System Setup and Configuration} \label{sec:state:setup-and-config}

76
thesis/content/30_performance.tex

@@ -1,11 +1,7 @@
\chapter{Performance Microbenchmarks}
\label{chap:perf}
In this chapter, we measure the performance of the \gls{dsa}, with the goal to determine an effective utilization strategy to apply the \gls{dsa} to \gls{qdp}. In Section \ref{sec:perf:method} we lay out our benchmarking methodology, then perform benchmarks in \ref{sec:perf:bench} and finally summarize our findings in \ref{sec:perf:analysis}. \par
The performance of \gls{dsa} has been evaluated in great detail by Reese Kuper et al. in \cite{intel:analysis}. Therefore, we will perform only a limited amount of benchmarks with the purpose of verifying the figures from \cite{intel:analysis} and analysing best practices and restrictions for applying \gls{dsa} to \gls{qdp}. \par
\todo{reformulate}
In this chapter, we measure the performance of the \gls{dsa}, with the goal to determine an effective utilization strategy to apply the \gls{dsa} to \gls{qdp}. In Section \ref{sec:perf:method} we lay out our benchmarking methodology, then perform benchmarks in \ref{sec:perf:bench} and finally summarize our findings in \ref{sec:perf:analysis}. As the performance of the \gls{dsa} has been evaluated in great detail by Reese Kuper et al. in \cite{intel:analysis}, we will perform only a limited amount of benchmarks with the purpose of determining behaviour in a multi-socket system, penalties from using \gls{intel:dml} and throughput for transfers from \glsentryshort{dram} to \gls{hbm}. \par
\section{Benchmarking Methodology}
\label{sec:perf:method}
@@ -17,20 +13,14 @@ The performance of \gls{dsa} has been evaluated in great detail by Reese Kuper e
\label{fig:perf-xeonmaxnuma}
\end{figure}
The benchmarks were conducted on a dual-socket server equipped with two Intel Xeon Max 9468 CPUs, each with 4 nodes that have access to 16 GiB of \gls{hbm} and 12 cores. This results in a total of 96 cores and 128 GiB of \gls{hbm}. The layout of the system is visualized in Figure \ref{fig:perf-xeonmaxnuma}. For configuring it, we follow Section \ref{sec:state:setup-and-config}. \par
\todo{refer to intel ark as cite}
As \gls{intel:dml} does not have support for \glsentryshort{dsa:dwq}s, we run benchmarks exclusively with access through \glsentryshort{dsa:swq}s. The application written for the benchmarks can be obtained in source form under the directory \texttt{benchmarks} in the thesis repository \cite{thesis-repo}. \par
The benchmarks were conducted on a dual-socket server equipped with two Intel Xeon Max 9468 CPUs, each with 4 \glsentryshort{numa:node}s that have access to 16 GiB of \gls{hbm} and 12 cores. This results in a total of 96 cores and 128 GiB of \gls{hbm}. The layout of the system is visualized in Figure \ref{fig:perf-xeonmaxnuma}. For configuring it, we follow Section \ref{sec:state:setup-and-config}. \cite{intel:xeonmax-ark} \par
\todo{refer back to state 2.4 where this is mentioned}
As \gls{intel:dml} does not have support for \glsentryshort{dsa:dwq}s (see Section \ref{sec:state:dml}), we run benchmarks exclusively with access through \glsentryshort{dsa:swq}s. The application written for the benchmarks can be obtained in source form under the directory \texttt{benchmarks} in the thesis repository \cite{thesis-repo}. \par
The benchmark performs node setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the nodes passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts. We therefore also do not use \enquote{.block\_on\_fault()}, as we did for the memcpy-example in Section \ref{sec:state:dml}. \par
The benchmark performs \gls{numa:node} setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the \gls{numa:node}s passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts. We therefore also do not use \enquote{.block\_on\_fault()}, as we did for the memcpy-example in Section \ref{sec:state:dml}. \par
Timing in the outer loop may display lower throughput than actual. This is the case, should one of the \gls{dsa}s participating in a given task finish earlier than the others. We decided to measure the maximum time and therefore minimum throughput for these cases, as we want the benchmarks to represent the peak achievable for distributing one task over multiple engines and not executing multiple tasks of a disjoint set. As a task can only be considered complete when all subtasks are completed, the minimum throughput represents this scenario. This may give an advantage to the peak CPU throughput benchmark we will reference later on, as it does not have this restriction placed upon it. \par
\todo{mention configuration used, potential for futurework: evaluate multiengine per group performance}
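As a small illustration of the aggregation rule just described, the following sketch derives the reported throughput from per-engine completion times by taking the maximum, i.e. the slowest engine. The helper name and types are illustrative only, not the benchmark's actual code.

```cpp
// A split task is complete only when its slowest engine finishes, so the
// reported (minimum) throughput is derived from the maximum per-engine time.
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

double task_throughput_gibps(const std::vector<std::chrono::nanoseconds>& engine_times,
                             std::size_t total_bytes) {
    // engine_times is assumed non-empty: one entry per participating DSA.
    const auto slowest = *std::max_element(engine_times.begin(), engine_times.end());
    const double seconds = std::chrono::duration<double>(slowest).count();
    return (static_cast<double>(total_bytes) / seconds) / (1024.0 * 1024.0 * 1024.0);
}
```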
\begin{figure}[!t]
\centering
\includegraphics[width=0.35\textwidth]{images/nsd-benchmark.pdf}
@@ -38,9 +28,7 @@ Timing in the outer loop may display lower throughput than actual. This is the c
\label{fig:benchmark-function:outer}
\end{figure}
To get accurate results, the benchmark is repeated 10 times. Each iteration is timed from beginning to end, marked by yellow in Figure \ref{fig:benchmark-function:outer}. For small task sizes, the iterations complete in a very short amount of time, which can have adverse effects on the results. Therefore, we repeat the code of the inner loop for a configurable amount, virtually extending the duration of a single iteration for these cases. \par
\todo{use concrete amount instead of "configurable"}
To get accurate results, the benchmark is repeated \(10\) times. Each iteration is timed from beginning to end, marked by yellow in Figure \ref{fig:benchmark-function:outer}. For small task sizes, the iterations complete in a very short amount of time, which can have adverse effects on the results. Therefore, we repeat the code of the inner loop a configurable number of times, virtually extending the duration of a single iteration for these cases. The chosen internal repetition count is \(10,000\) for transfers in the range of \(1-8\ KiB\), \(1,000\) for \(1\ MiB\), and one for larger instances. \par
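The repetition scheme can be summarized by the following sketch of one timed outer iteration. The counts for 1-8 KiB, 1 MiB and larger transfers are taken from the text; the value used for sizes in between is an assumption, and submit_and_wait() is a placeholder rather than the benchmark's real copy routine.

```cpp
// Structure of one timed outer iteration with the stated inner repetition counts.
#include <chrono>
#include <cstddef>

void submit_and_wait(std::size_t size);  // placeholder for the actual copy task

std::size_t repetitions_for(std::size_t size) {
    if (size <= 8 * 1024) return 10000;      // 1 KiB - 8 KiB (from the text)
    if (size <= 1024 * 1024) return 1000;    // 1 MiB (intermediate sizes assumed)
    return 1;                                // larger transfers
}

std::chrono::nanoseconds timed_iteration(std::size_t size) {
    const auto begin = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < repetitions_for(size); ++i) {
        submit_and_wait(size);  // inner loop, repeated to stretch short iterations
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin);
}
```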
\begin{figure}[!t]
\centering
@@ -68,52 +56,42 @@ We anticipate that single submissions will consistently yield poorer performance
\begin{figure}[!t]
\centering
\includegraphics[width=0.5\textwidth]{images/plot-submitmethod.pdf}
\caption{Throughput for different Submission Methods and Sizes. Performing a copy with source and destination being node 0, executed by the \glsentryshort{dsa} on node 0. Observable is the submission cost which affects small transfer sizes differently, as there the completion time is lower.}
\caption{Throughput for different Submission Methods and Sizes. Performing a copy with source and destination being \glsentryshort{numa:node} 0, executed by the \glsentryshort{dsa} on \glsentryshort{numa:node} 0. Observable is the submission cost which affects small transfer sizes differently, as there the completion time is lower.}
\label{fig:perf-submitmethod}
\end{figure}
In Figure \ref{fig:perf-submitmethod} we conclude that with transfers of 1 MiB and upwards, the submission method makes no noticeable difference. For smaller transfers the performance varies greatly, with batch operations leading in throughput. Reese Kuper et al. observed that \enquote{SWQ observes lower throughput between 1-8 KB [transfer size]} \cite[pp. 6]{intel:analysis}. We however observe a much higher point of equalization, pointing to additional delays introduced by programming the \gls{dsa} through \gls{intel:dml}. \par
\todo{submission method still makes a difference for 1mib, therefore recommend even larger transfers}
Another limitation may be observed in this result, namely the inherent throughput limit per \gls{dsa} chip of close to 30 GiB/s. This is apparently caused by I/O fabric limitations \cite[p. 5]{intel:analysis}. \par
In Figure \ref{fig:perf-submitmethod} we conclude that with transfers of \(1\ MiB\) and upwards, the cost of single submission drops. As a slight difference remains, even larger transfer sizes are recommended. For smaller transfers the performance varies greatly, with batch operations leading in throughput. Reese Kuper et al. observed that \enquote{SWQ observes lower throughput between \(1-8\ KB\) [transfer size]} \cite[pp. 6]{intel:analysis}. We however observe a much higher point of equalization, pointing to additional delays introduced by programming the \gls{dsa} through \gls{intel:dml}. Another limitation may be observed in this result, namely the inherent throughput limit per \gls{dsa} chip of close to \(30\ GiB/s\). This is apparently caused by I/O fabric limitations \cite[p. 5]{intel:analysis}. \par
\subsection{Multithreaded Submission}
\label{subsec:perf:mtsubmit}
As we might encounter access to one \gls{dsa} from multiple threads through the associated \glsentrylong{dsa:swq}, understanding the impact of this type of access is crucial. We benchmark multithreaded submission for one, two, and twelve threads, with the latter representing the core count of one processing sub-node on the test system. We spawn multiple threads, all submitting to one \gls{dsa}. Furthermore, we perform this benchmark with sizes of 1 MiB and 1 GiB to examine, if the behaviour changes with submission size. For smaller sizes, the completion time may be faster than submission time, leading to potentially different effects of threading due to the fact that multiple threads work to fill the queue, preventing task starvation. We may also experience lower-than-peak throughput with rising thread count, caused by the synchronization inherent with \gls{dsa:swq}. \par
As we might encounter access to one \gls{dsa} from multiple threads through the associated \glsentrylong{dsa:swq}, understanding the impact of this type of access is crucial. We benchmark multithreaded submission for one, two, and twelve threads, with the latter representing the core count of one processing sub-node on the test system. We spawn multiple threads, all submitting to one \gls{dsa}. Furthermore, we perform this benchmark with sizes of \(1\ MiB\) and \(1\ GiB\) to examine whether the behaviour changes with submission size. For smaller sizes, the completion time may be faster than submission time, leading to potentially different effects of threading, because multiple threads work to fill the queue, preventing task starvation. We may also experience lower-than-peak throughput with rising thread count, caused by the synchronization inherent with \gls{dsa:swq}. \par
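For illustration, the access pattern under test might look like the following sketch, in which several threads independently submit work targeting the same DSA through its shared work queue. The even chunking of the buffer, the parameters and the helper offloaded_copy (the DML sketch shown earlier) are placeholders, not the benchmark's actual code.

```cpp
// Several threads submitting independently to the same DSA via its shared work queue.
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// From the DML sketch shown earlier (pins to the DSA's node and copies via DML).
bool offloaded_copy(uint8_t* src, uint8_t* dst, std::size_t size, int node);

void submit_from_threads(uint8_t* src, uint8_t* dst, std::size_t size,
                         unsigned thread_count, int dsa_node) {
    std::vector<std::thread> workers;
    const std::size_t chunk = size / thread_count;
    for (unsigned t = 0; t < thread_count; ++t) {
        workers.emplace_back([=] {
            // All threads target the DSA on dsa_node through its shared work queue.
            offloaded_copy(src + t * chunk, dst + t * chunk, chunk, dsa_node);
        });
    }
    for (auto& w : workers) w.join();
}
```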
\begin{figure}[!t]
\centering
\includegraphics[width=0.5\textwidth]{images/plot-mtsubmit.pdf}
\caption{Throughput for different Thread Counts and Sizes. Multiple threads submit to the same Shared Work Queue. Performing a copy with source and destination being node 0, executed by the DSA on node 0.}
\caption{Throughput for different Thread Counts and Sizes. Multiple threads submit to the same Shared Work Queue. Performing a copy with source and destination being \glsentryshort{numa:node} 0, executed by the DSA on \glsentryshort{numa:node} 0.}
\label{fig:perf-mtsubmit}
\end{figure}
In Figure \ref{fig:perf-mtsubmit}, we note that threading has no discernible negative impact. The synchronization appears to affect single-threaded access in the same manner as it does for multiple threads. Interestingly, for the smaller size of 1 MiB, our assumption proved accurate, and performance increased with the addition of threads, which we attribute to enhanced queue usage. We ascribe the higher throughput observed with 1 GiB to the submission delay which is incurred more frequently with lower transfer sizes. \par
In Figure \ref{fig:perf-mtsubmit}, we note that threading has no discernible negative impact. The synchronization appears to affect single-threaded access in the same manner as it does for multiple threads. Interestingly, for the smaller size of \(1\ MiB\), our assumption proved accurate, and performance increased with the addition of threads, which we attribute to enhanced queue usage. We ascribe the higher throughput observed with \(1\ GiB\) to the submission delay which is incurred more frequently with lower transfer sizes. \par
\subsection{Data Movement from \glsentryshort{dram} to \glsentryshort{hbm}}
\label{subsec:perf:datacopy}
Moving data from \gls{dram} to \gls{hbm} is most relevant to the rest of this work, as it is the target application. As we discovered in Section \ref{subsec:perf:submitmethod}, one \gls{dsa} has a peak bandwidth limit of 30 GiB/s. For each node, the test system is configured with two DIMMs of DDR5-4800. \par
The naming scheme contains the data rate in Megatransfers per second. We calculate the transfers performed per second. \cite{kingston:ddr5-spec-overview} \par
Moving data from \glsentryshort{dram} to \gls{hbm} is most relevant to the rest of this work, as it is the target application. With \gls{hbm} offering higher bandwidth than the \glsentryshort{dram} of our system, we will be restricted by the available bandwidth of the source. To determine the upper limit achievable, we must calculate the available peak bandwidth. For each \gls{numa:node}, the test system is configured with two DIMMs of DDR5-4800. The naming scheme contains the data rate in Megatransfers per second; however, the processor specification notes that, for dual channel operation, the maximum supported speed drops to \(4400\ MT/s\) \cite{intel:xeonmax-ark}. We calculate the transfers performed per second for one \gls{numa:node}, followed by the bytes per transfer \cite{kingston:ddr5-spec-overview}, and finally combine the two into the theoretical peak bandwidth per \gls{numa:node} on the system. \par
\[2\ DIMM * \frac{4800\ MT}{s\ *\ DIMM} = 9600\ MT/s\]
\[2\ DIMM \times \frac{4400\ MT}{s\ \times\ DIMM} = 8800\ MT/s\]
The data width of DDR5 is 64 bit. We calculate the amount of Bytes per Transfer. \cite{kingston:ddr5-spec-overview} \par
\[\frac{64b}{8b/B}\ /\ T = 8\ B/T\]
\[\frac{64b}{8b/B}\ /\ T = 8B\ /\ T\]
\[8800\ MT/s \times 8B/T = 70400 \times 10^6 B/s = 65.56\ GiB/s\]
Using the results from the previous calculations, we are now able to calculate the theoretical peak throughput speed per Node on our test system. \par
From the observed bandwidth limitation of a single \gls{dsa} situated at about \(30\ GiB/s\) (see Section \ref{subsec:perf:submitmethod}) and the available memory bandwidth of \(65.56 GiB/s\), we conclude that a copy task has to be split across multiple \gls{dsa}s to achieve peak throughput. Different methods of splitting will be evaluated. Given that our system consists of multiple sockets, communication crossing between sockets could introduce latency and bandwidth disadvantages \cite{bench:heterogeneous-communication}, which we will also evaluate. Beyond two \gls{dsa}, marginal gains are to be expected, due to the throughput limitation of the available memory. \par
\[9600\ MT/s * 8B/T = 75\ GiB/s\]
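For reference, the peak-bandwidth calculation above expressed as a small program, using only the figures from the text: two DIMMs per node at 4400 MT/s in dual-channel operation and 8 bytes per transfer.

```cpp
// Peak-bandwidth calculation per NUMA node, as derived above:
// 2 DIMMs x 4400 MT/s x 8 B/T = 70400 * 10^6 B/s, i.e. roughly 65.56 GiB/s.
#include <cstdio>

int main() {
    const double transfers_per_s = 2 * 4400e6;   // 8800 * 10^6 T/s per node
    const double bytes_per_t     = 64.0 / 8.0;   // DDR5 data width: 64 bit = 8 B
    const double bytes_per_s     = transfers_per_s * bytes_per_t;
    const double gib_per_s       = bytes_per_s / (1024.0 * 1024.0 * 1024.0);
    std::printf("%.1f GiB/s per node\n", gib_per_s);  // ~65.6 GiB/s, matching the figure above
    return 0;
}
```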
To determine the optimal amount of \gls{dsa}s, we will measure throughput for one, two, four, and eight participating in the copy operations. We name the utilization of two \gls{dsa}s \enquote{Push-Pull}, as with two accelerators, we utilize the ones found on data source and destination \gls{numa:node}. As eight \gls{dsa}s is the maximum available on our system, this configuration will be referred to as \enquote{brute-force}. \par
We conclude that to achieve peak speeds, a copy task has to be split across multiple \gls{dsa}s \todo{mention why we conclude this}. Different methods of splitting will be evaluated. Given that our system consists of multiple sockets, communication crossing between sockets could introduce latency and bandwidth disadvantages \cite{bench:heterogeneous-communication}, which we will also evaluate. \par
To determine the optimal amount of \gls{dsa}s, we will measure throughput for one, two, four, and eight participating in the copy operations. We name the utilization of two \gls{dsa}s \enquote{Push-Pull}, as with two accelerators, we utilise the ones found on data source and destination node. As eight \gls{dsa}s is the maximum available on our system, this configuration will be referred to as \enquote{brute-force}. \par
For this benchmark, we transfer 1 Gibibyte of data from node 0 to the destination node. We present data for nodes 8, 11, 12, and 15. To understand the selection, see Figure \ref{fig:perf-xeonmaxnuma}, which illustrates the node IDs of the configured systems and the corresponding storage technology. Node 8 accesses the \gls{hbm} on node 0, making it the physically closest possible destination. Node 11 is located diagonally on the chip, representing the farthest intra-socket operation benchmarked. Nodes 12 and 15 lie diagonally on the second socket's CPU, making them representative of inter-socket transfer operations. \par
For this benchmark, we transfer \(1\ GiB\) of data from \gls{numa:node} 0 to the destination \gls{numa:node}. We present data for \gls{numa:node}s 8, 11, 12, and 15. To understand the selection, see Figure \ref{fig:perf-xeonmaxnuma}, which illustrates the \gls{numa:node} IDs of the configured system and the corresponding storage technology. \gls{numa:node} 8 accesses the \gls{hbm} on \gls{numa:node} 0, making it the physically closest possible destination. \gls{numa:node} 11 is located diagonally on the chip, representing the farthest intra-socket operation benchmarked. \gls{numa:node}s 12 and 15 lie diagonally on the second socket's CPU, making them representative of inter-socket transfer operations. \par
\begin{figure}[!t]
\centering
@@ -148,39 +126,41 @@ For this benchmark, we transfer 1 Gibibyte of data from node 0 to the destinatio
\label{fig:perf-dsa}
\end{figure}
We begin by examining the common behaviour of load balancing techniques depicted in Figure \ref{fig:perf-dsa}. The real-world peak throughput approaches nearly 64 GiB/s, aligning with the maximum achievable with the CPU, as demonstrated in Section \ref{subsec:perf:cpu-datacopy}. In Figure \ref{fig:perf-dsa:1}, a notable hard bandwidth limit is observed, just below the 30 GiB/s mark, reinforcing what was encountered in Section \ref{subsec:perf:submitmethod}: a single \gls{dsa} is constrained by I/O-Fabric limitations. \par
We begin by examining the common behaviour of load balancing techniques depicted in Figure \ref{fig:perf-dsa}. The real-world peak throughput of \(64\ GiB/s\) approaches the calculated available bandwidth. In Figure \ref{fig:perf-dsa:1}, a notable hard bandwidth limit is observed, just below the \(30\ GiB/s\) mark, reinforcing what was encountered in Section \ref{subsec:perf:submitmethod}: a single \gls{dsa} is constrained by I/O-Fabric limitations. \par
Unexpected throughput differences are evident for all configurations, except the bandwidth-bound single \gls{dsa}. Notably, \gls{numa:node} 8 performs worse than copying to \gls{numa:node} 11. As \gls{numa:node} 8 serves as the \gls{hbm} accessor for the data source \gls{numa:node}, it should have the shortest data path. This suggests that the \gls{dsa} may suffer from sharing parts of the data path for reading and writing. Another interesting observation is that, contrary to our assumption, the physically more distant \gls{numa:node} 15 achieves higher throughput than the closer \gls{numa:node} 12. We lack an explanation for this anomaly and will further examine this behaviour in the analysis of the CPU throughput results in Section \ref{subsec:perf:cpu-datacopy}. \par
For the results of the Brute-Force approach illustrated in Figure \ref{fig:perf-dsa:8}, we observe peak speeds when copying across sockets from \gls{numa:node} 0 to \gls{numa:node} 15. This contradicts our assumption that peak bandwidth would be limited by the interconnect. However, for intra-node copies, there is an observable penalty for using the off-socket \gls{dsa}s. We will analyse this behaviour by comparing the different benchmarked configurations and summarize our findings on scalability. \par
\todo{ensure consistent usage of gls for node}
\begin{figure}[!t]
\centering
\begin{subfigure}{0.225\textwidth}
\begin{subfigure}[t]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-average-throughput.pdf}
\caption{Average Throughput for different amounts of participating \gls{dsa}.}
\label{fig:perf-dsa-analysis:average}
\end{subfigure}
\hspace{5mm}
\begin{subfigure}{0.55\textwidth}
\begin{subfigure}[t]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-dsa-throughput-scaling.pdf}
\caption{Scaling Factor for different amounts of participating \gls{dsa}. Determined by formula \\ \(\frac{Throughput}{Basline\ Throughput} * \frac{1}{Utilization\ Factor}\) \\ with the baseline being Throughput for 1 \gls{dsa} and the utilization factor representing the factor of the amount of \gls{dsa}s being used over the baseline.}
\caption{Scaling Factor for different amounts of participating \gls{dsa}. Determined by formula \(Throughput\ /\ Baseline \times 1\ /\ Utilization\) with the Baseline being Throughput for one \gls{dsa} and the Utilization representing the factor of the amount of \gls{dsa}s being used over the baseline.}
\label{fig:perf-dsa-analysis:scaling}
\end{subfigure}
\caption{Scalability Analysis for different amounts of participating \gls{dsa}s. Displays the average throughput and the derived scaling factor. Shows that, although the throughput does increase with adding more accelerators, beyond two, the gained speed drops significantly. Calculated over the results from Figure \ref{fig:perf-dsa} and therefore applies to copies from \glsentryshort{dram} to \glsentryshort{hbm}.}
\label{fig:perf-dsa-analysis}
\end{figure}
\todo{tp scaling shows assumption: limited benefit of adding more than 2 dsa}
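The scaling factor from the caption above can be restated as a short program. The single-DSA baseline of about 30 GiB/s and the 64 GiB/s peak come from the text; the 2- and 4-DSA throughputs below are placeholders for illustration, not measured results.

```cpp
// Scaling factor as defined in the caption above:
// (throughput / baseline) * (1 / utilization), with the baseline being the
// single-DSA throughput and utilization the number of DSAs used over one.
#include <cstdio>

double scaling_factor(double throughput_gibps, double baseline_gibps, int dsa_count) {
    return (throughput_gibps / baseline_gibps) * (1.0 / dsa_count);
}

int main() {
    const double baseline = 30.0;                      // ~30 GiB/s single-DSA limit (see above)
    const int    dsas[]   = {1, 2, 4, 8};
    const double tput[]   = {30.0, 55.0, 60.0, 64.0};  // 64 GiB/s peak from the text; middle values assumed
    for (int i = 0; i < 4; ++i) {
        std::printf("%d DSA(s): scaling factor %.2f\n",
                    dsas[i], scaling_factor(tput[i], baseline, dsas[i]));
    }
    return 0;
}
```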
When comparing the Brute-Force approach with Push-Pull in Figure \ref{fig:perf-dsa:2}, performance decreases by utilizing four times more resources over a longer duration. As shown in Figure \ref{fig:perf-dsa-analysis:scaling}, using Brute-Force still leads to a slight increase in overall throughput, although far from scaling linearly. Therefore, we conclude that, although data movement across the interconnect incurs additional cost, no hard bandwidth limit is observable. \todo{we state that it decreases but then that it increases} \par
From the average throughput and scaling factors in Figure \ref{fig:perf-dsa-analysis}, it becomes evident that splitting tasks over more than two \gls{dsa}s yields only marginal gains. This could be due to increased congestion of the overall interconnect, however, as no hard limit is encountered, this is not a definitive answer. \par
The choice of a load balancing method is not trivial. If peak throughput of one task is of relevance, Brute-Force for inter-socket and four \gls{dsa}s for intra-socket operation result in the fastest transfers. At the same time, these cause high system utilization, making them unsuitable for situations where multiple tasks may be submitted. For these cases, Push-Pull achieves performance close to the real-world peak while also not wasting resources due to poor scaling — see Figure \ref{fig:perf-dsa-analysis:scaling}. \par
\todo{dont choose brute here, 4dsa is best}
\subsection{Data Movement using CPU}
\label{subsec:perf:cpu-datacopy}
@@ -194,7 +174,7 @@ For evaluating CPU copy performance we use the benchmark code from the previous
\caption{DML code for allnodes running on software path.}
\label{fig:perf-cpu:swpath}
\end{subfigure}
\hspace{8mm}
\hspace{5mm}
\begin{subfigure}[t]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-andrepeak-throughput.pdf}
@@ -217,7 +197,7 @@ In Figure \ref{fig:perf-cpu:andrepeak}, peak throughput is achieved for intra-no
In this section we summarize the conclusions drawn from the three benchmarks performed in the sections above and outline a utilization guideline \todo{we dont do this, either write it or leave this out}. We also compare CPU and \gls{dsa} for the task of copying data from \gls{dram} to \gls{hbm} \todo{weird wording}. \par
\begin{itemize}
\item From \ref{subsec:perf:submitmethod} we conclude that small copies under 1 MiB in size require batching and still do not reach peak performance. Task size should therefore be at or above 1 MiB and otherwise use the CPU \todo{why otherwise cpu, no direct link given}.
\item From \ref{subsec:perf:submitmethod} we conclude that small copies under \(1\ MiB\) in size require batching and still do not reach peak performance. Task size should therefore be at or above \(1\ MiB\) and otherwise use the CPU \todo{why otherwise cpu, no direct link given}.
\item Section \ref{subsec:perf:mtsubmit} assures that access from multiple threads does not negatively affect the performance when using \glsentrylong{dsa:swq} for work submission. Due to the lack of \glsentrylong{dsa:dwq} support, we have no data to determine the cost of submission to the \gls{dsa:swq}.
\item In \ref{subsec:perf:datacopy}, we found that using more than two \gls{dsa}s results in only marginal gains. The choice of a load balancer therefore is the Push-Pull configuration, as it achieves fair throughput with low utilization.
\end{itemize}
@@ -226,8 +206,6 @@ Once again, we refer to Figures \ref{fig:perf-dsa} and \ref{fig:perf-cpu}, both
We discovered an anomaly for \gls{numa:node} 12 for which we did not find an explanation. As the behaviour is also exhibited by the CPU, discovering the root issue falls outside the scope of this work. \par
Scaling was found to be less than linear. No conclusive answer was given for this either. We assumed that this happens due to increased congestion of the interconnect. \par
\todo{mention measures undertaken to find an explanation here}
Even though we could not find an explanation for all measurements, this chapter still gives insight into the performance of the \gls{dsa}, its strengths and its weaknesses. It provides data-driven guidance on a complex architecture, helping to find the optimum for applying the \gls{dsa} to our expected and possibly different workloads. \par

4
thesis/content/60_evaluation.tex

@@ -15,7 +15,7 @@ In this chapter, we establish anticipated outcomes for incorporating the develop
\section{Benchmarked Task}
\label{sec:eval:bench}
The benchmark involves the execution of a simple query, as depicted in Figure \ref{fig:qdp-simple-query}. We will henceforth denote \(SCAN_a\) as the pipeline responsible for scanning and subsequently filtering column \texttt{a}, \(SCAN_b\) as the pipeline tasked with prefetching column \texttt{b} and \(AGGREGATE\) as the projection and final summation step. The column size utilized is set at 4 GiB. The workload is distributed across multiple groups, with each group spawning threads for every pipeline step. To ensure equitable comparison, each tested configuration employs 64 threads for the initial stage (\(SCAN_a\) and \(SCAN_b\)) and 32 subsequently (\(AGGREGATE\)), while being constrained to execute on \gls{numa:node} 0 through pinning. For configurations without prefetching, \(SCAN_b\) is omitted. We measure total and per-pipeline duration and cache hit percentage for prefetching for 5 iterations with 5 previous warm-up runs, and form the average. \par
The benchmark involves the execution of a simple query, as depicted in Figure \ref{fig:qdp-simple-query}. We will henceforth denote \(SCAN_a\) as the pipeline responsible for scanning and subsequently filtering column \texttt{a}, \(SCAN_b\) as the pipeline tasked with prefetching column \texttt{b} and \(AGGREGATE\) as the projection and final summation step. The column size utilized is set at \(4\ GiB\). The workload is distributed across multiple groups, with each group spawning threads for every pipeline step. To ensure equitable comparison, each tested configuration employs 64 threads for the initial stage (\(SCAN_a\) and \(SCAN_b\)) and 32 subsequently (\(AGGREGATE\)), while being constrained to execute on \gls{numa:node} 0 through pinning. For configurations without prefetching, \(SCAN_b\) is omitted. We measure total and per-pipeline duration and cache hit percentage for prefetching for 5 iterations with 5 previous warm-up runs, and form the average. \par
The pipelines \(SCAN_a\) and \(SCAN_b\) execute concurrently, completing their tasks before signalling \(AGGREGATE\) for finalization. In a bid to enhance the cache hit rate, we opted to relax this constraint, allowing \(SCAN_b\) to operate independently, while only synchronizing \(SCAN_a\) with \(AGGREGATE\). Consequently, work is submitted to the \gls{dsa} as frequently as possible, aiming to complete caching operations for a chunk of \texttt{b} before \(SCAN_a\) finalizes processing the corresponding part of \texttt{a}. This burst-submission could cause the \gls{dsa}s work queue to overrun, leading us to increase the size of the blocks for the benchmarks utilizing \gls{qdp} to avoid this. \par
@@ -107,7 +107,7 @@ In Section \ref{sec:eval:expectations}, we anticipated that the simple query wou
The necessity to distribute data across \gls{numa:node}s is seen as practical, given that developers commonly apply this optimization to leverage the available memory bandwidth of \glsentrylong{numa}s. Consequently, the \texttt{Cache} has demonstrated its effectiveness by achieving a respectable speed-up positioned directly between the baseline and the theoretical upper limit (see Table \ref{table:qdp-speedup}). \par
As stated in Section \ref{sec:design:cache}, the decision to design and implement a cache instead of focusing solely on prefetching was made to enhance the usefulness of this work's contribution. While our tests were conducted on a system with \gls{hbm}, other advancements in main memory technologies, such as Non-Volatile or Remote Memory, as mentioned in Chapter \ref{chap:intro}, were not considered. Despite the public functions of the \texttt{Cache} being named with cache usage in mind, its utility extends beyond this scope, providing flexibility through the policy functions, described in Section \ref{sec:design:accel-usage}. Potential applications include background copying of data from remote locations into faster local memory for computation or replication to non-volatile memory for data loss prevention. Therefore, we consider the increase in design complexity to be a worthwhile trade-off, providing a significant contribution to the field of heterogeneous memory systems. \par
As stated in Section \ref{sec:design:cache}, the decision to design and implement a cache instead of focusing solely on prefetching was made to enhance the usefulness of this work's contribution. While our tests were conducted on a system with \gls{hbm}, other advancements in main memory technologies, such as \gls{nvram} or \gls{remotemem}, as mentioned in Chapter \ref{chap:intro}, were not considered. Despite the public functions of the \texttt{Cache} being named with cache usage in mind, its utility extends beyond this scope, providing flexibility through the policy functions, described in Section \ref{sec:design:accel-usage}. Potential applications include background copying of data from remote locations into faster local memory for computation or replication to \gls{nvram} for data loss prevention. Therefore, we consider the increase in design complexity to be a worthwhile trade-off, providing a significant contribution to the field of heterogeneous memory systems. \par
%%% Local Variables:

2
thesis/content/70_conclusion.tex

@@ -24,7 +24,7 @@ As highlighted in Sections \ref{sec:state:dml} and \ref{sec:impl:application}, t
As highlighted in Sections \ref{sec:state:dml} and \ref{sec:impl:application}, the \gls{api} utilized to interact with the \gls{dsa} currently lacks support for interrupt-based completion waiting and the use of \glsentrylong{dsa:dwq}. Future development efforts may focus on direct \gls{dsa} access, bypassing the \glsentrylong{intel:dml}, to leverage the complete feature set. Particularly, interrupt-based waiting would significantly enhance the usability of the \texttt{Cache}, which currently only supports busy-waiting\todo{mention that busy waiting goes against the rationale of offloading data copy, only usefull to reduce power consumption for sync-copy, cite dsaanalysis for this}. \todo{we could also mention dwq to reduce overhead of cache and therefore time spent in scanb} \par
Although the preceding paragraphs and the results in Chapter \ref{chap:evaluation} might suggest that the \texttt{Cache} requires extensive refinement for production applications, we argue the opposite. Under favourable conditions, as assumed for \glsentryshort{numa}-aware applications, we observed significant speed-up using the \texttt{Cache} for prefetching to \glsentrylong{hbm}, accelerating database queries. Its utility is not limited to prefetching alone; it offers a solution for handling transfers from remote memory to faster local storage or replicating data to Non-Volatile RAM. Further evaluation of the caches effectiveness in these scenarios and additional benchmarks on more complex queries for \gls{qdp} may provide deeper insights into its applicability. \par
Although the preceding paragraphs and the results in Chapter \ref{chap:evaluation} might suggest that the \texttt{Cache} requires extensive refinement for production applications, we argue the opposite. Under favourable conditions, as assumed for \glsentryshort{numa}-aware applications, we observed significant speed-up using the \texttt{Cache} for prefetching to \glsentrylong{hbm}, accelerating database queries. Its utility is not limited to prefetching alone; it offers a solution for handling transfers from \gls{remotemem} to faster local storage or replicating data to \gls{nvram}. Further evaluation of the cache's effectiveness in these scenarios and additional benchmarks on more complex queries for \gls{qdp} may provide deeper insights into its applicability. A performance comparison between prefetching to \gls{hbm} using knowledge of the coming queries and the data they access, and \enquote{HBM Cache Mode} (see Section \ref{sec:state:hbm}) could also yield interesting insights. \par
In conclusion, the \texttt{Cache}, together with the Sections on the \gls{dsa}'s architecture (Section \ref{sec:state:dsa}) and performance characteristics (Section \ref{sec:perf:bench}), fulfil the stated goal of this work. We have achieved performance gains through the \gls{dsa} in \gls{qdp}, thereby demonstrating its potential to facilitate the exploitation of the properties offered by the various storage tiers in heterogeneous memory systems. \par

BIN
thesis/images/hbm-design-layout.png

(new image: 1260 × 709 px, 221 KiB)

BIN
thesis/images/plot-1dsa-throughput.pdf

BIN
thesis/images/plot-2dsa-throughput.pdf

BIN
thesis/images/plot-4dsa-throughput.pdf

BIN
thesis/images/plot-8cpu-throughput.pdf

BIN
thesis/images/plot-8dsa-throughput.pdf

BIN
thesis/images/plot-andrepeak-throughput.pdf

BIN
thesis/images/plot-average-throughput.pdf

BIN
thesis/images/plot-dsa-throughput-scaling.pdf

27
thesis/own.bib

@@ -234,3 +234,30 @@
howpublished = {\url{https://timur.audio/dwcas-in-c}},
urldate = {2024-02-07}
}
@misc{amd:hbmoverview,
author = {AMD},
title = {{High-Bandwidth Memory (HBM)}},
urldate = {2024-02-14},
howpublished = {\url{https://www.amd.com/system/files/documents/high-bandwidth-memory-hbm.pdf}}
}
@article{virtual-memory,
author = {Peter J. Denning},
title = {{Virtual Memory}},
date = {1996-03},
publisher = {Association for Computing Machinery},
volume = {28},
number = {1},
doi = {10.1145/234313.234403},
journal = {ACM Computing Surveys},
pages = {213–216},
numpages = {4}
}
@misc{intel:xeonmax-ark,
author = {Intel},
title = {{Intel® Xeon® CPU Max 9468 Processor}},
urldate = {2024-02-14},
howpublished = {\url{https://ark.intel.com/content/www/us/en/ark/products/232596/intel-xeon-cpu-max-9468-processor-105m-cache-2-10-ghz.html}}
}

17
thesis/own.gls

@ -135,7 +135,8 @@
}
\newglossaryentry{mempress}{
name={memory pressure},
short={memory pressure},
name={Memory Pressure},
description={... desc ...}
}
@@ -145,4 +146,18 @@
long={Application Programming Interface},
first={Application Programming Interface (API)},
description={... desc ...}
}
\newglossaryentry{remotemem}{
short={Remote Memory},
name={Remote Memory},
description={... desc ...}
}
\newglossaryentry{nvram}{
short={NVRAM},
name={NVRAM},
long={Non-Volatile RAM},
first={Non-Volatile RAM (NVRAM)},
description={... desc ...}
}