This bachelor's thesis explores data locality in heterogeneous memory systems, which have emerged from advancements in main memory technologies such as Non-Volatile RAM and High Bandwidth Memory. Systems equipped with more than one type of main memory or employing a Non-Uniform Memory Architecture necessitate strategic decisions regarding data placement to take advantage of the properties of the different storage tiers. In response to this challenge, Intel has introduced the Data Streaming Accelerator (DSA), which offloads data operations, offering a potential avenue for enhancing efficiency in data-intensive applications. The primary objective of this thesis is to provide a comprehensive analysis and characterization of the architecture and performance of the DSA, along with its application to a domain-specific prefetching methodology aimed at accelerating database queries within heterogeneous memory systems. We contribute a versatile library, capable of performing caching, data replication and prefetching asynchronously, accelerated by the DSA.
The proliferation of various technologies, such as \gls{nvram} and \gls{hbm}, has ushered in a diverse landscape of systems characterized by varying tiers of main memory. Extending traditional \gls{numa}, these systems necessitate the movement of data across memory classes and locations to leverage the distinct properties offered by the available technologies. The responsibility for maintaining optimal data placement falls upon the CPU, resulting in a reduction of available cycles for computational tasks. To mitigate this strain, certain current-generation Intel server processors feature the \glsentryfirst{dsa}, to which certain data operations may be offloaded \cite{intel:xeonbrief}. This work undertakes the challenge of optimizing data locality in heterogeneous memory architectures, utilizing the \gls{dsa}. \par
The primary objective of this thesis is twofold. Firstly, it involves a comprehensive analysis and characterization of the architecture of the Intel \gls{dsa}. Secondly, the focus extends to the application of \gls{dsa} in the domain-specific context of \gls{qdp} to accelerate database queries \cite{dimes-prefetching}. \par
We make several significant contributions in this work. Notably, the design and implementation of an offloading cache represent a key highlight, providing an interface for leveraging the strengths of tiered storage with minimal integration efforts. The development efforts resulted in an architecture applicable to any use case requiring \gls{numa}-aware data movement with offloading support to the \gls{dsa}. Additionally, the thesis includes a detailed examination and analysis of the strengths and weaknesses of the \gls{dsa} through microbenchmarks. These benchmarks serve as practical guidelines, offering insights for the optimal application of \gls{dsa} in various scenarios. To our knowledge, this thesis stands as the first scientific work to extensively evaluate the \gls{dsa} in a multi-socket system, provide benchmarks for programming through the \glsentryfirst{intel:dml}, and evaluate performance for data movement from \glsentryshort{dram} to \gls{hbm}. \par
We begin the work by furnishing the reader with pertinent technical information necessary to understand the subsequent sections of this work in Chapter \ref{chap:state}. Background is given for \gls{hbm} and \gls{qdp}, followed by a detailed account of the \gls{dsa}'s architecture along with an available programming interface. Additionally, guidance on system setup and configuration is provided. Subsequently, Chapter \ref{chap:perf} analyses the strengths and weaknesses of the \gls{dsa} through microbenchmarks. Each benchmark is elaborated upon in detail, and usage guidance is drawn from the results. Chapters \ref{chap:design} and \ref{chap:implementation} elucidate the practical aspects of the work, including the development of the interface and implementation of the cache, shedding light on specific design considerations and implementation challenges. We comprehensively assess the implemented solution by providing concrete data on gains for an exemplary database query in Chapter \ref{chap:evaluation}. Finally, Chapter \ref{chap:conclusion} reflects on the insights gained and presents a review of the contributions and results of the preceding chapters. \par
This chapter introduces the technologies and concepts relevant to understanding this thesis. With the goal of applying the \glsentrylong{dsa} to the concept of \glsentrylong{qdp}, we will familiarize ourselves with both. We also give background on \glsentrylong{hbm}, which is a secondary memory technology to the \glsentryshort{dram} used in current computers. This chapter also contains an overview of the \glsentrylong{intel:dml}, used to interact with the \gls{dsa}. \par
\section{\glsentrylong{hbm}}
\label{sec:state:hbm}
\label{fig:hbm-layout}
\end{figure}
\glsentrylong{hbm} is an emerging memory technology that promises an increase in peak bandwidth. As visible in Figure \ref{fig:hbm-layout}, it consists of stacked \glsentryshort{dram} dies, situated above a logic die through which the system accesses the memory \cite[p. 1]{hbm-arch-paper}. \gls{hbm} is gradually being integrated into server processors, with the Intel® Xeon® Max Series \cite{intel:xeonmaxbrief} being one recent example. On these systems, different memory modes can be configured, most notably \enquote{HBM Flat Mode} and \enquote{HBM Cache Mode} \cite{intel:xeonmaxbrief}. The former gives applications direct control, requiring code changes, while the latter utilizes the \gls{hbm} as a cache for the system's \glsentryshort{dram}-based main memory \cite{intel:xeonmaxbrief}. \par
\caption{Illustration of a simple query in (a) and the corresponding pipeline in (b). \cite[Fig. 1]{dimes-prefetching}}
\label{fig:qdp-simple-query}
\end{figure}
\gls{qdp} introduces a targeted strategy for optimizing database performance by intelligently prefetching relevant data. To achieve this, \gls{qdp} analyses queries, splitting them into distinct sub-tasks, resulting in the so-called query execution plan. An example of a query and a corresponding plan is depicted in Figure \ref{fig:qdp-simple-query}. From this plan, \gls{qdp} determines columns in the database used in subsequent tasks. Once identified, the system proactively copies these columns into faster memory to accelerate subsequent operations by reducing the potential memory bottleneck. For the example (Figure \ref{fig:qdp-simple-query}), column \texttt{b} is accessed in \(SCAN_b\) and \(G_{sum(b)}\), while column \texttt{a} is only accessed in \(SCAN_a\). Therefore, only column \texttt{b} will be chosen for prefetching in this scenario. \cite{dimes-prefetching}\par
\gls{qdp} processes tasks in parallel and in chunks, resulting in a high degree of concurrency. This increases demand for processing cycles and memory bandwidth. As prefetching takes place in parallel with query processing, it creates additional CPU load, potentially diminishing gains from the acceleration of subsequent steps through the cached data. Hence, our objective in this work is to offload the underlying copy operations to the \gls{dsa}, reducing the CPU impact and thereby increasing the performance gains offered by prefetching. \cite{dimes-prefetching}\par
\section{\glsentrylong{dsa}}
\label{sec:state:dsa}
Introduced with the \(4^{th}\) generation of Intel Xeon Scalable Processors, the \gls{dsa} aims to relieve the CPU from \enquote{common storage functions and operations such as data integrity checks and deduplication}\cite[p. 4]{intel:xeonbrief}. To fully utilize the hardware, a thorough understanding of its workings is essential. Therefore, we present an overview of the architecture, software, and the interaction of these two components, delving into the architectural details of the \gls{dsa} itself. \par
\caption{\glsentrylong{dsa} Internal Architecture. Shows the components that the chip is made up of, how they are connected and which outside components the \glsentryshort{dsa} communicates with. \cite[Fig. 1 (a)]{intel:analysis}}
\label{fig:dsa-internal-block}
\end{figure}
The \gls{dsa} chip is directly integrated into the processor and attaches via the I/O fabric interface, serving as the conduit for all communication \cite[Sec. 3]{intel:dsaspec}. Through this interface, the \gls{dsa} is accessible and configurable as a PCIe device \cite[Sec. 3.1]{intel:dsaspec}. In a system with multiple processing nodes, there may also be one \gls{dsa} per node, resulting in up to four \gls{dsa} devices per socket in \(4^{th}\) generation Intel Xeon Processors \cite[Sec. 3.1.1]{intel:dsaguide}. To accommodate various use cases, the layout of the \gls{dsa} is software-defined. The structure comprises three components, which we will describe in detail. We also briefly explain how the \gls{dsa} resolves virtual addresses and signals operation completion. Lastly, we detail operation execution ordering. \par
\subsubsection{Architectural Components}
\label{subsec:state:dsa-arch-comp}
\textsc{Component \rom{1}, \glsentrylong{dsa:wq}:} \glsentryshort{dsa:wq}s provide the means to submit tasks to the device and are marked yellow in Figure \ref{fig:dsa-internal-block}. A \gls{dsa:wq} is accessible through so-called portals, light blue in Figure \ref{fig:dsa-internal-block}, which are mapped memory regions to which a descriptor is written, facilitating task submission. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue. There are two possible queue types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes. The method used to achieve this guarantee may result in higher submission cost \cite[Sec. 3.3.1]{intel:dsaspec}, compared to the \gls{dsa:dwq} to which a descriptor is submitted via a regular write \cite[Sec. 3.3.2]{intel:dsaspec}. \par
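To make the distinction between the two submission paths more concrete, the following sketch shows how a 64-byte descriptor could be written to a mapped portal from user space. It is a simplified illustration under assumptions: the descriptor is reduced to an opaque block, the portal pointer is presumed to originate from a mapping of the associated device, and the function names are hypothetical. The instructions themselves, \texttt{ENQCMD} for the \glsentryshort{dsa:swq} and \texttt{MOVDIR64B} for the \glsentryshort{dsa:dwq}, are those named in the specification \cite[Sec. 3.3]{intel:dsaspec}.

\begin{verbatim}
// Simplified sketch of descriptor submission to a portal.
// Compile with -menqcmd -mmovdir64b.
#include <immintrin.h>
#include <cstdint>

struct alignas(64) descriptor { uint8_t bytes[64]; }; // opaque placeholder

// Shared Work Queue: ENQCMD is non-posted and reports whether the
// descriptor was accepted, handling synchronization between multiple
// submitters in hardware. Returns true on acceptance.
bool submit_swq(void* portal, const descriptor& desc) {
    return _enqcmd(portal, &desc) == 0;
}

// Dedicated Work Queue: MOVDIR64B performs a plain 64-byte store to
// the portal, with no acceptance feedback.
void submit_dwq(void* portal, const descriptor& desc) {
    _movdir64b(portal, &desc);
}
\end{verbatim}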
\textsc{Component \rom{2}, Engine:} An Engine is the processing block that connects to memory and performs the described task. To handle the different descriptors, each Engine has two internal execution paths: one for task and one for batch descriptors. Processing a task descriptor is straightforward, as all information required to complete the operation is contained within \cite[Sec. 3.2]{intel:dsaspec}. For a batch, the \gls{dsa} reads the batch descriptor, then fetches all task descriptors from memory and processes them \cite[Sec. 3.8]{intel:dsaspec}. An Engine can coordinate with the operating system in case it encounters a page fault, waiting on its resolution, if configured to do so; otherwise, an error is generated in this scenario \cite[Sec. 2.2, Block on Fault]{intel:dsaspec}. \par
\textsc{Component \rom{3}, Groups:} Groups tie Engines and \glsentrylong{dsa:wq}s together, indicated by the dotted blue line around the components of Group 0 in Figure \ref{fig:dsa-internal-block}. Consequently, tasks from one \gls{dsa:wq} may be processed by one of multiple Engines and vice-versa, depending on the configuration. This flexibility is achieved through the Group Arbiter, represented by the orange block in Figure \ref{fig:dsa-internal-block}, which connects Engines and \glsentrylong{dsa:wq}s according to the user-defined configuration. \par
\subsubsection{Virtual Address Resolution}
\label{subsubsec:state:dsa-vaddr}
An important aspect of computer systems is the abstraction of physical memory addresses through virtual memory \cite{virtual-memory}. Therefore, the \gls{dsa} must handle address translation because a process submitting a task will not know the physical location in memory of its data, causing the descriptor to contain virtual addresses. To resolve these to physical addresses, the Engine communicates with the \glsentrylong{iommu} to perform this operation, as visible in the outward connections at the top of Figure \ref{fig:dsa-internal-block}. Knowledge about the submitting processes is required for this resolution. Therefore, each task descriptor has a field for the \glsentrylong{x86:pasid} which is filled by the instruction used by \gls{dsa:swq} submission \cite[Sec. 3.3.1]{intel:dsaspec} or set statically after a process is attached to a \gls{dsa:dwq}\cite[Sec. 3.3.2]{intel:dsaspec}. \par
\subsubsection{Completion Signalling}
\label{subsubsec:state:completion-signal}
\subsubsection{Ordering Guarantees}
\label{subsubsec:state:ordering-guarantees}
Ordering guarantees enable or restrict the safe submission of tasks to the \gls{dsa} that depend on the completion of previous work on the accelerator. Guarantees are only given for a configuration with one \gls{dsa:wq} and one Engine in a Group when exclusively submitting batch or task descriptors, but not a mixture. Even in such cases, only write-ordering is guaranteed, implying that \enquote{reads by a subsequent descriptor can pass writes from a previous descriptor}. Consequently, when submitting a task mutating the value at address \(A\), followed by one reading from \(A\), the read may pass the write and therefore see the unmodified state of the data at \(A\). Additional challenges arise when an operation fails, as the \gls{dsa} will continue to process the following descriptors from the queue. Caution is therefore necessary in read-after-write scenarios. \par
Operation ordering hazards can be addressed by either waiting for successful completion before submitting the dependent descriptor, inserting a drain descriptor for tasks, or setting the fence flag for a batch. The latter two methods inform the processing engine that all writes must be committed, and in case of the fence in a batch, to abort on previous error. \cite[Sec. 3.9]{intel:dsaspec}\par
\caption{\glsentrylong{dsa} Software View. Illustrating the software stack and internal interactions from user applications, through the driver to the portal for work submission. \cite[Fig. 1 (b)]{intel:analysis}}
\label{fig:dsa-software-arch}
\end{figure}
Lastly, we give an overview of the available software stack for interacting with the \gls{dsa}. Driver support is limited to Linux, where a driver for the \gls{dsa} has been available since kernel version 5.10 \cite[Sec. Installation]{intel:dmldoc}. As a result, we consider accessing the \gls{dsa} under other operating systems infeasible. To interact with the driver and perform configuration operations, Intel provides the accel-config user-space application and library \cite{intel:libaccel-config-repo}. This toolset offers a command-line interface and can read description files to configure the device (see Section \ref{subsection:dsa-hwarch} for information on this configuration), while also facilitating hardware discovery. The interaction is illustrated in the upper block labelled \enquote{User space} in Figure \ref{fig:dsa-software-arch}, where it communicates with the kernel driver, depicted in light green and labelled \enquote{IDXD}. Once successfully configured, each \gls{dsa:wq} is exposed as a character device through \texttt{mmap} of the associated portal \cite[Sec. 3.3]{intel:analysis}. \par
While a process could theoretically submit work to the \gls{dsa} by manually preparing descriptors and submitting them via special instructions, this approach can be cumbersome. Hence, \gls{intel:dml} exists to streamline this process. Despite some limitations, such as the lack of support for \gls{dsa:dwq} submission, this library offers an interface that manages the creation and submission of descriptors, as well as error handling and reporting. The high-level abstraction offered enables compatibility measures, allowing code developed for the \gls{dsa} to also execute on machines without the required hardware \cite[Sec. High-level C++ API, Advanced usage]{intel:dmldoc}. Section \ref{sec:state:dml} provides an overview and example pseudocode for usage of \gls{intel:dml}.\par
\section{Programming Interface for \glsentrylong{dsa}}
\label{sec:state:dml}
As mentioned in Section \ref{subsec:state:dsa-software-view}, \gls{intel:dml} offers a high-level interface for interacting with the hardware accelerator, specifically the Intel \gls{dsa}. Opting for the C++ interface, we will now demonstrate its usage with the example of a simple memcpy implementation for the \gls{dsa}. \par
\caption{\glsentrylong{dml} Memcpy Implementation Pseudocode. Performs copy operation of a block of memory from source to destination. The \glsentryshort{dsa} executing this copy can be selected with the parameter \texttt{node}. The template parameter \texttt{path} selects between hardware offloading (Intel \glsentryshort{dsa}) or software execution (CPU).}
\label{fig:dml-memcpy}
\end{figure}
In the function header of Figure \ref{fig:dml-memcpy} two differences from standard memcpy are notable. Firstly, there is the template parameter named \texttt{path}, and secondly, an additional parameter \texttt{node}. The \texttt{path} allows selection of the executing device, which can be either the CPU or \gls{dsa}. The options include \texttt{dml::software} (CPU), \texttt{dml::hardware} (\gls{dsa}), and \texttt{dml::automatic}, where the latter dynamically selects the device at runtime, favouring \gls{dsa} where available \cite[Sec. Quick Start]{intel:dmldoc}. Choosing the engine which carries out the copy might be advantageous for performance, as we can see in Section \ref{subsec:perf:datacopy}. This can either be achieved by pinning the current thread to the \gls{numa:node} that the device is located on, or by using optional parameters of \texttt{dml::submit}\cite[Sec. High-level C++ API, NUMA support]{intel:dmldoc}. As evident from Figure \ref{fig:dml-memcpy}, we chose the former option for this example, using \texttt{numa\_run\_on\_node} to restrict the current thread to run on the \gls{numa:node} chosen by \texttt{node}. Given that Figure \ref{fig:dml-memcpy} serves as an illustrative example, any potential side effects resulting from the modification of \gls{numa} assignments through the execution of this pseudocode are disregarded. \par
\gls{intel:dml} operates on data views, which we create from the given pointers to source and destination and size \cite[Sec. High-level C++ API, Make view]{intel:dmldoc}. This is done using \texttt{dml::make\_view(uint8\_t* ptr, size\_t size)}, visible in Figure \ref{fig:dml-memcpy}, where these views are labelled \texttt{src\_view} and \texttt{dst\_view}. Following this preparation, we submit a single descriptor using the asynchronous operation from \gls{intel:dml}. For submission, the function \texttt{dml::submit<path>} is used, which takes an operation type and parameters specific to the selected type and returns a handler to the submitted task. For the copy operation, we pass the two views created previously. The provided handler can later be queried for the completion of the operation. After submission, we poll for the task completion with \texttt{handler.get()} and check whether the operation completed successfully. A noteworthy addition to the submission call is the use of \texttt{.block\_on\_fault()}, enabling the \gls{dsa} to manage a page fault by coordinating with the operating system. It is essential to highlight that this functionality only operates if the device is configured to accept this flag. \cite[Sec. High-level C++ API]{intel:dmldoc}\par
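For illustration, a compact sketch of the flow just described could look as follows. The operation and helper names follow the \gls{intel:dml} documentation, but the exact signatures are assumptions and may deviate from the pseudocode shown in Figure \ref{fig:dml-memcpy}:

\begin{verbatim}
#include <dml/dml.hpp>
#include <numa.h>
#include <cstddef>
#include <cstdint>

template <typename path>
bool dml_memcpy(uint8_t* dst, uint8_t* src, size_t size, int node) {
    // pin the calling thread to the NUMA node of the desired DSA
    numa_run_on_node(node);

    auto src_view = dml::make_view(src, size);
    auto dst_view = dml::make_view(dst, size);

    // asynchronous submission; block_on_fault lets the DSA cooperate
    // with the OS on page faults, if the device is configured for it
    auto handler = dml::submit<path>(dml::mem_copy.block_on_fault(),
                                     src_view, dst_view);

    // poll for completion and check the status of the operation
    return handler.get().status == dml::status_code::ok;
}
\end{verbatim}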
\section{System Setup and Configuration}\label{sec:state:setup-and-config}
In this section, we provide a step-by-step guide to replicate the configuration used for benchmarking and testing in the following chapters. While Intel's guide on \gls{dsa} usage was a useful resource, we also consulted articles on setup for Lenovo ThinkSystem Servers for crucial information not present in the former. It is important to note that instructions for configuring the HBM access mode, as mentioned in Section \ref{sec:state:hbm}, may vary from system to system and can require extra steps not covered in the list below. \par
\begin{enumerate}
\item Set \enquote{Memory Hierarchy} to Flat \cite[Sec. Configuring HBM, Configuring Flat Mode]{lenovo:hbm}, \enquote{VT-d} to Enabled in BIOS \cite[Sec. 2.1]{intel:dsaguide}, and, if available, \enquote{Limit CPU PA to 46 bits} to Disabled in BIOS \cite[p. 5]{lenovo:dsa}
\item Use a kernel with IDXD driver support, available from Linux 5.10 or later \cite[Sec. Installation]{intel:dmldoc}, and append the following to the kernel boot parameters in grub config: \texttt{intel\_iommu=on,sm\_on}\cite[p. 5]{lenovo:dsa}
\item Evaluate correct detection of \gls{dsa} devices using \texttt{dmesg | grep idxd} which should list as many devices as NUMA nodes on the system \cite[p. 5]{lenovo:dsa}
\item Install \texttt{accel-config} from source\cite{intel:libaccel-config-repo} or the system package manager and inspect the detection of \gls{dsa} devices through the driver using \texttt{accel-config list -i}\cite[p. 6]{lenovo:dsa}
\item Create a \gls{dsa} configuration file, for which we provide an example under \texttt{benchmarks/configuration-files/8n1d1e1w.conf} in the accompanying repository \cite{thesis-repo} that is also applied for the benchmarks. Apply the configuration using \texttt{accel-config load-config -c [filename] -e}\cite[Fig. 3-9]{intel:dsaguide}
\item Inspect the now configured \gls{dsa} devices using \texttt{accel-config list}\cite[p. 7]{lenovo:dsa}; the output should match the desired configuration set in the file used
In this chapter, we measure the performance of the \gls{dsa}, with the goal of determining an effective utilization strategy for applying the \gls{dsa} to \gls{qdp}. In Section \ref{sec:perf:method} we lay out our benchmarking methodology, then perform benchmarks in Section \ref{sec:perf:bench} and finally summarize our findings in Section \ref{sec:perf:analysis}. As the performance of the \gls{dsa} has been evaluated in great detail by Reese Kuper et al. in \cite{intel:analysis}, we will perform only a limited number of benchmarks with the purpose of determining behaviour in a multi-socket system, penalties from using \gls{intel:dml}, and throughput for transfers from \glsentryshort{dram} to \gls{hbm}. \par
\section{Benchmarking Methodology}
\label{sec:perf:method}
\label{fig:perf-xeonmaxnuma}
\end{figure}
The benchmarks were conducted on a dual-socket server equipped with two Intel Xeon Max 9468 CPUs, each with 4 \glsentryshort{numa:node}s that have access to 16 GiB of \gls{hbm} and 12 physical cores. This results in a total of 96 cores and 128 GiB of \gls{hbm}. The layout of the system is visualized in Figure \ref{fig:perf-xeonmaxnuma}. For configuring it, we follow Section \ref{sec:state:setup-and-config}. \cite{intel:xeonmax-ark}\par
As \gls{intel:dml} does not have support for \glsentryshort{dsa:dwq}s (see Section \ref{sec:state:dml}), we run benchmarks exclusively with access through \glsentryshort{dsa:swq}s. The application written for the benchmarks can be inspected in source form under the directory \texttt{benchmarks} in the thesis repository \cite{thesis-repo}. \par
The benchmark performs the \gls{numa:node} setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the \gls{numa:node}s passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts. We therefore also do not use \enquote{.block\_on\_fault()}, as we did for the memcpy example in Section \ref{sec:state:dml}. \par
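A minimal sketch of this preparation step, using libnuma for node-local allocation; the helper name is hypothetical and not taken from the benchmark code:

\begin{verbatim}
#include <numa.h>
#include <cstring>
#include <cstddef>
#include <cstdint>

// Allocate a buffer on the given NUMA node and write to every page,
// so that page faults are resolved before the timed section starts.
uint8_t* alloc_prefaulted(size_t size, int node) {
    auto* ptr = static_cast<uint8_t*>(numa_alloc_onnode(size, node));
    std::memset(ptr, 0xAB, size); // touch the entire region up front
    return ptr;                   // free later with numa_free(ptr, size)
}
\end{verbatim}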
Timing in the outer loop may yield lower throughput than is actually achieved, should one of the \gls{dsa}s participating in a given task finish earlier than the others. We decided to measure the maximum time and therefore minimum throughput for these cases, as we want the benchmarks to represent the peak achievable when distributing one task over multiple engines, rather than executing multiple disjoint tasks. As a task can only be considered complete when all subtasks are completed, the minimum throughput represents this scenario. This may give an advantage to the peak CPU throughput benchmark we will reference later on, as it does not have this restriction placed upon it. \par
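Expressed as a formula, with \(t_i\) denoting the time measured for the submission thread serving the \(i\)-th \gls{dsa} and \(V\) the total volume copied in one iteration, the throughput reported for that iteration thus corresponds to:

\[ \mathit{throughput} = \frac{V}{\max_{i} t_i} \]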
\caption{Outer Benchmark Procedure Pseudocode. Timing marked with yellow background. Showing preparation of memory locations, clearing of cache entries, timing points and synchronized benchmark launch.}
\label{fig:benchmark-function:outer}
\end{figure}
\caption{Inner Benchmark Procedure Pseudocode. Showing work submission for single and batch submission.}
\label{fig:benchmark-function:inner}
\end{figure}
To get accurate results, the benchmark is repeated \(10\) times. Each iteration is timed from beginning to end, marked by yellow in Figure \ref{fig:benchmark-function:outer}. For small task sizes, the iterations complete in under \(100\ ms\). This short execution window can have adverse effects on the timings. Therefore, we repeat the code of the inner loop a configurable number of times, virtually extending the duration of a single iteration for these cases. Figure \ref{fig:benchmark-function:inner} depicts this behaviour as the for-loop. The chosen internal repetition count is \(10000\) for transfers in the range of \(1-8\ KiB\), \(1000\) for \(1\ MiB\), and one for larger instances. \par
For all \gls{dsa}s used in the benchmark, a submission thread executing the inner benchmark routine is spawned. The launch is synchronized by use of a barrier for each iteration, shown by the accesses to \enquote{LAUNCH\_BARRIER} in both Figures \ref{fig:benchmark-function:inner} and \ref{fig:benchmark-function:outer}. The behaviour in the inner function then differs depending on the selected submission method, which can be a single submission or a batch of a given size. This selection is displayed in Figure \ref{fig:benchmark-function:inner} at the switch statement for \enquote{mode}. Single submission follows the example given in Section \ref{sec:state:dml}, and we therefore do not go into detail explaining it here. Batch submission differs from the former: a sequence of the specified size is created, to which tasks are then added. A prepared sequence is submitted to the engine similarly to a single descriptor. \par
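The batch path can be sketched as follows; the construction of the sequence and the \texttt{add} call are assumptions based on the \gls{intel:dml} documentation and merely stand in for the actual benchmark code:

\begin{verbatim}
#include <dml/dml.hpp>
#include <cstddef>
#include <cstdint>
#include <memory>

// Sketch: build a sequence of copy tasks and submit it as one batch.
template <typename path>
bool submit_batch(uint8_t* dst, uint8_t* src, size_t size,
                  uint32_t batch_size) {
    auto sequence = dml::sequence(batch_size, std::allocator<uint8_t>());

    const size_t chunk = size / batch_size;
    for (uint32_t i = 0; i < batch_size; i++) {
        sequence.add(dml::mem_copy,
                     dml::make_view(src + i * chunk, chunk),
                     dml::make_view(dst + i * chunk, chunk));
    }

    // the prepared sequence is submitted like a single descriptor
    auto handler = dml::submit<path>(dml::batch, sequence);
    return handler.get().status == dml::status_code::ok;
}
\end{verbatim}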
\section{Benchmarks}
\label{sec:perf:bench}
With each submission, descriptors must be prepared and sent to the underlying hardware. This process is anticipated to incur a cost, affecting different transfer sizes and submission methods to a varying degree. We submit different sizes and compare batching with single submissions, determining which combination of submission method and size is most effective. \par
We anticipate that single submissions will consistently yield poorer performance, particularly with a pronounced effect on smaller transfer sizes. This expectation arises from the fact that the overhead of a single submission with the \gls{dsa:swq} is incurred for every iteration, whereas the batch experiences it only once for multiple copies. \par
\caption{Throughput for different Submission Methods and Sizes. Performing a copy with source and destination being \glsentryshort{numa:node} 0, executed by the \glsentryshort{dsa} on \glsentryshort{numa:node} 0. Observable is the submission cost which affects small transfer sizes differently.}
\label{fig:perf-submitmethod}
\end{figure}
From Figure \ref{fig:perf-submitmethod} we conclude that, with transfers of \(1\ MiB\) and upwards, the cost of single submission drops. Even for transferring the largest datum in this benchmark, we still observe a throughput difference between batching and single submission, pointing to substantial submission costs encountered for submissions to the \gls{dsa:swq} and task preparation. For smaller transfers the performance varies greatly, with batch operations leading in throughput. Reese Kuper et al. noted that \enquote{SWQ observes lower throughput between \(1-8\ KB\) [transfer size]} \cite[p. 6]{intel:analysis}. We, however, measured a much higher point of equalization, pointing to additional delays introduced by programming the \gls{dsa} through \gls{intel:dml}. Another limitation is visible in our result, namely the inherent throughput limit per \gls{dsa} chip of close to \(30\ GiB/s\), caused by I/O fabric limitations \cite[p. 5]{intel:analysis}. \par
\subsection{Multithreaded Submission}
\label{subsec:perf:mtsubmit}
As we might encounter access to one \gls{dsa} from multiple threads through the associated \glsentrylong{dsa:swq}, understanding the impact of this type of access is crucial. We benchmark multithreaded submission for one, two, and twelve threads, with the latter representing the core count of one processing sub-node on the test system. We spawn multiple threads, all submitting to one \gls{dsa}. Furthermore, we perform this benchmark with sizes of \(1\ MiB\) and \(1\ GiB\) to examine whether the behaviour changes with submission size. For smaller sizes, the completion time may be shorter than the submission time, leading to potentially different effects of threading due to the fact that multiple threads work to fill the queue, preventing task starvation. We may also experience lower-than-peak throughput with rising thread count, caused by the synchronization inherent to the \gls{dsa:swq}. \par
\caption{Throughput for different Thread Counts and Sizes. Multiple threads submit to the same Shared Work Queue. Performing a copy with source and destination being \glsentryshort{numa:node} 0, executed by the DSA on \glsentryshort{numa:node} 0.}
\subsection{Data Movement from \glsentryshort{dram} to \glsentryshort{hbm}}
\label{subsec:perf:datacopy}
Moving data from \glsentryshort{dram} to \gls{hbm} is most relevant to the rest of this work, as it is the target application. With \gls{hbm} offering higher bandwidth than the \glsentryshort{dram} of our system, we will be restricted by the available bandwidth of the source. To determine the achievable upper limit, we must calculate the available peak bandwidth. For each \gls{numa:node}, the test system is configured with two DIMMs of DDR5-4800. The naming scheme contains the data rate in Megatransfers (MT) per second; however, the processor specification notes that, for dual-channel operation, the maximum supported speed drops to \(4400\ MT/s\) \cite{intel:xeonmax-ark}. We calculate the transfers performed per second for one \gls{numa:node} (1), followed by the bytes per transfer \cite{kingston:ddr5-spec-overview} in calculation (2), and at last combine these two for the theoretical peak bandwidth per \gls{numa:node} on the system (3). \par
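The referenced calculations can be reconstructed as follows, assuming \(8\ B\) transferred per DIMM and transfer \cite{kingston:ddr5-spec-overview} and two DIMMs per \gls{numa:node}:

\begin{align*}
\text{(1)} \quad & 2\ \mathit{DIMM} \times 4400\ MT/s = 8800\ MT/s \\
\text{(2)} \quad & 64\ bit = 8\ B \text{ per transfer} \\
\text{(3)} \quad & 8800\ MT/s \times 8\ B/T = 70.4\ GB/s \approx 65.56\ GiB/s
\end{align*}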
\begin{figure}[t]
\centering
\begin{subfigure}[t]{0.225\textwidth}
\centering
\label{fig:perf-dsa}
\end{figure}
From the observed bandwidth limitation of a single \gls{dsa} situated at about \(30\ GiB/s\) (see Section \ref{subsec:perf:submitmethod}) and the available memory bandwidth of \(65.56\ GiB/s\), we conclude that a copy task has to be split across multiple \gls{dsa}s to achieve peak throughput. Different methods of splitting will be evaluated. Given that our system consists of multiple processors, data movement between sockets must also be evaluated, as additional bandwidth limitations may be encountered \cite{bench:heterogeneous-communication}. Beyond two \gls{dsa}s, only marginal gains are to be expected, due to the throughput limitation of the available source memory. \par
To determine the optimal number of \gls{dsa}s, we will measure throughput for one, two, four, and eight accelerators participating in the copy operations. We name the utilization of two \gls{dsa}s \enquote{Push-Pull}, as, with two accelerators, we utilize the ones located on the data source and destination \gls{numa:node}s. As eight \gls{dsa}s is the maximum available on our system, this configuration will be referred to as \enquote{Brute-Force}. \par
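As an illustration of how a single copy can be distributed, the following sketch splits the transfer into equal chunks and hands each chunk to one \gls{dsa} by pinning the submitting thread to that accelerator's \gls{numa:node}. It is a simplification of the benchmark code; \texttt{dml\_memcpy} refers to the hypothetical helper sketched in Section \ref{sec:state:dml}:

\begin{verbatim}
#include <dml/dml.hpp>
#include <thread>
#include <vector>
#include <cstddef>
#include <cstdint>

// Sketch: split one copy across several DSAs, one submitting thread
// per accelerator, each pinned to the node of "its" DSA.
// Remainder handling for sizes not divisible by the DSA count is omitted.
void split_copy(uint8_t* dst, uint8_t* src, size_t size,
                const std::vector<int>& dsa_nodes) {
    const size_t chunk = size / dsa_nodes.size();
    std::vector<std::thread> workers;

    for (size_t i = 0; i < dsa_nodes.size(); i++) {
        const int node = dsa_nodes[i];
        workers.emplace_back([=]() {
            // dml_memcpy pins the worker thread and submits the chunk
            // to the DSA located on the given node
            dml_memcpy<dml::hardware>(dst + i * chunk, src + i * chunk,
                                      chunk, node);
        });
    }
    for (auto& w : workers) w.join();
}
\end{verbatim}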
For this benchmark, we transfer \(1\ GiB\) of data from \gls{numa:node} 0 to the destination \gls{numa:node}. We present data for \gls{numa:node}s 8, 11, 12, and 15. To understand the selection, see Figure \ref{fig:perf-xeonmaxnuma}, which illustrates the \gls{numa:node} IDs of the configured systems and the corresponding storage technology. \gls{numa:node} 8 accesses the \gls{hbm} on \gls{numa:node} 0, making it the physically closest possible destination. \gls{numa:node} 11 is located diagonally on the chip, representing the farthest intra-socket operation benchmarked. \gls{numa:node}s 12 and 15 lie diagonally on the second socket's CPU, making them representative of inter-socket transfer operations. \par
We begin by examining the common behaviour of load balancing techniques depicted in Figure \ref{fig:perf-dsa}. The real-world peak throughput of \(64\ GiB/s\) approaches the calculated available bandwidth. In Figure \ref{fig:perf-dsa:1}, a notable hard bandwidth limit is observed, just below the \(30\ GiB/s\) mark, reinforcing what was encountered in Section \ref{subsec:perf:submitmethod}: a single \gls{dsa} is constrained by I/O-Fabric limitations. \par
Unexpected throughput differences are evident for all configurations, except the bandwidth-bound single \gls{dsa}. Notably, \gls{numa:node} 8 performs worse than copying to \gls{numa:node} 11. As \gls{numa:node} 8 serves as the \gls{hbm} accessor for the data source \gls{numa:node}, the data path is the shortest tested. The lower throughput suggests that the \gls{dsa} may suffer from sharing parts of the path for reading and writing in this situation. Another interesting observation is that, contrary to our assumption, the physically more distant \gls{numa:node} 15 achieves higher throughput than the closer \gls{numa:node} 12. We lack an explanation for this anomaly and will further examine this behaviour in the analysis of the CPU throughput results in Section \ref{subsec:perf:cpu-datacopy}. \par
\begin{figure}[t]
\centering
\begin{subfigure}[t]{0.35\textwidth}
\centering
\label{fig:perf-dsa-analysis}
\end{figure}
For the results of the Brute-Force approach illustrated in Figure \ref{fig:perf-dsa:8}, we observe peak speeds, close to the calculated theoretical limit, when copying across sockets from \gls{numa:node} 0 to \gls{numa:node} 15. This contradicts our assumption that peak bandwidth would be limited by the interconnect. However, for intra-node copies, there is an observable penalty for using the off-socket \gls{dsa}s. We will analyse this behaviour by comparing the different benchmarked configurations and summarize our findings on scalability. \par
When comparing the Brute-Force approach with Push-Pull in Figure \ref{fig:perf-dsa-analysis:scaling}, average performance decreases, even though four times the resources are utilized over a longer duration. As shown in Figure \ref{fig:perf-dsa:2}, using Brute-Force still leads to a slight increase in throughput for inter-socket operations, although far from scaling linearly. Therefore, we conclude that, although data movement across the interconnect incurs additional cost, no hard bandwidth limit is observable, and the same peak speed as observed for intra-socket copies with four \gls{dsa}s is reached. This might point to an architectural advantage, as we will encounter the expected speed reduction for copies crossing the socket boundary when executed on the CPU in Section \ref{subsec:perf:cpu-datacopy}. \par
From the average throughput and scaling factors in Figure \ref{fig:perf-dsa-analysis}, it becomes evident that splitting tasks over more than two \gls{dsa}s yields only marginal gains. This could be due to increased congestion of the processors' interconnect; however, as no hard limit is encountered, this is not a definitive explanation. Utilizing off-socket \gls{dsa}s proves disadvantageous in the average case, reinforcing the claim of communication overhead for crossing the socket boundary. \par
The choice of a load balancing method is not trivial. Consulting Figure \ref{fig:perf-dsa-analysis:average}, the highest throughput is achieved by using four \gls{dsa}s. At the same time, this causes high system utilization, making it unsuitable for situations where resources are to be distributed among multiple control flows. For this case, Push-Pull achieves performance close to the real-world peak while also not wasting resources due to poor scaling (see Figure \ref{fig:perf-dsa-analysis:scaling}). \par
\begin{figure}[t]
\centering
\begin{subfigure}[t]{0.35\textwidth}
\centering
In this section, we summarize the conclusions from the performed benchmarks. \par
\begin{itemize}
\item From Section \ref{subsec:perf:submitmethod} we conclude that small copies under \(1\ MiB\) in size require batching and still do not reach peak performance. Task size should therefore be at or above \(1\ MiB\). Otherwise, offloading might prove more expensive than performing the copy on the CPU.
\item Section \ref{subsec:perf:mtsubmit} assures that access from multiple threads does not negatively affect the performance when using a \glsentrylong{dsa:swq} for work submission. Due to the lack of \glsentrylong{dsa:dwq} support, we have no data to determine the cost of submission to the \gls{dsa:swq}.
\item In Section \ref{subsec:perf:datacopy}, we found that using more than two \gls{dsa}s results in only marginal gains. The load balancer of choice is therefore the Push-Pull configuration, as it achieves fair throughput with low utilization.
\item Combining the results from Sections \ref{subsec:perf:datacopy} and \ref{subsec:perf:submitmethod}, we posit that for situations with smaller transfer sizes and a high number of tasks, splitting a copy might prove disadvantageous. Since the split incurs additional submission delay, while overall throughput remains high without it due to queue filling (see Section \ref{subsec:perf:mtsubmit}), the split might reduce overall effectiveness. To utilize the available resources effectively, distributing tasks across the available \gls{dsa}s is still desirable. This finding led us to implement round-robin balancing in Section \ref{sec:impl:application}; a minimal sketch of this selection scheme is shown after this list.
\end{itemize}
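The following C++ sketch illustrates such a round-robin distribution of whole tasks over the available accelerators; the class and member names are illustrative and not taken from the contributed implementation. \par
\begin{verbatim}
#include <atomic>
#include <cstddef>
#include <vector>

// Illustrative helper: hands out the available DSAs in turn, so successive
// tasks are spread over all accelerators without splitting a single copy.
class RoundRobinSelector {
public:
    explicit RoundRobinSelector(std::vector<int> dsa_node_ids)
        : nodes_(std::move(dsa_node_ids)) {}

    // Returns the NUMA node id of the DSA to use for the next task.
    int Next() {
        const std::size_t i = counter_.fetch_add(1, std::memory_order_relaxed);
        return nodes_[i % nodes_.size()];
    }

private:
    std::vector<int> nodes_;
    std::atomic<std::size_t> counter_{0};
};
\end{verbatim}
Each call to \texttt{Next} advances an atomic counter, so concurrent submitters never favour a single \gls{dsa}. \par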
\pagebreak
Once again, we refer to Figures \ref{fig:perf-dsa} and \ref{fig:perf-cpu}, both representing the maximum throughput achieved with the utilization of either the \gls{dsa} for the former or the CPU for the latter. Noticeably, the \gls{dsa} does not seem to suffer from inter-socket overhead like the CPU. The \gls{dsa} performs similarly to the CPU for intra-node data movement, while outperforming it in inter-node scenarios. The latter, as mentioned in Section \ref{subsec:perf:datacopy}, might point to an architectural advantage of the \gls{dsa}. The performance observed in the above benchmarks demonstrates potential for rapid data movement while simultaneously relieving the CPU of this task and thereby freeing capacity for computational tasks. \par
We encountered an anomaly on \gls{numa:node} 12 for which we were unable to find an explanation. Since this behaviour is also observed on the CPU, pointing to a problem in the system architecture, identifying the root cause falls beyond the scope of this work. Despite being unable to account for all measurements, this chapter still offers valuable insights into the performance of the \gls{dsa}, highlighting both its strengths and weaknesses. It provides data-driven guidance for a complex architecture, aiding in the determination of an optimal utilization strategy for the \gls{dsa}. \par
In this chapter, we design a class interface for a general-purpose cache. We will outline the requirements and elucidate the solutions employed to address them, culminating in the final architecture. Details pertaining to the implementation of this blueprint will be deferred to Chapter \ref{chap:implementation}, where we delve into a selection of relevant aspects. \par
The target application of the code contributed by this work is to accelerate \glsentrylong{qdp} by offloading copy operations to the \gls{dsa}. Prefetching is inherently related to cache functionality. Given that an application providing the latter offers a broader scope of utility beyond \gls{qdp}, we opted to implement an offloading \texttt{Cache}. \par
\label{fig:impl-design-interface}
\end{figure}
\todo{add more references to the figure}
\pagebreak
\section{Interface}
\label{sec:design:interface}
The interface of \texttt{Cache} must provide three basic functions: (1) requesting a memory block to be cached, (2) accessing a cached memory block and (3) synchronizing the cache with the source memory. Operation (3) is required when the data that is cached may also be modified, necessitating synchronization between cache and source memory. Due to the various setups and use cases for the \texttt{Cache}, the user should also be responsible for choosing the cache placement and the copy method. As re-caching is resource intensive, data should remain in the cache for as long as possible. We only flush unused entries when the lack of free cache memory during the caching of a new memory block demands it. \par
Given that this work primarily focuses on caching static data, we only provide cache invalidation and not synchronization. The \texttt{Cache::Invalidate} function, given a memory address, will remove all entries for it from the cache. Operations (1) and (2) are provided by a single function, which we call \texttt{Cache::Access}. This function receives a data pointer and size as parameters. Its behaviour depends on whether the cache already contains an entry for the requested block: a caching operation is only submitted if the received pointer is not yet cached, while the cache entry is returned in both cases. The operations \texttt{Cache::Access} and \texttt{Cache::Invalidate}, along with additional methods of the \texttt{Cache} class deemed useful but not covered by the behavioural requirements, can be viewed in Figure \ref{fig:impl-design-interface}. \par
Given the asynchronous nature of caching operations, users may opt to await their completion. This proves particularly beneficial when parallel threads are actively processing, and the current thread strategically pauses until its data becomes available in faster memory, thereby optimizing access speeds for local computations while allowing work to continue in parallel. To facilitate this process, the \texttt{Cache::Access} method returns an instance of the class \texttt{CacheData}. Figure \ref{fig:impl-design-interface} also documents the public interface of \texttt{CacheData} in the left block. Invoking \texttt{CacheData::GetDataLocation} provides access to a pointer to the location of the cached data. Additionally, the \texttt{CacheData::WaitOnCompletion} method is available, designed to return only upon the completion of the caching operation. During this period, the current thread will sleep. A call to \texttt{CacheData::WaitOnCompletion} is required should the user desire to use the cache, as only through it is the cache pointer updated. \par
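The following C++ sketch demonstrates the intended usage of this interface. The declarations of \texttt{Cache} and \texttt{CacheData} stem from the contributed library; the parameter types and the return type of \texttt{Cache::Access} shown here are assumptions based on the description above. \par
\begin{verbatim}
#include <cstddef>
#include <cstdint>

// "cache" is an initialized Cache instance, "column" points to source memory.
void ProcessColumn(Cache& cache, uint8_t* column, std::size_t size) {
    // submits a caching operation, or reuses an existing entry; non-blocking
    auto entry = cache.Access(column, size);

    // ... unrelated work may proceed while the DSA copies in the background ...

    // block until the copy has completed and the cache pointer becomes valid
    entry->WaitOnCompletion();

    // pointer into the cached copy, residing in the faster memory tier
    uint8_t* data = static_cast<uint8_t*>(entry->GetDataLocation());

    // ... process "data" instead of "column" ...
}
\end{verbatim}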
\subsection{Policy Functions}
\label{subsec:design:policy-functions}
In the introduction of this chapter, we mentioned that the user is responsible for choosing the cache placement and selecting the copy-participating \gls{dsa}s. As we will find out in Section \ref{sec:impl:application}, allocating memory inside the cache is not feasible due to the delays it may incur. Therefore, the user is also required to provide functionality for dynamic memory management to the \texttt{Cache}. The former is realized by what we will call \enquote{Policy Functions}, which are function pointers passed on initialization, as visible in \texttt{Cache::Init} in Figure \ref{fig:impl-design-interface}. We use the same methodology for the latter, requiring functions performing dynamic memory management. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed for the implementation in Chapter \ref{chap:implementation}, while we detail their required behaviour here. \par
The policy functions receive arguments deemed sensible for determining placement and participation selection. Both are informed of the source \glsentryshort{node}, the \glsentryshort{node} requesting caching, and the data size. The cache placement policy then returns a \glsentryshort{node}-ID on which the data is to be cached, while the copy policy will provide the cache with a list of \glsentryshort{node}-IDs, detailing which \gls{dsa}s should participate in the operation. \par
For memory management, two functions are required, providing allocation (\texttt{malloc}) and deallocation (\texttt{free}) capabilities to the \texttt{Cache}. Following the naming scheme, these two functions must adhere to the thread safety guarantees and behaviour set forth by the C++ standard. Most notably, \texttt{malloc} must never return the same address for subsequent or concurrent calls. \par
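The signatures below sketch these requirements in C++; the exact type names, the parameter order, and whether the allocator receives the target \glsentryshort{node} are assumptions and may differ from the actual implementation. \par
\begin{verbatim}
#include <cstddef>
#include <vector>

// Returns the NUMA node on which the requested block is to be cached.
using CachePolicyFn = int (*)(int numa_dst_node, int numa_src_node,
                              std::size_t data_size);

// Returns the NUMA nodes whose DSAs participate in the copy operation.
using CopyPolicyFn = std::vector<int> (*)(int numa_dst_node, int numa_src_node,
                                          std::size_t data_size);

// Dynamic memory management with malloc/free semantics; both must satisfy
// the thread safety guarantees set forth by the C++ standard.
using CacheMallocFn = void* (*)(std::size_t size, int numa_node);
using CacheFreeFn   = void  (*)(void* ptr, std::size_t size);

// Example placement policy: on the benchmarked system, the HBM accessor of a
// CPU node carries the CPU node's id offset by eight (assumption for brevity).
int CacheOnHbm(int numa_dst_node, int /*numa_src_node*/, std::size_t /*size*/) {
    return numa_dst_node + 8;
}
\end{verbatim}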
\subsection{Cache Entry Reuse}
\label{subsec:design:cache-entry-reuse}
When multiple consumers wish to access the same memory block through the \texttt{Cache}, we face a choice between providing each with their own entry or sharing one for all consumers. The first option may lead to high load on the accelerator due to multiple copy operations being submitted and also increases the memory footprint of the cache. The latter option, although more complex, was chosen to address these concerns. To implement entry reuse, the existing \texttt{CacheData} will be extended in scope to handle multiple consumers. Copies of it can be created, and must synchronize with each other for \texttt{CacheData::WaitOnCompletion} and \texttt{CacheData::GetDataLocation}. This is illustrated by the green markings, indicating thread safety guarantees for access, in Figure \ref{fig:impl-design-interface}. The \texttt{Cache} must then ensure that, on concurrent access to the same resource, only one thread creates \texttt{CacheData} while the others are provided with a copy. \par
\subsection{Cache Entry Lifetime}
\label{subsec:design:cache-entry-lifetime}
Allowing multiple references to the same entry introduces concerns regarding memory management. \par
\subsection{Weak Behaviour and Page Fault Handling}
\label{subsec:design:weakop-and-pf}
During our testing phase, we discovered that \gls{intel:dml} does not support interrupt-based completion signaling, as discussed in Section \ref{subsubsec:state:completion-signal}. Instead, it resorts to busy-waiting, which consumes CPU cycles, thereby impacting concurrent operations. To mitigate this issue, we extended the functionality of both \texttt{Cache} and \texttt{CacheData}, providing weak versions of \texttt{Cache::Access} and \texttt{CacheData::WaitOnCompletion}. The weak wait function only checks for operation completion once and then returns, irrespective of the completion state, thereby relaxing the guarantee that the cache location will be valid after the call. Hence, the user must verify validity after the wait if they choose to utilize this option. Similarly, weak access merely returns a pre-existing instance of \texttt{CacheData}, bypassing work submission when no entry is present for the requested memory block. These features prove beneficial in latency-sensitive scenarios where the overhead from cache operations and waiting on completion is undesirable. As the weak functions constitute optional behaviour, not integral to the \texttt{Cache}'s operation, Figure \ref{fig:impl-design-interface} does not cover them. \par
Additionally, while optimizing for access latency, we encountered delays caused by page fault handling on the \gls{dsa}. These not only affect the current task but also block the progress of other tasks on the \gls{dsa}. Consequently, the \texttt{Cache} defaults to not handling page faults, while still offering the option to let the \gls{dsa} handle them. \par
To configure runtime behaviour for page fault handling and access type, we introduced a flag system to both \texttt{Cache} and \texttt{CacheData}, with the latter inheriting any flags set in the former upon creation. This design allows for global settings, such as opting for weak waits or enabling page fault handling. Weak waits can also be selected in specific situations by setting the flag on the \texttt{CacheData} instance. For \texttt{Cache::Access}, the flag must be set for each function call, defaulting to strong access, as exclusively using weak access would result in no cache utilization. \par
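A sketch of this flag system is shown below. The flag names, their encoding as a bit mask, and the setter shown in the comments are assumptions for illustration; the actual identifiers in the library may differ. \par
\begin{verbatim}
#include <cstdint>

// Illustrative flag constants, combined into a single bit mask.
namespace cacheflags {
    constexpr uint64_t kNone            = 0;
    constexpr uint64_t kWaitWeak        = 1ull << 0; // weak WaitOnCompletion
    constexpr uint64_t kAccessWeak      = 1ull << 1; // per-call flag for Access
    constexpr uint64_t kHandlePageFault = 1ull << 2; // let the DSA resolve faults
}

// Flags set on the Cache act as global defaults and are inherited by every
// CacheData instance created from it, e.g. (setter name assumed):
//   cache.SetFlags(cacheflags::kWaitWeak | cacheflags::kHandlePageFault);
// Weak access must be requested per call and defaults to strong access:
//   auto entry = cache.Access(ptr, size, cacheflags::kAccessWeak);
\end{verbatim}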
\section{Usage Restrictions}
\label{sec:design:restrictions}
\caption{Public Interface of \texttt{CacheData} and \texttt{Cache} Classes. Colour coding for thread safety. Grey denotes impossibility for threaded access. Green indicates full safety guarantees only relying on atomics to achieve this. Yellow may use locking but is still safe for use. Red must be called from a single threaded context.}
\label{fig:overlapping-access}
\end{figure}
In the context of this work, the cache primarily manages static data, leading to two restrictions placed on the invalidation operation, allowing significant reductions in design complexity. Firstly, due to the cache's design, accessing overlapping areas will lead to undefined behaviour during the invalidation of any one of them. Only entries with equivalent source pointers will be invalidated, while other entries with differing source pointers, still covering the now-invalidated region due to their size, will remain unaffected. Consequently, the cache may or may not continue to contain invalid elements. This scenario is depicted in Figure \ref{fig:overlapping-access}, where \(A\) points to a memory region which, due to its size, overlaps with the region pointed to by \(B\). Secondly, invalidation is a manual process, necessitating the programmer to recall which data points are currently cached and to invalidate them upon modification. In this scenario, no ordering guarantees are provided, potentially resulting in threads still holding pointers to now-outdated entries and continuing their progress with this data. \par
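The first restriction can be illustrated with the following sketch, in which the pointers, sizes, and the surrounding function are arbitrary examples. \par
\begin{verbatim}
#include <cstddef>
#include <cstdint>

// "cache" is an initialized Cache instance; region A covers region B entirely.
void OverlappingInvalidation(Cache& cache, uint8_t* source, std::size_t n) {
    uint8_t* a = source;      // entry A covers [source, source + 2n)
    uint8_t* b = source + n;  // entry B covers [source + n, source + 2n)

    auto entry_a = cache.Access(a, 2 * n);
    auto entry_b = cache.Access(b, n);

    cache.Invalidate(a);
    // Only the entry with source pointer "a" is removed. If the region behind
    // "a" is now modified, entry B still covers [source + n, source + 2n) and
    // remains in the cache, so stale data may be served for that range.
}
\end{verbatim}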
Additionally, the cache inherits some restrictions of the \gls{dsa}. Due to the asynchronous operation and internal workings, multiple operation ordering hazards (Section \ref{subsubsec:state:ordering-guarantees}) may arise, either internally within one \gls{dsa}, or externally when coming into conflict with other \gls{dsa}s. However, these hazards do not impact the user of \texttt{Cache}, as the design implicitly prevents hazardous situations. Specifically, \texttt{CacheData::WaitOnCompletion} ensures that memory regions written to by the \gls{dsa} are only accessible after operations have completed. This guards against both intra- and inter-accelerator ordering hazards. Memory regions that a \gls{dsa} may access are internally allocated in the \texttt{Cache} and become retrievable only after operation completion is signalled, thus preventing submissions for the same destination. Additionally, multiple threads submitting tasks for the same source location do not pose a problem, as the access is read-only; duplicate submission is also prevented by \texttt{Cache::Access} performing work submission on only one thread. \par
\todo{consider listing possible hazards before dropping them out of thin air here}
Despite accounting for the complications in operation ordering, one potential situation may still lead to undefined behaviour. Since the \gls{dsa} operates asynchronously, modifying data for a region present in the cache before ensuring that all caching operations have completed, through a call to \texttt{CacheData::WaitOnCompletion}, will result in an undefined state for the cached region. Therefore, it is imperative to explicitly wait for the caching operations of an entry to complete before modifying its source memory. \par
Conclusively, the \texttt{Cache} should not be used with overlapping buffers. Caching mutable data presents design challenges which could be solved by implementing a specific templated data type, which we will call \texttt{CacheMutWrapper}. This type internally stores its data in a struct which can then be cached and is tagged with a timestamp. \texttt{CacheMutWrapper} then ensures that all \gls{mutmet} also invalidate the cache, perform waiting on operation completion to avoid modification while the \gls{dsa} accesses the cached region, and update the timestamp. This stamp is then used by further processing steps to verify that a value presented to them by a thread was calculated using the latest version of the data contained in \texttt{CacheMutWrapper}, guarding against the lack of ordering guarantees for invalidation, as mentioned by the first paragraph of this section. \par
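A sketch of the proposed \texttt{CacheMutWrapper} is given below. As the type is only a design suggestion, all details shown, including the return type of \texttt{Cache::Access} and the single-threaded simplification, are assumptions. \par
\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>

// Design sketch, not part of the contributed library. Single-threaded for
// brevity; Cache and CacheData are the types described in this chapter.
template <typename T>
class CacheMutWrapper {
    struct Tagged { uint64_t timestamp; T value; };

public:
    CacheMutWrapper(Cache& cache, T initial) : cache_(cache), data_{0, initial} {}

    // Mutating access: wait for in-flight caching operations, apply the
    // modification, then invalidate the entry and advance the timestamp.
    template <typename Fn>
    void Modify(Fn&& fn) {
        if (entry_) entry_->WaitOnCompletion();  // DSA must not read during write
        fn(data_.value);
        cache_.Invalidate(reinterpret_cast<uint8_t*>(&data_));
        data_.timestamp += 1;
        entry_.reset();
    }

    // Read access through the cache; the timestamp lets later processing
    // steps verify that their result was derived from the latest version.
    std::pair<const T*, uint64_t> Read() {
        entry_ = cache_.Access(reinterpret_cast<uint8_t*>(&data_), sizeof(Tagged));
        entry_->WaitOnCompletion();
        auto* cached = static_cast<const Tagged*>(entry_->GetDataLocation());
        return { &cached->value, cached->timestamp };
    }

private:
    Cache& cache_;
    Tagged data_;
    std::unique_ptr<CacheData> entry_;  // assumed return type of Cache::Access
};
\end{verbatim}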
In this chapter, we concentrate on specific implementation details. \par
The usage of locking and atomics to achieve safe concurrent access has proven to be challenging. Their use is performance-critical, and mistakes may lead to deadlock. Consequently, these aspects constitute the most interesting part of the implementation, which is why this chapter will focus on the synchronization techniques employed. \par
Throughout the following sections we will use the term \enquote{handler}, which was coined by \gls{intel:dml}, referring to an object associated with an operation on the accelerator, exposing the completion state of the asynchronously executed task. Use of a handler is also displayed in the \texttt{memcpy}-function for the \gls{dsa} as shown in Figure \ref{fig:dml-memcpy}. As we may split up one single copy into multiple distinct tasks for submission to multiple \gls{dsa}s, \texttt{CacheData} internally contains a vector of multiple of these handlers. \par
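The following sketch shows how such a split can be expressed with the high-level C++ API of \gls{intel:dml}, producing one handler per chunk. Error handling and the selection of the executing \gls{dsa} are omitted; the chunking scheme and the use of \texttt{dml::mem\_move} are illustrative choices, not a statement about the contributed implementation. \par
\begin{verbatim}
#include <dml/dml.hpp>
#include <cstddef>
#include <cstdint>
#include <vector>

// Splits one copy into "parts" chunks and submits each chunk separately,
// returning true only if every chunk completed successfully.
bool SplitCopy(uint8_t* src, uint8_t* dst, std::size_t size, std::size_t parts) {
    if (parts == 0) return false;

    // deduce the handler type from the submit call instead of naming it
    using Handler = decltype(dml::submit<dml::hardware>(
        dml::mem_move, dml::make_view(src, size), dml::make_view(dst, size)));

    std::vector<Handler> handlers;
    const std::size_t chunk = size / parts;

    for (std::size_t i = 0; i < parts; i++) {
        const std::size_t offset = i * chunk;
        const std::size_t n = (i + 1 == parts) ? size - offset : chunk;
        handlers.emplace_back(dml::submit<dml::hardware>(
            dml::mem_move,
            dml::make_view(src + offset, n),
            dml::make_view(dst + offset, n)));
    }

    // wait on every handler, analogous to CacheData holding a handler vector
    bool ok = true;
    for (auto& handler : handlers) {
        if (handler.get().status != dml::status_code::ok) ok = false;
    }
    return ok;
}
\end{verbatim}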
\subsection{Cache: Locking for Access to State}
\label{subsec:impl:cache-state-lock}
To keep track of the current cache state, the \texttt{Cache} will hold a reference to each currently existing \texttt{CacheData} instance. The reason for this is twofold: In Section \ref{sec:design:interface} we decided to keep elements in the cache until forced by \gls{mempress} to remove them. Secondly, in Section \ref{subsec:design:cache-entry-reuse} we decided to reuse one cache entry for multiple consumers. The second design decision requires access to the structure holding this reference to be thread safe when accessing and modifying the cache state in \texttt{Cache::Access}, \texttt{Cache::Flush} and \texttt{Cache::Clear}. Both the flush and clear operations necessitate unique locking, preventing other calls to \texttt{Cache} from making progress while the operation is being processed. For \texttt{Cache::Access}, the use of locking depends upon the cache's state. At first, only a shared lock is acquired for checking whether the given address already resides in cache, allowing other \texttt{Cache::Access}-operations to also perform this check. If no entry for the region is present, a unique lock is required when adding the newly created entry to the cache. \par
A map was chosen as the data structure to represent the current cache state, with the memory address of the entry as key and a \texttt{CacheData} instance as value. Thereby we achieve the promise of keeping entries cached whenever possible. With the \texttt{Cache} keeping an internal reference, the last reference to a \texttt{CacheData} instance will only be destroyed upon removal from the cache state. This removal can either be triggered implicitly by flushing when encountering \gls{mempress} during a cache access, or explicitly by calls to \texttt{Cache::Flush} or \texttt{Cache::Clear}. \par
With the caching policy being controlled by the user, one datum may be requested for caching in multiple locations. To accommodate this, one map is allocated for each available \glsentrylong{numa:node} of the system. This also reduces the potential for lock contention by separately locking each \gls{numa:node}'s state instead of utilizing a global lock. Even with this optimization, scenarios where the \texttt{Cache} is frequently tasked with flushing and re-caching by multiple threads from the same node will lead to threads contending for unique access to the state. \par
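The following sketch outlines the per-\gls{numa:node} state and the described locking scheme; the structure, member names, and the use of \texttt{std::shared\_ptr} to hold entries are assumptions made for illustration, while \texttt{CacheData} refers to the class introduced in Chapter \ref{chap:design}. \par
\begin{verbatim}
#include <cstdint>
#include <memory>
#include <shared_mutex>
#include <unordered_map>

// One instance of this state exists per NUMA node of the system.
struct NodeCacheState {
    std::shared_mutex lock;
    // key: source address of the cached block, value: the shared entry
    std::unordered_map<uint8_t*, std::shared_ptr<CacheData>> entries;
};

// Shared lock: many Cache::Access calls may perform the lookup in parallel.
std::shared_ptr<CacheData> Lookup(NodeCacheState& state, uint8_t* src) {
    std::shared_lock<std::shared_mutex> guard(state.lock);
    auto it = state.entries.find(src);
    return it != state.entries.end() ? it->second : nullptr;
}

// Unique lock: only one thread may modify the state. If another thread won
// the race and inserted first, its entry is returned and reused instead.
std::shared_ptr<CacheData> Insert(NodeCacheState& state, uint8_t* src,
                                  std::shared_ptr<CacheData> entry) {
    std::unique_lock<std::shared_mutex> guard(state.lock);
    auto result = state.entries.try_emplace(src, std::move(entry));
    return result.first->second;
}
\end{verbatim}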
\subsection{CacheData: Shared Reference}
\label{subsec:impl:cachedata-ref}
As Section \ref{subsec:design:cache-entry-reuse} details, we intend to share a cache entry between multiple consumers. \par
We assume the handlers of \gls{intel:dml} to be unsafe for access from multiple threads, as no guarantees were found in the documentation. To achieve the safety for \texttt{CacheData::WaitOnCompletion}, outlined in Section \ref{subsec:design:cache-entry-reuse}, threads need to coordinate on a master thread which performs the actual waiting, while the others wait on the master. \par
Upon a call to \texttt{Cache::Access}, coordination must take place to only add one instance of \texttt{CacheData} to the cache state and, most importantly, submit to the \gls{dsa} only once. This behaviour was elected in Section \ref{subsec:design:cache-entry-reuse} to reduce the load placed on the accelerator by preventing duplicate submission. To solve this, \texttt{Cache::Access} will add the instance to the cache state under unique lock (see Section \ref{subsec:impl:cache-state-lock} above) and only then, if the current thread actually modified the cache state, will submission take place. This results in a race for insertion, won by exactly one thread, which is then chosen to perform the submission. Through this two-stepped process, the state may contain \texttt{CacheData} without a valid cache pointer and no handlers available to wait on. As the following paragraph and sections demonstrate, this resulted in implementation challenges. \par
\caption{Sequence for Blocking Scenario. Observable in the first draft implementation. Scenario where \(T_1\) performed the first access to a datum, followed by \(T_2\). Then \(T_1\) holds the handlers exclusively, leading to \(T_2\) having to wait for \(T_1\) to perform both the work submission and the wait before it can access the datum through the cache.}
In the first implementation, a thread would check if the handlers are available and wait on a value change from \texttt{nullptr} if they are not. As the handlers are only available after submission, a situation could arise where only one copy of \texttt{CacheData} is capable of actually waiting on them. To illustrate this, an exemplary scenario is used, as seen in the sequence diagram in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}. Assume that two threads \(T_1\) and \(T_2\) wish to access the same resource. \(T_1\) is the first to call \texttt{Cache::Access} and therefore adds it to the cache state and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted and \(T_2\) obtains access to the incomplete \texttt{CacheData}, on which it calls \texttt{CacheData::WaitOnCompletion}, causing it to see \texttt{nullptr} for the cache pointer and handlers. Therefore, \(T_2\) waits on the cache pointer becoming valid (marked blue lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). Then \(T_1\) submits the work and sets the handlers (marked red lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) continues to wait on the cache pointer. Therefore, only \(T_1\) can trigger the waiting on the handlers and is thereby capable of keeping \(T_2\) from progressing. This is undesirable as it can lead to deadlocking if \(T_1\) does not wait and at the very least may lead to unnecessary delay for \(T_2\). \par
\caption{\texttt{CacheData::WaitOnCompletion} Pseudocode. Final rendition of the implementation for a fair wait function.}
\label{fig:impl-cachedata-waitoncompletion}
\end{figure}
To resolve this, a more intricate implementation is required, for which Figure \ref{fig:impl-cachedata-waitoncompletion} shows pseudocode. When waiting, the threads now immediately check whether the cache pointer contains a valid value and return if it does. We will use the same example as before to illustrate the second part of the waiting procedure. Thread \(T_2\) arrives in this latter section, as the cache was invalid at the point in time when waiting was called for. It now waits on the handlers-pointer to change. When \(T_1\) supplies the handlers after submitting work, it also uses \texttt{std::atomic<T>::notify\_one} to wake at least one thread waiting on a value change of the handlers-pointer. Through this, the exclusion that was observable in the first implementation is avoided. The handlers will be atomically set to a valid pointer and a thread may pass the wait. Following this, the handlers-pointer is atomically exchanged, invalidating it and assigning the previous valid value to a local variable. Each thread checks whether it has received a valid pointer to the handlers from the exchange. If it has, then the atomic operation guarantees that it is now in sole possession of the pointer and is the designated master-thread. The master is tasked with waiting, while other threads will now regress and call \texttt{CacheData::WaitOnCompletion} again, leading to a wait on the master thread setting the cache to a valid value. \par
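To complement the pseudocode, the listing below gives a condensed C++ sketch of the waiting procedure. It is deliberately simplified: the member layout is an assumption, the handler type is abstracted into a placeholder, and non-master threads poll with \texttt{std::this\_thread::yield} instead of using the atomic wait and notify mechanism described above. \par
\begin{verbatim}
#include <atomic>
#include <cstdint>
#include <memory>
#include <thread>
#include <vector>

// Placeholder standing in for the Intel DML handler of one submitted task.
struct TaskHandler {
    void wait() { /* would block until the corresponding DSA task completed */ }
};

class CacheData {
public:
    // Constructors and the submission logic setting destination_ are omitted.
    void WaitOnCompletion() {
        for (;;) {
            // fast path: the cache pointer was already published
            if (cache_->load() != nullptr) return;

            // exactly one thread obtains the handlers and becomes the master
            std::vector<TaskHandler>* own = handlers_->exchange(nullptr);
            if (own != nullptr) {
                for (TaskHandler& h : *own) h.wait();  // wait on all chunks
                delete own;
                cache_->store(destination_);  // publish the cache pointer
                return;
            }

            // not the master (or work not yet submitted): re-check later
            std::this_thread::yield();
        }
    }

private:
    // shared between all copies of one CacheData instance (entry reuse),
    // so every copy observes the same state
    std::shared_ptr<std::atomic<uint8_t*>> cache_;
    std::shared_ptr<std::atomic<std::vector<TaskHandler>*>> handlers_;
    uint8_t* destination_ = nullptr;  // assumed member holding the copy target
};
\end{verbatim}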
\caption{Column \texttt{a} located in \glsentryshort{dram} and \texttt{b} in \glsentryshort{hbm}.}
\label{fig:timing-comparison}
\end{figure}
\begin{table}[t]
\centering
\input{tables/table-qdp-baseline.tex}
\caption{Table showing raw timing for \gls{qdp} on \glsentryshort{dram} and \gls{hbm}. Result for \glsentryshort{dram} serves as baseline. \glsentryshort{hbm} presents the upper boundary achievable, as it simulates prefetching without processing overhead and delay.}
We benchmarked two methods to establish a baseline and an upper limit as reference points. \par
From Table \ref{table:qdp-baseline}, it is evident that accessing column \texttt{b} through \gls{hbm} results in an increase in processing speed. To gain a better understanding of how the increased bandwidth of \gls{hbm} accelerates the query, we will delve deeper into the time spent in the different stages of the query execution plan. \par
Due to the higher bandwidth provided by accessing column \texttt{b} through \gls{hbm} in \(AGGREGATE\), processing time is shortened, as demonstrated by comparing Figure \ref{fig:timing-comparison:upplimit} to the baseline depicted in Figure \ref{fig:timing-comparison:baseline}. Consequently, more threads can be assigned to \(SCAN_a\), with \(AGGREGATE\) requiring fewer resources. This explains why the \gls{hbm}-results not only show faster processing times than \gls{dram} for \(AGGREGATE\) but also for \(SCAN_a\). \par
\subsection{Benchmarks using Prefetching}
To address the challenges posed by sharing memory bandwidth between both \(SCAN\)-operations, we will conduct the prefetching benchmarks in two configurations. Firstly, both columns \texttt{a} and \texttt{b} will be situated on the same \gls{numa:node}. We anticipate demonstrating the memory bottleneck in this scenario through increased execution time of \(SCAN_a\). Secondly, we will distribute the columns across two \gls{numa:node}s, both still utilizing \glsentryshort{dram}. In this configuration, the memory bottleneck is alleviated, leading us to anticipate better performance compared to the former setup. \par
\begin{table}[t]
\centering
\input{tables/table-qdp-speedup.tex}
\caption{Table showing Speedup for different \glsentryshort{qdp} Configurations over \glsentryshort{dram}. Result for \glsentryshort{dram} serves as baseline while \glsentryshort{hbm} presents the upper boundary achievable with perfect prefetching. Prefetching was performed with the same parameters and data locations as \gls{dram}, caching on Node 8 (\glsentryshort{hbm} accessor for the executing Node 0). Prefetching with Distributed Columns had columns \texttt{a} and \texttt{b} located on different Nodes.}
\label{table:qdp-speedup}
\end{table}
We now examine Table \ref{table:qdp-speedup}, where a slowdown is shown for prefetching. This drop-off below our baseline when utilizing the \texttt{Cache} may be surprising at first glance. However, it becomes reasonable when we consider that in this scenario, the \gls{dsa}s executing the caching tasks compete for bandwidth with the threads processing \(SCAN_a\), while also adding additional overhead from the \texttt{Cache} and work submission. The second measured configuration for \gls{qdp} is shown as \enquote{Prefetching, Distributed Columns} in Table \ref{table:qdp-speedup}. For this method, distributing the columns across different \gls{numa:node}s results in a noticeable performance increase compared to our baseline, although not reaching the upper boundary set by simulating perfect prefetching (\enquote{HBM (Upper Limit)} in Table \ref{table:qdp-speedup}). This confirms our assumption that \(SCAN_a\) itself is bandwidth-bound, as without this contention, we observe an increase in cache hit rate and decrease in processing time. We will now examine the performance in more detail with per-task timings. \par
\caption{Prefetching with columns \texttt{a} and \texttt{b} located on different \glsentryshort{dram} \glsentryshort{numa:node}s.}
\label{fig:timing-results}
\end{figure}
In Figure \ref{fig:timing-results:prefetch}, the competition for bandwidth between \(SCAN_a\) and \(SCAN_b\) is evident, with \(SCAN_a\) showing significantly longer execution times. \(SCAN_b\) is nearly unaffected, as it offloads memory access to the \gls{dsa} through the \texttt{Cache} and is therefore not slowed by the bandwidth bottleneck. The prolonged execution of \(SCAN_a\) leads to extended overlaps between groups still processing the scan and those already engaged in \(AGGREGATE\). Consequently, despite the relatively high cache hit rate (see Table \ref{table:qdp-speedup}), minimal speed-up is observed for \(AGGREGATE\) compared to the baseline depicted in Figure \ref{fig:timing-comparison:baseline}. The extended total runtime can thus be attributed to the prolonged duration of \(SCAN_a\). \par
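To illustrate this decoupling, the following minimal sketch shows how a scan thread may hand its column accesses to the \texttt{Cache} instead of reading the data itself. The class and method shown here are simplified assumptions for illustration only and do not reproduce the actual interface of our library.
\begin{lstlisting}[language=C++]
#include <cstddef>

// Assumed minimal interface of the Cache class, for illustration only;
// method name and signature are not taken from the actual library.
class Cache {
public:
    // Asynchronously submits a copy of [data, data + size) to the caching
    // destination; in the real library this enqueues DSA work descriptors.
    void Access(const char* data, std::size_t size) { (void)data; (void)size; }
};

// SCAN_b only submits caching work for column b and never reads the column
// itself, which is why it is largely unaffected by the bandwidth contention
// on the DRAM node holding the columns.
void scan_b(Cache& cache, const char* column_b,
            std::size_t chunk_size, std::size_t chunk_count) {
    for (std::size_t i = 0; i < chunk_count; ++i) {
        // Submission returns immediately; the transfer to HBM proceeds on
        // the DSA without consuming CPU cycles or bandwidth in this thread.
        cache.Access(column_b + i * chunk_size, chunk_size);
    }
}
\end{lstlisting}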
Regarding the benchmark with columns \texttt{a} and \texttt{b} distributed across two \gls{numa:node}s (see Figure \ref{fig:timing-results}), the parallel execution of prefetching tasks on the \gls{dsa} does not directly impede the bandwidth available to \(SCAN_a\). However, there is a discernible overhead associated with cache utilization, as evident in the time spent in \(SCAN_b\). Consequently, both \(SCAN_a\) and \(AGGREGATE\) experience slightly longer execution times than the theoretical peak exhibited by our upper limit in Figure \ref{fig:timing-comparison:upplimit}. \par
The necessity to distribute data across \gls{numa:node}s is seen as practical, given that developers commonly apply this optimization to leverage the available memory bandwidth of \glsentrylong{numa}s. Consequently, the \texttt{Cache} has demonstrated its effectiveness by achieving a respectable speed-up positioned directly between the baseline and the theoretical upper limit (see Table \ref{table:qdp-speedup}). \par
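The sketch below illustrates this common optimization using \texttt{libnuma}, allocating the two columns on distinct \glsentryshort{dram} \gls{numa:node}s so that scanning one column and prefetching the other do not contend for the same memory controller. The node indices and element type are placeholders and depend on the system topology.
\begin{lstlisting}[language=C++]
#include <numa.h>     // libnuma; link with -lnuma
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Allocate an array of 64-bit elements on a specific NUMA node.
uint64_t* alloc_on_node(std::size_t elements, int node) {
    void* mem = numa_alloc_onnode(elements * sizeof(uint64_t), node);
    return static_cast<uint64_t*>(mem);  // may be nullptr on failure
}

int main() {
    if (numa_available() < 0) return EXIT_FAILURE;
    constexpr std::size_t elements = 100'000'000;
    uint64_t* column_a = alloc_on_node(elements, 0);  // e.g. DRAM Node 0
    uint64_t* column_b = alloc_on_node(elements, 1);  // e.g. DRAM Node 1
    if (!column_a || !column_b) return EXIT_FAILURE;
    // ... fill the columns and execute the query ...
    numa_free(column_a, elements * sizeof(uint64_t));
    numa_free(column_b, elements * sizeof(uint64_t));
    return EXIT_SUCCESS;
}
\end{lstlisting}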
As stated in Chapter \ref{chap:design}, the decision to design and implement a cache instead of focusing solely on prefetching was made to enhance the usefulness of this work's contribution. While our tests were conducted on a system with \gls{hbm}, other advancements in main memory technologies, such as \gls{nvram}, were not considered. Although the methods of the \texttt{Cache} class are named with cache usage in mind, its utility extends beyond this scope, providing flexibility through the policy functions described in Section \ref{subsec:design:policy-functions}. Potential applications include replication to \gls{nvram} for data loss prevention, or restoring from \gls{nvram} for faster processing. We therefore consider the increase in design complexity a worthwhile trade-off, providing a significant contribution to the field of heterogeneous memory systems. \par
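To make this flexibility more concrete, the following sketch outlines how a placement policy could direct every copy to an \gls{nvram}-backed node, turning the \texttt{Cache} into a replication mechanism. The function signature and the node index are assumptions chosen for illustration; the actual policy interface is the one described in Section \ref{subsec:design:policy-functions}.
\begin{lstlisting}[language=C++]
#include <cstddef>

// Illustrative placement policy; the real signature of the library's policy
// functions may differ. It selects the destination node for a copy of data
// currently residing on data_node, requested by a thread on calling_node.
int ReplicateToNvramPolicy(int calling_node, int data_node,
                           std::size_t data_size) {
    constexpr int kNvramNode = 4;  // assumed NUMA node backed by NVRAM
    (void)calling_node;            // placement is independent of the caller
    (void)data_node;
    (void)data_size;
    return kNvramNode;             // always replicate to the NVRAM node
}
\end{lstlisting}
With such a policy in place, the same asynchronous submission path used for prefetching would produce a persistent replica instead of a bandwidth-optimized copy, without changes to the calling code.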
description={\textsc{Memory Pressure:} Situation where high memory utilization is encountered.}
}
\newglossaryentry{api}{
first={Non-Volatile RAM (NVRAM)},
description={\textsc{\glsentrylong{nvram}:} Main memory technology which, unlike \glsentryshort{dram}, retains data even when power is lost.}
}
\newglossaryentry{mutmet}{
short={mutating method},
name={Mutating Method},
description={\textsc{Mutating Method:} Member function (method) capable of mutating the data of a class's instances. This is the default in C++, unless explicitly prevented by marking the method \texttt{const} or \texttt{static}.}