% geben (für irgendetwas müssen die Betreuer ja auch noch da
% sein).
This bachelor's thesis explores the dynamic landscape of heterogeneous memory systems, characterized by advancements in main memory technologies such as Non-Volatile RAM (NVRAM), High Bandwidth Memory (HBM), and Remote Memory. These systems necessitate strategic decisions regarding data placement to optimize performance, requiring the movement of data across different storage tiers. Consequently, the responsibility for maintaining optimal data placement falls upon the CPU, resulting in a reduction of available cycles for computational tasks. In response to this challenge, Intel has introduced the Data Streaming Accelerator (DSA), which offloads data operations, offering a potential avenue for enhancing efficiency in data-intensive applications. The primary objective of this thesis is to provide a comprehensive analysis and characterization of the architecture and performance of the DSA, along with its application to a domain-specific prefetching methodology aimed at accelerating database queries within heterogeneous memory systems.
This bachelor's thesis explores the dynamic landscape of heterogeneous memory systems, characterized by advancements in main memory technologies such as Non-Volatile RAM (NVRAM), High Bandwidth Memory (HBM), and Remote Memory. Systems equipped with more than one type of main memory necessitate strategic decisions regarding data placement to take advantage of the properties of the different storage tiers. The responsibility for maintaining optimal data placement falls upon the CPU, resulting in a reduction of available cycles for computational tasks. In response to this challenge, Intel has introduced the Data Streaming Accelerator (DSA), which offloads data operations, offering a potential avenue for enhancing efficiency in data-intensive applications. The primary objective of this thesis is to provide a comprehensive analysis and characterization of the architecture and performance of the DSA, along with its application to a domain-specific prefetching methodology aimed at accelerating database queries within heterogeneous memory systems.
% den Rest der Arbeit. Meist braucht man mindestens 4 Seiten dafür, mehr
% als 10 Seiten liest keiner.
The proliferation of various technologies, such as..., has ushered in a diverse landscape of systems characterized by varying tiers of main memory. Within these systems, the movement of data across memory classes becomes imperative to leverage the distinct properties offered by the available technologies. Traditionally tasked with managing data locality, the CPU faces an added burden in heterogeneous memory environments, thereby diminishing available processing cycles. To mitigate this strain on the CPU, certain Intel Server CPUs now feature the \glsentryfirst{dsa}\cite{intel:xeonbrief}. \par
The proliferation of various technologies, such as Non-Volatile RAM (NVRAM), High Bandwidth Memory (HBM), and Remote Memory, has ushered in a diverse landscape of systems characterized by varying tiers of main memory. Within these systems, the movement of data across memory classes becomes imperative to leverage the distinct properties offered by the available technologies. Traditionally tasked with managing data locality, the CPU faces an added burden in heterogeneous memory environments, thereby diminishing available processing cycles. To mitigate this strain on the CPU, certain Intel Server Processors now feature the \glsentryfirst{dsa}, to which certain data operations may be offloaded\cite{intel:xeonbrief}. With it, this thesis undertakes the challenge of optimizing data locality on \glsentrylong{numa}s. \par
In response to these challenges, this thesis undertakes the intricate task of optimizing data movement operations. At the core of this endeavor lies the introduction of the \gls{dsa}, which plays a pivotal role in enhancing streaming data movement operations across diverse applications. A thorough understanding of the architecture and functionality of \gls{dsa} is essential in addressing the challenges posed by this new form of \gls{numa}. \par
The primary objectives of this thesis are twofold. Firstly, it involves a comprehensive analysis and characterization of the architecture of the Intel \gls{dsa}. Secondly, the focus extends to the application of \gls{dsa} in the domain-specific context of \glsentryfirst{qdp} to accelerate database queries \cite{dimes-prefetching}. \par
The primary objectives of this thesis are twofold. Firstly, it involves a comprehensive analysis and characterization of the Intel \gls{dsa} architecture. Secondly, the focus extends to the application of \gls{dsa} in the domain-specific context of \glsentryfirst{qdp} to accelerate database queries \cite{dimes-prefetching}. This thesis seeks to explore how \gls{dsa} can be strategically utilized to tackle the challenges posed by heterogeneous memory systems, offering insights into the integration of data streaming acceleration with intelligent prefetching. \par
This work introduces significant contributions to the field. Notably, the design and implementation of an offloading cache represent a key highlight, providing an interface for leveraging the strengths of tiered storage with minimal integration efforts. The code for this is made available in the accompanying repository \cite{thesis-repo} under 'offloading-cacher'. Additionally, the thesis includes a detailed examination and analysis of the strengths and weaknesses of the \gls{dsa} through microbenchmarks. These benchmarks serve as practical guidelines, offering insights for the optimal application of \gls{dsa} in various scenarios. As of the time of writing, this thesis stands as the first scientific work to extensively evaluate the \gls{dsa} in a multi-socket system and provide benchmarks for programming through the \glsentryfirst{dml}. Furthermore, performance for data movement from \glsentryshort{dram} to \glsentryfirst{hbm} using \gls{dsa} has not yet been evaluated by the scientific community. \par
This work introduces significant contributions to the field. Notably, the design and implementation of an offloading cache represent a key highlight, providing an interface for leveraging the strengths of tiered storage with minimal integration efforts. Additionally, the thesis includes a detailed examination and analysis of the strengths and weaknesses of the \gls{dsa} through microbenchmarks. These benchmarks serve as practical guidelines, offering insights for the optimal application of \gls{dsa} in various scenarios. As of the time of writing, this thesis stands as the first scientific work to extensively evaluate the \gls{dsa} in a multi-socket system and provide benchmarks for programming through the \glsentryfirst{intel:dml}. Furthermore, performance for data movement from \glsentryshort{dram} to \glsentryfirst{hbm} using \gls{dsa} has not yet been evaluated by the scientific community. \par
The Technical Background chapter furnishes the reader with pertinent background information necessary for understanding the subsequent sections of this work, encompassing \gls{hbm}, \gls{qdp}, and \gls{dsa} along with its programming interface \cite{intel:dmldoc}. Additionally, guidance on system setup and configuration is provided. Subsequently, the Performance Microbenchmarks section analyzes the strengths and weaknesses of the \gls{dsa}. Methodologies are presented, each benchmark is elaborated upon in detail, and usage guidance is drawn from the results. The following sections, Design and Implementation, elucidate the practical aspects of the work, including the development of the interface and implementation for an offloading cache, shedding light on specific design considerations and implementation challenges. The Evaluation section offers a comprehensive assessment of the implemented solution. In Conclusion, insights gained are reflected upon, and the contributions and results of the preceding chapters are reviewed. \par
\caption{Scaling Factor for different amounts of participating \gls{dsa}. Determined by formula \\\(\frac{Throughput}{Basline\ Throughput}*\frac{1}{Utilization\ Factor}\)\\ with the baseline being Throughput for 1 \gls{dsa} and the utilization factor representing the factor of the amount of \gls{dsa}s being used over the baseline.}
% nichts zu suchen, auch nicht im Anhang, sondern gehören auf Rechner,
% auf denen man sie sich ansehen kann.
In this chapter, we concentrate on specific implementation details, offering an in-depth view of how the design promises outlined in Chapter \ref{chap:design} are realized. Firstly, we delve into the usage of locking and atomics to achieve thread safety. Subsequently, we provide an example of the policy functions alluded to in Section \ref{sec:design:accel-usage}. Finally, we apply the cache to \glsentrylong{qdp}. \par
In this chapter, we concentrate on specific implementation details, offering an in-depth view of how the design promises outlined in Chapter \ref{chap:design} are realized. Firstly, we delve into the usage of locking and atomics to achieve thread safety. Finally, we apply the cache to \glsentrylong{qdp}, detailing the policies mentioned in Section \ref{sec:design:accel-usage} and presenting solutions for the challenges encountered. \par
\section{Locking and Usage of Atomics}
@ -28,7 +28,7 @@ The usage of locking and atomics has proven to be challenging. Their use is perf
\subsection{Cache State Lock}\label{subsec:implementation:cache-state-lock}
To keep track of the current cache state the \texttt{Cache} will hold a reference to each currently existing \texttt{CacheData} instance. The reason for this is twofold: In Section \ref{sec:design:cache} we decided to keep elements in the cache until forced by memory pressure to remove them. Secondly in Section \ref{subsec:design:cache-entry-reuse} we decided to reuse one cache entry for multiple consumers. The second part requires access to the structure holding this reference to be thread safe when accessing and modifying the cache state in \texttt{Cache::Access}, \texttt{Cache::Flush} and \texttt{Cache::Clear}. The latter two both require unique locking, preventing other calls to \texttt{Cache} from making progress while the operation is being processed. For \texttt{Cache::Access} the use of locking depends upon the caches state. At first, only a shared lock is acquired for checking whether the given address already resides in cache, allowing other \texttt{Cache::Access}-operations to also perform this check. If no entry for the region is present, a unique lock is required as well when adding the newly created entry to cache. \par
To keep track of the current cache state the \texttt{Cache} will hold a reference to each currently existing \texttt{CacheData} instance. The reason for this is twofold: In Section \ref{sec:design:cache} we decided to keep elements in the cache until forced by \gls{mempress} to remove them. Secondly in Section \ref{subsec:design:cache-entry-reuse} we decided to reuse one cache entry for multiple consumers. The second part requires access to the structure holding this reference to be thread safe when accessing and modifying the cache state in \texttt{Cache::Access}, \texttt{Cache::Flush} and \texttt{Cache::Clear}. The latter two both require unique locking, preventing other calls to \texttt{Cache} from making progress while the operation is being processed. For \texttt{Cache::Access} the use of locking depends upon the caches state. At first, only a shared lock is acquired for checking whether the given address already resides in cache, allowing other \texttt{Cache::Access}-operations to also perform this check. If no entry for the region is present, a unique lock is required as well when adding the newly created entry to cache. \par
A map-datastructure was chosen to represent the current cache state with the key being the memory address of the entry and as value the \texttt{CacheData} instance. As the caching policy is controlled by the user, one datum may be requested for caching in multiple locations. To accommodate this, one map is allocated for each available \glsentrylong{numa:node} of the system. This can be exploited to reduce lock contention by separately locking each \gls{numa:node}'s state instead of utilizing a global lock. This ensures that \texttt{Cache::Access} and the implicit \texttt{Cache::Flush} it may cause can not hinder progress of caching operations on other \gls{numa:node}s. Both \texttt{Cache::Clear} and a complete \texttt{Cache::Flush} as callable by the user will now iteratively perform their respective task per \gls{numa:node} state, also allowing other \gls{numa:node} to progress.\par
@ -53,9 +53,9 @@ Therefore, the decision was made to implement atomic reference counting for \tex
Due to the possibility of access by multiple threads, the implementation of \texttt{CacheData::WaitOnCompletion} proved to be challenging. In the first implementation, a thread would check if the handlers are available and atomically wait \cite{cppreference:atomic-wait} on a value change from nullptr, if they are not. As the handlers are only available after submission, a situation could arise where only one copy of \texttt{CacheData} is capable of actually waiting on them. \par
Due to the possibility of access by multiple threads, the implementation of \texttt{CacheData::WaitOnCompletion} proved to be challenging. In the first implementation, a thread would check if the handlers are available and atomically wait \cite{cppreference:atomic-wait} on a value change from \texttt{nullptr}, if they are not. As the handlers are only available after submission, a situation could arise where only one copy of \texttt{CacheData} is capable of actually waiting on them. \par
To illustrate this, an exemplary scenario is used, as seen in the sequence diagram Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}. Assume that three threads \(T_1\), \(T_2\) and \(T_3\) wish to access the same resource. \(T_1\) is the first to call \texttt{CacheData::Access} and therefore adds it to the cache state and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted and \(T_2\) and \(T_3\) obtain access to the incomplete \texttt{CacheData} on which they wait, causing them to see a nullptr for the handlers but invalid cache pointer, leading to atomic wait on the cache pointer (marked blue lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). \(T_1\) submits the work and sets the handlers (marked red lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) and \(T_3\) continue to wait. Therefore, only \(T_1\) can trigger the waiting and is therefore capable of keeping \(T_2\) and \(T_3\) from progressing. This is undesirable as it can lead to deadlocking if by some reason \(T_1\) does not wait and at the very least may lead to unnecessary delay for \(T_2\) and \(T_3\) if \(T_1\) does not wait immediately. \par
To illustrate this, an exemplary scenario is used, as seen in the sequence diagram Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}. Assume that three threads \(T_1\), \(T_2\) and \(T_3\) wish to access the same resource. \(T_1\) is the first to call \texttt{CacheData::Access} and therefore adds it to the cache state and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted and \(T_2\) and \(T_3\) obtain access to the incomplete \texttt{CacheData} on which they wait, causing them to see a \texttt{nullptr} for the handlers but invalid cache pointer, leading to atomic wait on the cache pointer (marked blue lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). \(T_1\) submits the work and sets the handlers (marked red lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) and \(T_3\) continue to wait. Therefore, only \(T_1\) can trigger the waiting and is therefore capable of keeping \(T_2\) and \(T_3\) from progressing. This is undesirable as it can lead to deadlocking if by some reason \(T_1\) does not wait and at the very least may lead to unnecessary delay for \(T_2\) and \(T_3\) if \(T_1\) does not wait immediately. \par
\begin{figure}[h!tb]
\centering
@ -64,14 +64,14 @@ To illustrate this, an exemplary scenario is used, as seen in the sequence diagr
\label{fig:impl-cachedata-waitoncompletion}
\end{figure}
As a solution for this, a more intricate implementation is required. When waiting, the threads now immediately check whether the cache pointer contains a valid value and return if it does, as nothing has to be waited for in this case. We will use the same example as before to illustrate the second part of the waiting procedure. Both \(T_2\) and \(T_3\) arrive in this latter section as the cache was invalid at the point in time when waiting was called for. They now atomically wait on the handlers-pointer to change, instead of doing it the other way around as before. Now when \(T_1\) supplies the handlers, it also uses \texttt{std::atomic<T>::notify\_one}\cite{cppreference:atomic-notify-one} to wake at least one thread waiting on value change of the handlers-pointer, if there are any. Through this the exclusion that was observable in the first implementation is already avoided. If nobody is waiting, then the handlers will be set to a valid pointer and a thread may pass the atomic wait instruction later on. Following this wait, the handlers-pointer is atomically exchanged \cite{cppreference:atomic-exchange} with nullptr, invalidating it. Each thread again checks whether it has received a valid local pointer to the handlers from the exchange. If it has then the atomic operation guarantees that is now in sole possession of the pointer. The owning thread is tasked with actually waiting. All other threads will now regress and call \texttt{CacheData::WaitOnCompletion} again. The solo thread may proceed to wait on the handlers and should update the cache pointer. \par
As a solution for this, a more intricate implementation is required. When waiting, the threads now immediately check whether the cache pointer contains a valid value and return if it does, as nothing has to be waited for in this case. We will use the same example as before to illustrate the second part of the waiting procedure. Both \(T_2\) and \(T_3\) arrive in this latter section as the cache was invalid at the point in time when waiting was called for. They now atomically wait on the handlers-pointer to change, instead of doing it the other way around as before. Now when \(T_1\) supplies the handlers, it also uses \texttt{std::atomic<T>::notify\_one}\cite{cppreference:atomic-notify-one} to wake at least one thread waiting on value change of the handlers-pointer, if there are any. Through this the exclusion that was observable in the first implementation is already avoided. If nobody is waiting, then the handlers will be set to a valid pointer and a thread may pass the atomic wait instruction later on. Following this wait, the handlers-pointer is atomically exchanged \cite{cppreference:atomic-exchange} with \texttt{nullptr}, invalidating it. Each thread again checks whether it has received a valid local pointer to the handlers from the exchange. If it has then the atomic operation guarantees that is now in sole possession of the pointer. The owning thread is tasked with actually waiting. All other threads will now regress and call \texttt{CacheData::WaitOnCompletion} again. The solo thread may proceed to wait on the handlers and should update the cache pointer. \par
Additional cases must be considered for the latter implementation to be safe and free of deadlocks. We will now discuss these edge cases and their resolution. \par
We previously mentioned the possibly problematic situation where both the cache pointer and the handlers are not yet available for an instance in \texttt{CacheData}. This situation is avoided explicitly by the implementation due to waiting on the handlers being atomically updated from nullptr to valid. When the handlers will be set in the future by the thread calling \texttt{Cache::Access} first, progress is guaranteed. \par
We previously mentioned the possibly problematic situation where both the cache pointer and the handlers are not yet available for an instance in \texttt{CacheData}. This situation is avoided explicitly by the implementation due to waiting on the handlers being atomically updated from \texttt{nullptr} to valid. When the handlers will be set in the future by the thread calling \texttt{Cache::Access} first, progress is guaranteed. \par
\subsubsection{Invalid State on Immediate Destruction}
@ -83,9 +83,9 @@ To circumvent this deadlock, the initial state of \texttt{CacheData} was modifie
\subsubsection{Invalid State on Operation Failure}
\texttt{CacheData::WaitOnCompletion} first checks for a valid cache pointer and then waits on the handlers becoming valid. To process the handlers, the global atomic pointer is read into a local copy and then set to nullptr using \texttt{std::atomic<T>::exchange}. During evaluation of the handlers completion states, an unsuccessful operation may be found. In this case, the cache memory region remains invalid and may therefore not be used. In this case, both the handlers and the cache pointer will be nullptr. This results in an invalid state, like the one discussed in Section \ref{subsubsec:impl:cdatomicity:initial-invalid-state}. \par
\texttt{CacheData::WaitOnCompletion} first checks for a valid cache pointer and then waits on the handlers becoming valid. To process the handlers, the global atomic pointer is read into a local copy and then set to \texttt{nullptr} using \texttt{std::atomic<T>::exchange}. During evaluation of the handlers completion states, an unsuccessful operation may be found. In this case, the cache memory region remains invalid and may therefore not be used. In this case, both the handlers and the cache pointer will be \texttt{nullptr}. This results in an invalid state, like the one discussed in Section \ref{subsubsec:impl:cdatomicity:initial-invalid-state}. \par
In this invalid state, progress is not guaranteed by the measures set forth to handle the initial invalidity. The cache is still nullptr and as the handlers have already been set and processed, they will also be nullptr without the chance of them ever becoming valid. \par
In this invalid state, progress is not guaranteed by the measures set forth to handle the initial invalidity. The cache is still \texttt{nullptr} and as the handlers have already been set and processed, they will also be \texttt{nullptr} without the chance of them ever becoming valid. \par
Edge case handling is introduced and the cache pointer is set to the source address, providing validity. \par
@ -93,27 +93,24 @@ Edge case handling is introduced and the cache pointer is set to the source addr
The guarantee of \texttt{std::atomic<T>::wait} to only wake up when the value has changed \cite{cppreference:atomic-wait} was found to be stronger than the promise of waking up all waiting threads with \texttt{std::atomic<T>::notify\_all}\cite{cppreference:atomic-notify-all}. \par
As visible in Figure \ref{fig:impl-cachedata-waitoncompletion}, we wait while the handlers-pointer is nullptr, if the cache pointer is invalid. To exemplify we use the following scenario. Both \(T_1\) and \(T_2\) call \texttt{CacheData::WaitOnCompletion}, with \(T_1\) preceding \(T_2\). \(T_1\) exchanges the global handlers-pointer with nullptr, invalidating it. Before \(T_1\) can check the status of the handlers and update the cache pointer, \(T_2\) sees an invalid cache pointer and then waits for the handlers becoming available. \par
As visible in Figure \ref{fig:impl-cachedata-waitoncompletion}, we wait while the handlers-pointer is \texttt{nullptr}, if the cache pointer is invalid. To exemplify we use the following scenario. Both \(T_1\) and \(T_2\) call \texttt{CacheData::WaitOnCompletion}, with \(T_1\) preceding \(T_2\). \(T_1\) exchanges the global handlers-pointer with \texttt{nullptr}, invalidating it. Before \(T_1\) can check the status of the handlers and update the cache pointer, \(T_2\) sees an invalid cache pointer and then waits for the handlers becoming available. \par
This has again caused a similar state of invalidity as the previous two Sections handled. As the handlers will not become available again due to being cleared by \(T_1\), the second consumer, \(T_2\), will now wait indefinitely. This missed update is commonly referred to as \enquote{ABA-Problem} for which multiple solutions exist. \par
One could use double-width atomic operations and introduce a counter which would allow resetting the pointer back to null while setting a flag indicating the exchange took place. The handlers-pointer would then be contained in a struct with this flag, allowing exchange with a composite of nullptr and flag-set. Other threads then would then wait on the struct changing from nullptr and flag-unset, allowing them to pass if either the flag is set or the handlers have become non-null. As standard C++ does not yet support the required operations, we chose to avoid the missed update differently. \cite{dwcas-cpp}\par
One could use double-width atomic operations and introduce a counter which would allow resetting the pointer back to null while setting a flag indicating the exchange took place. The handlers-pointer would then be contained in a struct with this flag, allowing exchange with a composite of \texttt{nullptr} and flag-set. Other threads then would then wait on the struct changing from \texttt{nullptr} and flag-unset, allowing them to pass if either the flag is set or the handlers have become non-null. As standard C++ does not yet support the required operations, we chose to avoid the missed update differently. \cite{dwcas-cpp}\par
The chosen solution for this is to not exchange the handlers-pointer with nullptr but with a second invalid value. We must determine a secondary invalid pointer for use in the exchange. Therefore, we introduce a new attribute, of the same type as the one pointed to by the handlers-pointer, to \texttt{Cache}. The \texttt{Cache} then shares it with each instance of \texttt{CacheData}, where it is then used in \texttt{CacheData::WaitOnCompletion}. \par
The chosen solution for this is to not exchange the handlers-pointer with \texttt{nullptr} but with a second invalid value. We must determine a secondary invalid pointer for use in the exchange. Therefore, we introduce a new attribute, of the same type as the one pointed to by the handlers-pointer, to \texttt{Cache}. The \texttt{Cache} then shares it with each instance of \texttt{CacheData}, where it is then used in \texttt{CacheData::WaitOnCompletion}. \par
This secondary value allows \(T_2\) to pass the wait, then perform the exchange of handlers itself. \(T_2\) then checks the local copy of the handlers-pointer for validity. The invalid state now includes both nullptr and the secondary invalid pointer chosen. With this, the deadlock is avoided and \(T_2\) will wait for \(T_1\) completing the processing of the handlers. \par
\section{Accelerator Usage}
After \ref{sec:design:accel-usage} the implementation of \texttt{Cache} provided leaves it up to the user to choose a caching and copy method policy which is accomplished through submitting function pointers at initialization of the \texttt{Cache}. In \ref{sec:state:setup-and-config} we configured our system to have separate \gls{numa:node}s for accessing \gls{hbm} which are assigned a \gls{numa:node}-ID by adding eight to the \gls{numa:node}s ID of the \gls{numa:node} that physically contains the \gls{hbm}. Therefore, given \gls{numa:node} 3 accesses some datum, the most efficient placement for the copy would be on \gls{numa:node}\(3+8=11\). As the \texttt{Cache} is intended for multithreaded usage, conserving accelerator resources is important, so that concurrent cache requests complete quickly. To get high per-copy performance while maintaining low usage, the Push-Pull method is selected as described in \ref{subsec:perf:datacopy} for larger copies, while small copies will be handled exclusively by the current node. We introduce this distinction by datum size due to the observations in Section \ref{subsec:perf:submitmethod} that with lower transfer sizes, the effect of submission cost is amplified. \par
This secondary value allows \(T_2\) to pass the wait, then perform the exchange of handlers itself. \(T_2\) then checks the local copy of the handlers-pointer for validity. The invalid state now includes both \texttt{nullptr} and the secondary invalid pointer chosen. With this, the deadlock is avoided and \(T_2\) will wait for \(T_1\) completing the processing of the handlers. \par
\section{Application to \glsentrylong{qdp}}
\label{sec:impl:application}
Applying the \texttt{Cache} to \gls{qdp} is straightforward. We adapted the benchmarking code developed by Anna Bartuschka and André Berthold \cite{dimes-prefetching}, calling \texttt{Cache::Access} for both prefetching and cache access. \par
Applying the \texttt{Cache} to \gls{qdp} is a straightforward process. We adapted the benchmarking code developed by Anna Bartuschka and André Berthold \cite{dimes-prefetching}, invoking Cache::Access for both prefetching and cache access. Due to the high amount of smaller submissions, we decided to forego splitting of tasks unto multiple \gls{dsa} and instead distribute the copy tasks per thread in round-robin fashion to all available. This causes less delay due to submission cost which, as shown in Section \ref{subsec:perf:submitmethod}, rises with smaller tasks. The cache location is fixed to \gls{numa:node} 8, the \gls{hbm} accessor of \gls{numa:node} 0 to which the application will be bound and therefore exclusively run on. \par
During performance analysis of the developed \texttt{Cache}, we found that \gls{intel:dml} does not utilize interrupt-based completion signalling (Section \ref{subsubsec:state:completion-signal}), but instead busy-waits on the completion descriptor being updated. As this busy waiting costs CPU cycles, waiting on task completion is out of question, requiring code modifications which we now detail. \par
During the performance analysis of the developed \texttt{Cache}, we discovered that \gls{intel:dml} does not utilize interrupt-based completion signaling (Section \ref{subsubsec:state:completion-signal}), but instead employs busy-waiting on the completion descriptor being updated. Given that this busy waiting incurs CPU cycles, waiting on task completion is deemed impractical, necessitating code modifications. We extended \texttt{CacheData} and Cache to incorporate support for weak waiting. By introducing a flag configurable in \texttt{Cache}, all instances of \texttt{CacheData} created via \texttt{Cache::Access} will check only once whether the \gls{dsa} has completed processing \texttt{Cache} operation, and if not, return without updating the cache-pointer. Consequently, calls to \texttt{CacheData::GetDataLocation} may return \texttt{nullptr} even after waiting, placing the responsibility on the user to access the data through its source location. For applications prioritizing latency, \texttt{Cache::Access} offers the option for weak access. When activated, the function returns only existing instances of \texttt{CacheData}, thereby avoiding work submission to the \gls{dsa} if the address has not been previously cached or was flushed since the last access. Using these two options, we can avoid work submission and busy waiting where access latency is paramount. \par
We extended \texttt{CacheData} and \texttt{Cache} to provide support for weak waiting. Through a flag configurable in \texttt{Cache}, all instances of \texttt{CacheData} created through \texttt{Cache::Access} will only check once whether the \gls{dsa} has completed processing the operation, and otherwise return without updating the cache-pointer. Calls to \texttt{CacheData::GetDataLocation} can therefore return nullptr, even after waiting. The user is then responsible to access the data through its source location. For latency-critical applications, \texttt{Cache::Access} provides the option for weak access. When set, the function will only return an existing instance of \texttt{CacheData} and therefore does not cause work submission to the \gls{dsa}. \par
Additionally, we observed inefficiencies stemming from page fault handling. Task execution time increases when page faults are handled by the \gls{dsa}, leading to cache misses. Consequently, our execution time becomes bound to that of \gls{dram}, as misses prompt a fallback to the data's source location. When page faults are handled by the CPU during allocation, these misses are avoided. However, the execution time of the first data access through the \texttt{Cache} significantly increases due to page fault handling. One potential solution entails bypassing the system's memory management by allocating a large memory block and implementing a custom memory management scheme. As memory allocation is a complex topic, we opted to delegate this responsibility to the user by mandating the provision of new- and free-like functions akin to the policy functions utilized for determining placement and task distribution. Consequently, the benchmark can pre-allocate the required memory blocks, trigger page mapping, and subsequently pass these regions to the \texttt{Cache}. \par
% bezüglich der Ergebnisse zu erläutern und anschließend eventuell
% festgestellte Abweichungen zu erklären.
In this chapter we will define our expectations, applying the developed Cache to \glsentrylong{qdp}. To measure the performance, we adapted code developed by colleagues André Berthold and Anna Bartuschka for evaluating \gls{qdp} in \cite{dimes-prefetching}. \par
In this chapter we will define our expectations, applying the developed Cache to \glsentrylong{qdp}, and then evaluate the observed results. The code used is described in more detail in Section \ref{sec:impl:application}. \par
\section{Benchmarked Task}
\label{sec:eval:bench}
The benchmark executes a simple query as illustrated in Figure \ref{fig:qdp-simple-query}. We will from hereinafter use notations \(SCAN_a\) for the pipeline that performs scan and subsequently filter on column \texttt{a}, \(SCAN_b\) for the pipeline that prefetches column \texttt{b} and \(AGGREGATE\) for the projection and final summation step. We use a column size of 4 GiB and divide this work over 32 groups, assigning one thread to each pipeline per group \todo{maybe describe that the high group count was chosen to allow the implementation without overlapping pipeline execution to still perform well}\todo{consider moving to overlapping pipelines if enough time is found}. For configurations not performing prefetching, \(SCAN_b\) is not executed. We measure the times spent in each pipeline, cache hit percentage and total processing time. \par
Pipelines \(SCAN_a\) and \(SCAN_b\) execute concurrently, completing their workload and then signalling \(AGGREGATE\), which then finalizes the operation. With the goal of improving cache hit rate, we decided to loosen this restriction and let \(SCAN_b\) work freely, only synchronizing \(SCAN_a\) with \(AGGREGATE\). Work is therefore submitted to the \gls{dsa} as frequently as possible, thereby hopefully completing each caching operation for a chunk of \texttt{b} before \(SCAN_a\) finishes processing the associated chunk of \texttt{a}. \par
To ensure
\section{Expectations}
\label{sec:eval:expectations}
The simple query presents a challenging scenario to the \texttt{Cache}. As the filter operation applied to column \texttt{a} is not particularly complex, its execution time can be assumed to be short. Therefore, the \texttt{Cache} has little time during which it must prefetch, which will amplify delays caused by processing overhead in the \texttt{Cache} or during accelerator offload. Additionally, it can be assumed that the task is memory bound. As the prefetching of \texttt{b} in \(SCAN_b\) and the load and subsequent filter of \texttt{a} in \(SCAN_a\) will execute in parallel, caching therefore directly reduces the memory bandwidth available to \(SCAN_a\), when both columns are located on the same \gls{numa:node}. \par
Due to the challenges posed by sharing memory bandwidth we will benchmark prefetching in two configurations. The first will find both columns \texttt{a} and \texttt{b} located on the same \gls{numa:node}. We expect to demonstrate the memory bottleneck in this situation by execution time of \(SCAN_a\) rising by the amount of time spent prefetching in \(SCAN_b\). The second setup will see the columns distributed over two \gls{numa:node}s, still \glsentryshort{dram}, however. In this configuration \(SCAN_a\) should only suffer from the additional threads performing the prefetching. \par
\section{Observations}
In this section we will present our findings from applying the \texttt{Cache} developed in Chatpers \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}. We begin by presenting the results without prefetching, representing the upper and lower boundaries respectively. \par
\caption{Column \texttt{a} located in \glsentryshort{dram} and \texttt{b} in \glsentryshort{hbm}.}
\label{fig:timing-comparison:upplimit}
\end{subfigure}
\caption{Time spent on functions \(SCAN_a\) and \(AGGREGATE\) without prefetching for different locations of column \texttt{b}. Figure (a) represents the lower boundary by using only \glsentryshort{dram}, while Figure (b) simulates perfect caching by storing column \texttt{b} in \glsentryshort{hbm} during benchmark setup.}
\label{fig:timing-comparison}
\end{figure}
The benchmark executes a simple query as illustrated in Figure \ref{fig:eval-simple-query} which presents a challenging scenario to the cache. As the filter operation applied to \texttt{a} is not particularly complex, its execution time can be assumed to be short. Therefore, the Cache has little time during which it must prefetch, which will amplify delays caused by processing overhead in the Cache itself or from submission to the Work Queue. This makes the chosen query suited to stress test the developed solution. \par
Our baseline will be the performance achieved with both columns \texttt{a} and \texttt{b} located in \glsentryshort{dram} and no prefetching. The upper limit will be represented by measuring the scenario where \texttt{b} is already located in \gls{hbm} at the start of the benchmark, simulating prefetching with no overhead or delay. \par
\caption{Prefetching with columns \texttt{a} and \texttt{b} located on different \glsentryshort{dram}\glsentryshort{numa:node}s.}
\label{fig:timing-results:distprefetch}
\end{subfigure}
\caption{Time spent on functions \(SCAN_a\), \(SCAN_b\) and \(AGGREGATE\) with prefetching. Operations \(SCAN_a\) and \(SCAN_b\) execute concurrently. Figure (a) shows bandwidth limitation as time for \(SCAN_a\) increases drastically due to the copying of column \texttt{b} to \glsentryshort{hbm} taking place in parallel. For Figure (b), the columns are located on different \glsentryshort{numa:node}s, thereby the \(SCAN\)-operations do not compete for bandwidth.}
\label{fig:timing-results}
\end{figure}
With this difficult scenario, we expect to spend time analysing runtime behaviour of our benchmark in order to optimize the Cache and the way it is applied to the query. Optimizations should yield slight performance improvement over the baseline, using DRAM, and will not reach the theoretical peak, where the data for \texttt{b} resides in HBM. \par
\begin{itemize}
\item Fig \ref{fig:timing-results:distprefetch} aggr for prefetch theoretically at level of hbm but due to more concurrent workload aggr gets slowed down too
\item Fig \ref{fig:timing-results:prefetch} scana increase due to sharing bandwidth reasonable, aggr seems a bit unreasonable
\end{itemize}
Consider using parts of flamegraph. Same speed as dram, even though allocation is performed in the timed region and not before. Mention dml performs busy waiting (cite dsa-paper 4.4 for use of interrupts mentioned in arch), optimization with weak wait. Mention optimization weak access for prefetching scenario.
\todo{consider benchmarking only with one group and lowering wl size accordingly to reduce effects of overlapping on the graphic}