commit first draft of chapter 7, modifications to other chapters include adding labels for back-references and slight rephrasing during skimming of paragraphs
@ -39,6 +39,7 @@ This chapter introduces the relevant technologies and concepts for this thesis.
\todo{find a nice graphic for hbm}
\section{\glsentrylong{qdp}}
\label{sec:state:qdp}
\begin{figure}[h!tb]
\centering
@ -50,6 +51,7 @@ This chapter introduces the relevant technologies and concepts for this thesis.
\gls{qdp} introduces a targeted strategy for optimizing database performance by intelligently prefetching relevant data. To achieve this, \gls{qdp} analyses queries, splitting them into distinct sub-tasks, resulting in the so-called query execution plan. An example of a query and a corresponding plan is depicted in Figure \ref{fig:qdp-simple-query}. From this plan, \gls{qdp} determines which columns of the database are used in subsequent tasks. Once identified, the system proactively copies these columns into faster memory during the execution of the pipeline. In the example (Figure \ref{fig:qdp-simple-query}), column \texttt{b} is accessed in \(SCAN_b\) and \(G_{sum(b)}\), while column \texttt{a} is only accessed in \(SCAN_a\). Therefore, only column \texttt{b} will be chosen for prefetching in this scenario. \cite{dimes-prefetching}\par
\section{\glsentrylong{dsa}}
\label{sec:state:dsa}
\blockquote{Intel \gls{dsa} is a high-performance data copy and transformation accelerator that will be integrated in future Intel® processors, targeted for optimizing streaming data movement and transformation operations common with applications for high-performance storage, networking, persistent memory, and various data processing applications. \cite[Ch. 1]{intel:dsaspec}}
@ -72,7 +74,7 @@ The \gls{dsa} chip is directly integrated into the processor and attaches via th
\subsubsection{Architectural Components}
\textsc{Component \rom{1}, \glsentrylong{dsa:wq}:} \glsentryshort{dsa:wq}s provide the means to submit tasks to the device and will be described in more detail shortly. They are marked yellow in Figure \ref{fig:dsa-internal-block}. A \gls{dsa:wq} is accessible through so-called portals, light blue in Figure \ref{fig:dsa-internal-block}, which are memory-mapped regions. Work is submitted by writing a descriptor to one of these portals. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue. There are two possible queue types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes, and each group may only have one attached. A \gls{pcie-dmr}, which guarantees implicit synchronization, is generated via \gls{x86:enqcmd} and communicates with the device before writing \cite[Sec. 3.3.1]{intel:dsaspec}. This may result in a higher submission cost compared to the \gls{dsa:dwq}, to which a descriptor is submitted via \gls{x86:movdir64b} \cite[Sec. 3.3.2]{intel:dsaspec}. \par
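The difference between the two submission paths can be illustrated with a short sketch. The following is not taken from the \gls{dsa} specification or from our implementation; it treats the descriptor as an opaque, 64-byte-aligned block and assumes the portal of the respective queue has already been mapped into the process. The intrinsics require the \texttt{movdir64b} and \texttt{enqcmd} instruction-set extensions to be enabled at compile time. \par
\begin{verbatim}
#include <immintrin.h>
#include <cstdint>

// Illustrative only: a fully prepared 64-byte descriptor, treated as an
// opaque block here; its actual fields are defined in the DSA spec.
struct alignas(64) Descriptor { uint8_t bytes[64]; };

// Dedicated work queue: MOVDIR64B posts the descriptor without any
// feedback from the device.
void submit_dwq(void* portal, const Descriptor* desc) {
    _movdir64b(portal, desc);
}

// Shared work queue: ENQCMD reports via the zero flag whether the
// descriptor was accepted; a non-zero return means the queue was full
// and the submission has to be retried.
void submit_swq(void* portal, const Descriptor* desc) {
    while (_enqcmd(portal, desc)) {
        // queue full, retry
    }
}
\end{verbatim}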
\textsc{Component \rom{2}, Engine:} An Engine is the processing block that connects to memory and performs the described task. To handle the different descriptors, each Engine has two internal execution paths: one for task and one for batch descriptors. Processing a task descriptor is straightforward, as all information required to complete the operation is contained within it \cite[Sec. 3.2]{intel:dsaspec}. For a batch, the \gls{dsa} reads the batch descriptor, then fetches all task descriptors from memory and processes them \cite[Sec. 3.8]{intel:dsaspec}. If configured to do so, an Engine can coordinate with the operating system when it encounters a page fault and wait for its resolution; otherwise, an error is generated in this scenario \cite[Sec. 2.2, Block on Fault]{intel:dsaspec}. \par
@ -49,7 +49,7 @@ When multiple consumers wish to access the same memory block through the \texttt
Allowing multiple references to the same entry introduces concerns regarding memory management. The allocated block should only be freed when all copies of a \texttt{CacheData} instance are destroyed, thereby tying the cache entry's lifetime to the longest-living copy of the original instance. This ensures that access to the entry is legal during the lifetime of any \texttt{CacheData} instance. Therefore, deallocation only occurs when the last copy of a \texttt{CacheData} instance is destroyed. \par
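The following minimal sketch illustrates this ownership model; the names and members are illustrative and do not mirror the actual implementation discussed in the following sections. The cached block is owned by a reference-counted handle, so copying \texttt{CacheData} shares ownership and the block is only released once the last copy is destroyed. \par
\begin{verbatim}
#include <cstdlib>
#include <memory>

// Sketch: copies of CacheData share ownership of the cached block via a
// reference count; the deleter runs only when the last copy is destroyed.
class CacheData {
public:
    explicit CacheData(void* block)
        : block_(block, [](void* p) { std::free(p); }) {}
    CacheData(const CacheData&) = default;  // copying shares ownership
    void* GetDataLocation() const { return block_.get(); }
private:
    std::shared_ptr<void> block_;  // reference-counted cache entry
};
\end{verbatim}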
The cache, in the context of this work, primarily handles static data. Therefore, two restrictions are placed on the invalidation operation. This decision results in a drastically simpler cache design, as implementing a fully coherent cache would require developing a thread-safe coherence scheme, which is beyond the scope of our work. \par
@ -36,7 +36,7 @@ A map-datastructure was chosen to represent the current cache state with the key
Even with this optimization, in scenarios where the \texttt{Cache} is frequently tasked with flushing and re-caching by multiple threads from the same node, lock contention will negatively impact performance by delaying cache access. Due to passive waiting, this impact might be less noticeable when other threads on the system are able to make progress during the wait. \par
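To make the locking behaviour concrete, the following sketch shows one possible shape of such a state structure; the types and function names are illustrative and not taken from the implementation. Lookups acquire a shared lock and may proceed concurrently, while insertion and flushing require exclusive access, which is where the contention described above arises. \par
\begin{verbatim}
#include <memory>
#include <shared_mutex>
#include <unordered_map>

class CacheData;  // entry type, see the sketch above

// Illustrative per-node cache state: readers share the lock, while
// flushing and inserting take it exclusively and therefore delay
// concurrent accesses under heavy flush/re-cache traffic.
class CacheState {
public:
    std::shared_ptr<CacheData> Lookup(const void* src) {
        std::shared_lock lock(mutex_);
        auto it = entries_.find(src);
        return it == entries_.end() ? nullptr : it->second;
    }
    void Insert(const void* src, std::shared_ptr<CacheData> entry) {
        std::unique_lock lock(mutex_);
        entries_[src] = std::move(entry);
    }
    void Flush() {
        std::unique_lock lock(mutex_);
        entries_.clear();
    }
private:
    std::shared_mutex mutex_;
    std::unordered_map<const void*, std::shared_ptr<CacheData>> entries_;
};
\end{verbatim}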
\subsection{CacheData: Fair Threadsafe Implementation for WaitOnCompletion}
The choice made in \ref{subsec:design:cache-entry-reuse} necessitates thread-safe shared access to the same resource. The C++ standard library provides \texttt{std::shared\_ptr<T>}, a reference-counted pointer that is thread-safe for the required operations \cite{cppreference:shared-ptr}, making it a suitable candidate for this task. Although an implementation using it was explored, it presented its own set of challenges. \par
@ -110,7 +110,7 @@ Applying the \texttt{Cache} to \gls{qdp} is a straightforward process. We adapte
During the performance analysis of the developed \texttt{Cache}, we discovered that \gls{intel:dml} does not utilize interrupt-based completion signaling (Section \ref{subsubsec:state:completion-signal}), but instead employs busy-waiting on the completion descriptor being updated. Given that this busy waiting incurs CPU cycles, waiting on task completion is deemed impractical, necessitating code modifications. We extended \texttt{CacheData} and \texttt{Cache} to incorporate support for weak waiting. Through a flag configurable in \texttt{Cache}, all instances of \texttt{CacheData} created via \texttt{Cache::Access} will check only once whether the \gls{dsa} has completed processing the \texttt{Cache} operation and, if not, return without updating the cache pointer. Consequently, calls to \texttt{CacheData::GetDataLocation} may return \texttt{nullptr} even after waiting, placing the responsibility on the user to access the data through its source location. For applications prioritizing latency, \texttt{Cache::Access} offers the option of weak access. When activated, the function returns only existing instances of \texttt{CacheData}, thereby avoiding work submission to the \gls{dsa} if the address has not been previously cached or was flushed since the last access. Using these two options, we can avoid work submission and busy waiting where access latency is paramount. \par
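A usage sketch of these two options is given below. The signatures are simplified, the way the weak-access flag is passed is illustrative, and \texttt{process} stands for an arbitrary consumer of the column data. \par
\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include "cache.hpp"  // hypothetical header providing Cache and CacheData

void process(const uint8_t* data, size_t size);  // hypothetical consumer

// Latency-sensitive path: never submit new work and never busy-wait.
void scan_column(Cache& cache, uint8_t* column, size_t size) {
    // Weak access: only an already existing CacheData instance is
    // returned; no copy task is submitted to the DSA.
    auto data = cache.Access(column, size, /* weak */ true);

    uint8_t* location = column;  // fall back to the source location
    if (data) {
        data->WaitOnCompletion();  // a single check under weak waiting
        if (void* cached = data->GetDataLocation())
            location = static_cast<uint8_t*>(cached);
    }
    process(location, size);
}
\end{verbatim}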
Additionally, we observed inefficiencies stemming from page fault handling. Task execution time increases when page faults are handled by the \gls{dsa}, leading to cache misses. Consequently, our execution time becomes bound to that of \gls{dram}, as misses prompt a fallback to the data's source location. When page faults are handled by the CPU during allocation, these misses are avoided. However, the execution time of the first data access through the \texttt{Cache} increases significantly due to page fault handling. One potential solution entails bypassing the system's memory management by allocating a large memory block and implementing a custom memory management scheme. As memory allocation is a complex topic, we opted to delegate this responsibility to the user by mandating the provision of new- and free-like functions, akin to the policy functions utilized for determining placement and task distribution. Consequently, the benchmark can pre-allocate the required memory blocks, trigger page mapping, and subsequently pass these regions to the \texttt{Cache}. With this, we also remove the dependency on libnuma, mentioned as a restriction in Section \ref{subsec:design:restrictions}. \par
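The sketch below shows how a benchmark might provide such functions; libnuma is used purely as an example on the user's side, and the way the functions are handed to the \texttt{Cache} (here a hypothetical \texttt{Init} call) is illustrative. Touching every page once after allocation forces the CPU to resolve the page faults up front, so that neither the \gls{dsa} nor the first cached access encounters them. \par
\begin{verbatim}
#include <cstddef>
#include <cstring>
#include <numa.h>  // user-side example; the Cache itself stays agnostic

// Allocate on a given NUMA node and touch every page once, so page
// faults are taken here instead of during DSA task execution.
void* alloc_mapped(size_t size, int node) {
    void* p = numa_alloc_onnode(size, node);
    if (p) std::memset(p, 0, size);  // force page mapping up front
    return p;
}

// Hypothetical wiring: hand new- and free-like functions to the Cache,
// alongside the placement and distribution policy functions.
// cache.Init(placement_policy, distribution_policy,
//            alloc_mapped, [](void* p, size_t n) { numa_free(p, n); });
\end{verbatim}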
@ -30,6 +30,7 @@ The simple query presents a challenging scenario to the \texttt{Cache}. As the f
Due to the challenges posed by sharing memory bandwidth, we will benchmark prefetching in two configurations. In the first, both columns \texttt{a} and \texttt{b} are located on the same \gls{numa:node}. We expect the memory bottleneck in this situation to manifest as the execution time of \(SCAN_a\) rising by the amount of time spent prefetching in \(SCAN_b\). The second setup distributes the columns over two \gls{numa:node}s, both still backed by \glsentryshort{dram}. In this configuration, \(SCAN_a\) should only suffer from the additional threads performing the prefetching. \par
\section{Observations}
\label{sec:eval:observations}
In this section we will present our findings from applying the \texttt{Cache} developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}. We begin by presenting the results without prefetching, used as a reference in the evaluation of the results obtained with our \texttt{Cache}. Two methods were benchmarked, representing a baseline and an upper limit for what we can achieve with prefetching. For the former, all columns were located in \glsentryshort{dram}. The latter is achieved by placing column \texttt{b} in \gls{hbm} during benchmark initialization, thereby simulating perfect prefetching without delay and overhead. \par
% ... how far-sighted one is. This chapter must be short, so that it
% gets read. Should also include "Future Work".
\todo{write introductory paragraph}
With this work, we set out to analyse the architecture and performance of the \glsentrylong{dsa} and to integrate it into \glsentrylong{qdp}. In Section \ref{sec:state:dsa} we characterized the hardware and software architecture, followed by an overview of an available programming interface, \glsentrylong{intel:dml}, in Section \ref{sec:state:dml}. Our benchmarks were directed at the planned application and therefore included copy performance from \glsentryshort{dram} to \gls{hbm} (Section \ref{subsec:perf:datacopy}), the cost of multithreaded work submission (Section \ref{subsec:perf:mtsubmit}), and an analysis of different submission methods and sizes (Section \ref{subsec:perf:submitmethod}). We discovered an anomaly for inter-socket copy speeds and found the scaling of throughput to be distinctly below linear (Figure \ref{fig:perf-dsa-analysis:scaling}). Even though not all observations were explainable, the results give important insights into the behaviour of the \gls{dsa} and how it can be used in applications, complementing existing analysis \cite{intel:analysis} with detail for multi-socket systems and \gls{hbm}. \par
\section{Conclusions}
Applying the cache developed over Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}, we discovered challenges concerning the available memory bandwidth and the lack of feature support in the \glsentryshort{api} used to interact with the \gls{dsa}. With the \texttt{Cache} being a substantial contribution to the field, we will review its shortcomings and effectiveness. \par
\todo{write this section}
In its current state, the applicability is limited to data that is mutated infrequently. Even though support exists for entry invalidation, no ordering guarantees are given and invalidation is performed strictly manually. This places the burden on the programmer to ensure the validity of results after invalidation, as threads may still hold pointers to now-invalid data. As a remedy for this situation, a custom type could be developed that automatically calls invalidation for itself through the cache upon modification. A timestamp or other form of unique identifier in this data type may be used to tag results, enabling checks for source data validity. \par
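Such a type might, purely as a sketch, look as follows; none of the names below exist in the current implementation, and the invalidation call is assumed to match whatever interface the cache exposes. \par
\begin{verbatim}
#include <cstdint>

// Hypothetical wrapper: every modification invalidates the cached copy
// and bumps a version counter that results can be tagged with.
template <typename T>
class Invalidating {
public:
    Invalidating(Cache& cache, T* data) : cache_(cache), data_(data) {}
    const T& Read() const { return *data_; }
    void Write(const T& value) {
        *data_ = value;
        cache_.Invalidate(data_);  // assumed invalidation entry point
        ++version_;                // previously tagged results become stale
    }
    uint64_t Version() const { return version_; }
private:
    Cache& cache_;
    T* data_;
    uint64_t version_ = 0;
};
\end{verbatim}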
\begin{itemize}
\item lessons learned
\item where to apply the lessons and contributions
\item is it gainful like it is right now -> flow futurework
\end{itemize}
In Section \ref{sec:eval:observations} we observed negative effects when prefetching with the cache during parallel execution of memory-bound operations. With the simple query (see Section \ref{sec:state:qdp}) benchmarked, these effects were pronounced, as copy operations to cache directly cut into the bandwidth required by the pipeline, slowing overall progress more than the gains from accessing data through \gls{hbm} could make up for. This required distributing data over multiple \glsentrylong{numa:node}s to avoid affecting the bandwidth available to the pipelines by caching. Existing applications developed to run on \gls{numa} can be assumed to already perform this distribution to utilize the system's available memory bandwidth, and we therefore do not consider this a major fault of the \texttt{Cache}. \par
\section{Future Work}
As noted in Sections \ref{sec:state:dml} and \ref{sec:impl:application}, the \gls{api} used to interact with the \gls{dsa} does not currently support interrupt-based completion waiting or the use of the \glsentrylong{dsa:dwq}. Further development may focus on directly accessing the \gls{dsa}, foregoing the \glsentrylong{intel:dml}, to take advantage of the complete feature set. Interrupt-based waiting in particular would greatly enhance the usability of the \texttt{Cache}, which in its current state only allows busy-waiting; we judged this to be unacceptable and therefore opted to implement weak access and waiting in Section \ref{sec:impl:application}. \par
\todo{write this section}
The previous paragraphs and the results in Chapter \ref{chap:evaluation} could lead the reader to conclude that the \texttt{Cache} requires extensive work to be usable in production applications. We, however, posit the opposite. In favourable conditions, which, as previously mentioned, may be assumed for \glsentryshort{numa}-aware applications, we observed noticeable speed-up by using the \texttt{Cache} to perform prefetching to \glsentrylong{hbm} to accelerate database queries. Its use is not limited to prefetching, however: it also presents a solution for handling transfers from remote memory to faster local storage in the background, or for replicating data to non-volatile RAM. Evaluating its effectiveness in these scenarios and performing further benchmarks on more complex queries for \gls{qdp} might yield more concrete insights into these diverse fields of application. \par
\begin{itemize}
\item evaluate performance with more complex query
\item evaluate impact of lock contention and atomics on performance
\item implement direct dsa access to assess gains from using shared work queue
\item improve the cache implementation for use cases where data is not static
\end{itemize}
Finally, the \texttt{Cache}, together with the sections on the \gls{dsa}'s architecture (\ref{sec:state:dsa}) and its performance characteristics (\ref{sec:perf:bench}), fulfils the stated goal of this work. We achieved performance gains through the \gls{dsa} in \gls{qdp}, and thereby demonstrated its potential to aid in exploiting the properties offered by the multitude of storage tiers in heterogeneous memory systems. \par