In this chapter, we concentrate on specific implementation details, offering an in-depth view of how the design promises outlined in Chapter \ref{chap:design} are realized. Firstly, we delve into the usage of locking and atomics to achieve thread safety. Subsequently, we provide an example of the policy functions alluded to in Section \ref{sec:design:accel-usage}. Finally, we apply the cache to \glsentrylong{qdp}. \par
\section{Locking and Usage of Atomics}
The usage of locking and atomics has proven to be challenging. Their use is performance-critical, and mistakes may lead to deadlock. Consequently, they constitute the most interesting part of the implementation, which is why this chapter focuses extensively on the details of their use. \par
\subsection{Cache State Lock}\label{subsec:implementation:cache-state-lock}
To keep track of the current cache state, the \texttt{Cache} holds a reference to each currently existing \texttt{CacheData} instance. The reason for this is twofold: firstly, in Section \ref{sec:design:cache} we decided to keep elements in the cache until memory pressure forces their removal; secondly, in Subsection \ref{subsec:design:cache-entry-reuse} we decided to reuse one cache entry for multiple consumers. The second of these requires the structure holding the references to be thread safe when the cache state is accessed and modified in \texttt{Cache::Access}, \texttt{Cache::Flush} and \texttt{Cache::Clear}. Of these operations, \texttt{Cache::Flush} and \texttt{Cache::Clear} both require unique locking, preventing other calls to \texttt{Cache} from making progress while the operation is being processed. For \texttt{Cache::Access} the use of locking depends upon the cache's state. At first, only a shared lock is acquired for checking whether the given address already resides in cache, allowing other \texttt{Cache::Access}-operations to also perform this check. If no entry for the region is present, a unique lock is required as well when adding the newly created entry to the cache. \par
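The listing below gives a minimal sketch of this locking scheme. It is illustrative only: the types, member names and the signature of \texttt{Cache::Access} are simplified assumptions made for the example and do not reflect the actual implementation. \par
\begin{lstlisting}[language=C++]
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>

// simplified stand-in for the real cache entry
struct CacheData { uint8_t* src; size_t size; };

class Cache {
    std::shared_mutex mutex_;                        // guards the cache state
    std::unordered_map<uint8_t*, CacheData> state_;  // address -> cache entry

public:
    CacheData* Access(uint8_t* data, size_t size) {
        {
            // shared lock: other Access calls may search the state concurrently
            std::shared_lock lock(mutex_);
            if (auto it = state_.find(data); it != state_.end()) {
                return &it->second;                  // entry present, reuse it
            }
        }
        // unique lock: only taken when the state has to be extended
        std::unique_lock lock(mutex_);
        // re-check, as another thread may have inserted the entry meanwhile
        auto [it, inserted] = state_.try_emplace(data, CacheData{data, size});
        return &it->second;
    }
};
\end{lstlisting}
On the unique-lock path the lookup is repeated, since another \texttt{Cache::Access} may have inserted the entry between releasing the shared lock and acquiring the unique one. \par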
A map data structure was chosen to represent the current cache state, with the memory address of the entry as key and the \texttt{CacheData} instance as value. As the caching policy is controlled by the user, one datum may be requested for caching in multiple locations. To accommodate this, one map is allocated for each available \glsentrylong{numa:node} of the system. This can be exploited to reduce lock contention by separately locking each \gls{numa:node}'s state instead of utilizing a global lock. This ensures that \texttt{Cache::Access} and the implicit \texttt{Cache::Flush} it may cause cannot hinder progress of caching operations on other \gls{numa:node}s. Both \texttt{Cache::Clear} and a complete \texttt{Cache::Flush}, as callable by the user, now iterate over the per-\gls{numa:node} states and perform their respective task for one \gls{numa:node} at a time, allowing accesses on the other \gls{numa:node}s to progress in the meantime.\par
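A sketch of this per-\gls{numa:node} state is shown below. Again, all names are illustrative and the \gls{numa:node} count is a placeholder chosen only for the example. \par
\begin{lstlisting}[language=C++]
#include <array>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>

struct CacheData;  // stand-in, see previous sketch

constexpr size_t kNodeCount = 16;  // placeholder node count

// one map and one lock per node: flushing one node's state does not
// block Access calls that target a different node
struct NodeCacheState {
    std::shared_mutex mutex;
    std::unordered_map<uint8_t*, CacheData*> map;  // entries cached on this node
};

std::array<NodeCacheState, kNodeCount> node_state;

// Cache::Clear and a full Cache::Flush process the nodes one at a time,
// so only a single node's state is uniquely locked at any moment
void ClearAll() {
    for (NodeCacheState& node : node_state) {
        std::unique_lock lock(node.mutex);
        node.map.clear();  // the real code would also release the cached memory
    }
}
\end{lstlisting}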
Even with this optimization, in scenarios where the \texttt{Cache} is frequently tasked with flushing and re-caching by multiple threads from the same node, lock contention will negatively impact performance by delaying cache access. Due to passive waiting, this impact might be less noticeable when other threads on the system are able to make progress during the wait. \par
\subsection{CacheData Atomicity}
The choice made in Subsection \ref{subsec:design:cache-entry-reuse} necessitates thread-safe shared access to the same resource. The C++ standard library provides \texttt{std::shared\_ptr<T>}, a reference-counted pointer that is thread-safe for the required operations \cite{cppreference:shared-ptr}, making it a suitable candidate for this task. Although an implementation using it was explored, it presented its own set of challenges. \par
As we aim to minimize the time spent in a locked region, only the task is added to the \gls{numa:node}'s cache state while the lock is held, with the submission taking place outside the locked region. We assume the handlers of \gls{intel:dml} to be unsafe for access from multiple threads, so to achieve the safety for \texttt{CacheData::WaitOnCompletion} outlined in Subsection \ref{subsec:design:cache-entry-reuse}, threads need to coordinate which one performs the actual waiting. To avoid queuing multiple copies of the same task, the task must be added to the cache state before submission. This results in a \texttt{CacheData} instance with an invalid cache pointer and no handlers to wait for, presenting an edge case to be considered. \par
Using \texttt{std::shared\_ptr<T>} also introduces uncertainty, as it relies on the implementation to be performant. The standard does not specify whether a lock-free algorithm is to be used, and \cite{shared-ptr-perf} suggests abysmal performance for some implementations, although the full article is only available in Korean. No further research was found on this topic. \par
Therefore, the decision was made to implement atomic reference counting for \texttt{CacheData}. This involves providing custom constructors and a destructor wherein a shared atomic integer, reachable from every copy through a plain pointer, is incremented or decremented using atomic fetch-add and fetch-sub operations \cite{cppreference:atomic-operations} to modify the reference count. When the decrement in the destructor brings the count to zero, the destructor was called for the last reference and therefore performs the actual destruction. \par
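The following sketch shows the core of this reference counting. It is reduced to the counter itself; the remaining members of \texttt{CacheData} and the assignment operators are omitted, and all names are illustrative. \par
\begin{lstlisting}[language=C++]
#include <atomic>
#include <cstdint>

class CacheData {
    uint8_t* src_;
    std::atomic<int32_t>* active_;  // shared by all copies of this entry

public:
    explicit CacheData(uint8_t* src)
        : src_(src), active_(new std::atomic<int32_t>(1)) {}

    CacheData(const CacheData& other)
        : src_(other.src_), active_(other.active_) {
        active_->fetch_add(1);  // one more reference to the entry
    }

    ~CacheData() {
        // fetch_sub returns the previous value: 1 means this was the last reference
        if (active_->fetch_sub(1) == 1) {
            delete active_;
            // the real destructor would also free the cached copy here
            // and clean up any handlers that were never waited on
        }
    }
};
\end{lstlisting}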
Additionally, the invalid state of \texttt{CacheData} is avoided. To achieve this, the waiting algorithm requires the handlers to be contained in an atomic pointer, and the pointer to the cache memory must be atomic too. This enables the use of the atomic wait operation, which is guaranteed by the standard to be more efficient than simply spinning on Compare-And-Swap \cite{cppreference:atomic-wait}. Some standard implementations achieve this by yielding after a short spin cycle \cite{atomic-wait-details}. \par
\todo{reinsert missing figure}
Designing the wait to work from any thread was complicated. In the first implementation, a thread would check if the handlers are available and, if not, atomically wait \cite{cppreference:atomic-wait} on a value change from nullptr. As the handlers only become available after submission, a situation could arise where only one copy of \texttt{CacheData} is capable of actually waiting on them. Let us assume that three threads \(T_1\), \(T_2\) and \(T_3\) wish to access the same resource. \(T_1\) is the first to call \texttt{Cache::Access} and therefore adds the new entry to the cache state and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted, and \(T_2\) and \(T_3\) obtain access to the incomplete \texttt{CacheData} on which they wait, causing them to see a nullptr for the handlers and a still-invalid cache pointer, leading them to atomically wait on the cache pointer (the lines marked blue in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). Now \(T_1\) submits the work and sets the handlers (the lines marked red in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) and \(T_3\) continue to wait. At this point, only \(T_1\) can complete the wait and wake the other threads, making it capable of keeping \(T_2\) and \(T_3\) from progressing. This is undesirable, as it can lead to deadlock if, for some reason, \(T_1\) does not wait at all, and at the very least causes unnecessary delay for \(T_2\) and \(T_3\) if \(T_1\) does not wait immediately. \par
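The following listing illustrates the check-then-wait shape of this first implementation and why only the submitting thread can end the wait. The member layout, the names and the \texttt{Handlers} type are placeholders; the listing is a schematic of the behaviour described above, not the actual code. \par
\begin{lstlisting}[language=C++]
#include <atomic>
#include <cstdint>

struct Handlers;  // opaque stand-in for the dml handler collection

class CacheData {
    // reference counter omitted, see the previous sketch
    std::atomic<uint8_t*>* cache_;      // shared; set once the copy is complete
    std::atomic<Handlers*>* handlers_;  // shared; set once the work was submitted

    void WaitOnHandlers() {
        // placeholder: wait on each dml handler, then store the final
        // location of the copy into *cache_
    }

public:
    void WaitOnCompletion() {
        if (handlers_->load() == nullptr) {
            // T2/T3 in the scenario above: no handlers visible yet, so wait
            // for the cache pointer to change away from nullptr (blue lines)
            cache_->wait(nullptr);
        } else {
            // only the thread that also submitted the work ever takes this
            // branch (red lines); it alone can wake the threads blocked above
            WaitOnHandlers();
            cache_->notify_all();
        }
    }
};
\end{lstlisting}
If \(T_1\) never calls \texttt{CacheData::WaitOnCompletion}, the handlers remain untouched and the cache pointer is never published, which is exactly the deadlock potential described above. \par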
\todo{reinsert missing figure}
\section{Accelerator Usage}
Following Section \ref{sec:design:accel-usage}, the implementation of \texttt{Cache} leaves it up to the user to choose a caching and copy method policy, which is accomplished by submitting function pointers at initialization of the \texttt{Cache}. In Section \ref{sec:state:setup-and-config} we configured our system so that \gls{hbm} is accessed through separate \gls{numa:node}s, whose IDs are obtained by adding eight to the ID of the \gls{numa:node} that physically contains the \gls{hbm}. Therefore, if \gls{numa:node} 3 accesses some datum, the most efficient placement for the copy is on \gls{numa:node} \(3+8=11\). As the \texttt{Cache} is intended for multithreaded usage, conserving accelerator resources is important, so that concurrent cache requests complete quickly. To achieve high per-copy performance while maintaining low resource usage, the smart-copy method described in Subsection \ref{subsec:perf:datacopy} is selected for larger copies, while small copies are handled exclusively by the current \gls{numa:node}. This distinction is made due to the overhead of assigning the current thread to the selected \gls{numa:node}s, which is required because \gls{intel:dml} assigns submissions only to the \gls{dsa} engine present on the \gls{numa:node} of the calling thread \cite[Section ``NUMA support'']{intel:dmldoc}. No testing has taken place to evaluate this overhead and determine the most effective threshold. \par
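To make the policy interface concrete, the listing below sketches what such a pair of policy functions could look like. The signatures, the threshold constant and the node split are assumptions made for illustration only; in particular, no measured value for the threshold exists yet, as stated above. \par
\begin{lstlisting}[language=C++]
#include <cstddef>
#include <vector>

// placement policy: cache on the HBM node belonging to the accessing node,
// which on the system described earlier is reached by adding eight
int CachePlacementPolicy(int accessing_node, int /*data_node*/, size_t /*size*/) {
    return accessing_node + 8;  // e.g. node 3 -> HBM node 11
}

// copy policy: return the nodes whose DSA engines should each process a
// part of the copy
std::vector<int> CopyMethodPolicy(int accessing_node, int data_node, size_t size) {
    constexpr size_t kSmartCopyThreshold = 64UL * 1024 * 1024;  // placeholder value
    if (size < kSmartCopyThreshold) {
        // small copy: stay on the current node and avoid the overhead of
        // re-assigning the calling thread to other nodes
        return { accessing_node };
    }
    // larger copy: split the work across multiple engines ("smart copy");
    // the node selection shown here is only an illustrative choice
    return { data_node, accessing_node };
}
\end{lstlisting}
Both functions would then be handed to the \texttt{Cache} at initialization as the function pointers mentioned above. \par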
\section{Application to \glsentrylong{qdp}}
\todo{write this section or consider putting it in evaluation}