The task of prefetching is closely related to that of a cache. As a cache is more generic and allows use beyond Query Driven Prefetching, the choice was made to solve the prefetching offload by implementing an offloading \texttt{Cache}. From now on, \texttt{Cache} refers to the provided implementation. The interface of \texttt{Cache} must provide three basic functions: requesting a memory block to be cached, accessing a cached memory block, and synchronizing the cache with the source memory. The latter operation comes into play when the cached data may also be modified, requiring the entry to be updated from the source or the other way around. Due to the many possible setups and use cases, the user should also be responsible for choosing cache placement and the copy method. As re-caching is resource intensive, data should remain in the cache for as long as possible and only be removed when memory pressure, caused by limited memory size, drives the \texttt{Cache} to flush unused entries. \par
As the usage of locking and atomics may have a significant impact on performance, their application will be discussed in detail within this section. \todo{extend introductory paragraph}\par
\subsection{Cache State Lock}\label{subsec:implementation:cache-state-lock}
To keep track of the current cache state, the \texttt{Cache} holds a reference to each currently existing \texttt{CacheData} instance in an internal map, which associates a memory address with a \texttt{CacheData} instance. The reason for this is twofold: First, in \ref{sec:design:cache} we decided to keep elements in the cache until memory pressure forces their removal. Second, in \ref{subsec:design:cache-entry-reuse} we decided to reuse one cache entry for multiple consumers. The latter requires thread safety when accessing and extending the cache state in \texttt{Cache::Access}, \texttt{Cache::Flush} and \texttt{Cache::Clear}. \texttt{Cache::Flush} and \texttt{Cache::Clear} both require a unique lock, preventing other calls to \texttt{Cache} from making progress while the operation is being processed. For \texttt{Cache::Access}, the use of locking depends upon the cache's state. At first, only a shared lock is acquired for checking whether the given address already resides in the cache, allowing other \texttt{Cache::Access} operations to perform this check concurrently. If no entry for the region is present, a unique lock is additionally required for adding the newly created entry to the cache, which, however, is a rather short operation. \par
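The locking scheme of \texttt{Cache::Access} described above can be illustrated with a minimal sketch. It assumes a \texttt{std::shared\_mutex} guarding a \texttt{std::unordered\_map}; the class name \texttt{CacheStateSketch}, the reduced \texttt{CacheData} struct and the helper \texttt{Size} are hypothetical and do not reflect the actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>

// Hypothetical stand-in for CacheData, reduced to the fields needed here.
struct CacheData {
    std::uint8_t* src;
    std::size_t size;
};

// Assumption: the cache state is a map from source address to entry,
// guarded by one reader-writer lock.
class CacheStateSketch {
public:
    // Returns the existing entry for `src` or registers a new one.
    std::shared_ptr<CacheData> Access(std::uint8_t* src, std::size_t size) {
        {
            // Shared lock: concurrent lookups do not block each other.
            std::shared_lock<std::shared_mutex> read_lock(mutex_);
            auto it = state_.find(src);
            if (it != state_.end()) return it->second;
        }
        // Create the entry outside of the critical section.
        auto entry = std::make_shared<CacheData>(CacheData{src, size});
        // Unique lock: held only for the short insertion.
        std::unique_lock<std::shared_mutex> write_lock(mutex_);
        // try_emplace keeps an entry that another thread inserted meanwhile.
        return state_.try_emplace(src, std::move(entry)).first->second;
    }

    // Removes all entries while blocking every other operation.
    void Clear() {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        state_.clear();
    }

    std::size_t Size() const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return state_.size();
    }

private:
    mutable std::shared_mutex mutex_;
    std::unordered_map<std::uint8_t*, std::shared_ptr<CacheData>> state_;
};
```

Two calls to \texttt{Access} with the same address return the same entry in this sketch. The second lookup under the unique lock, performed implicitly by \texttt{try\_emplace}, covers the case where another thread inserted the entry between releasing the shared lock and acquiring the unique one.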
A map was chosen to represent the current cache state, with the memory address of the entry as key and the \texttt{CacheData} instance as value. As the caching policy is controlled by the user, one datum may be requested for caching in multiple locations. To accommodate this, one map is allocated for each available node of the system. This can be exploited to reduce lock contention by locking each node's state separately instead of utilizing a global lock. This ensures that \texttt{Cache::Access} and the implicit \texttt{Cache::Flush} it may cause cannot hinder the progress of caching operations on other nodes. Both \texttt{Cache::Clear} and a complete \texttt{Cache::Flush}, as callable by the user, iteratively perform their respective task on each node's state, which also allows other nodes to make progress.\par
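The per-node state could be sketched as follows. This is an illustration under assumptions, not the actual implementation: \texttt{NodeState}, \texttt{PerNodeCacheState} and \texttt{EntryCount} are hypothetical names, and the node index is passed explicitly instead of being derived from a placement policy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for CacheData.
struct CacheData {
    std::uint8_t* src;
    std::size_t size;
};

// One map and one lock per node, so nodes never contend with each other.
struct NodeState {
    std::shared_mutex mutex;
    std::unordered_map<std::uint8_t*, std::shared_ptr<CacheData>> entries;
};

class PerNodeCacheState {
public:
    explicit PerNodeCacheState(std::size_t node_count) : nodes_(node_count) {}

    // Looks up or inserts an entry in the state of `node` only.
    std::shared_ptr<CacheData> Access(std::size_t node, std::uint8_t* src,
                                      std::size_t size) {
        NodeState& state = nodes_.at(node);
        {
            std::shared_lock<std::shared_mutex> read_lock(state.mutex);
            auto it = state.entries.find(src);
            if (it != state.entries.end()) return it->second;
        }
        auto entry = std::make_shared<CacheData>(CacheData{src, size});
        std::unique_lock<std::shared_mutex> write_lock(state.mutex);
        return state.entries.try_emplace(src, std::move(entry)).first->second;
    }

    // Clears node by node; only one node's state is locked at a time.
    void Clear() {
        for (NodeState& state : nodes_) {
            std::unique_lock<std::shared_mutex> lock(state.mutex);
            state.entries.clear();
        }
    }

    std::size_t EntryCount(std::size_t node) {
        NodeState& state = nodes_.at(node);
        std::shared_lock<std::shared_mutex> lock(state.mutex);
        return state.entries.size();
    }

private:
    std::vector<NodeState> nodes_;
};
```

In this sketch, caching the same address on two different nodes yields two independent entries, and \texttt{Clear} never holds more than one node's lock at a time.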
Even with this optimization, lock contention will negatively impact performance by delaying cache access in scenarios where the \texttt{Cache} is frequently tasked with flushing and re-caching by multiple threads from the same node that access large amounts of data and thereby cause high memory pressure. Due to passive waiting, this impact might be less noticeable when other threads on the system are able to make progress during the wait. \par
\subsection{CacheData Reference Counting}
\todo{write this section}
\subsection{CacheData WaitOnCompletion}
\todo{write this section}
\subsection{Performance Guideline}
Atomic operations come with an added performance penalty. No recent studies were found on this, although we assume that the findings of Hermann Schweizer et al. in ``Evaluating the Cost of Atomic Operations on Modern Architectures'' \cite{atomics-cost-analysis} still hold true for today's processor architectures. Due to the inherent cache synchronization mechanisms in place \cite[Subsection IV.A.3 Off-Die Access]{atomics-cost-analysis}, they observed a significant increase in access latency depending on whether the atomic variable was located on the local core, on a different core on the same chip or on another socket \cite[Fig. 4]{atomics-cost-analysis}. Reducing the cost of atomic accesses would require a less generic implementation, giving up some of the guarantees we give in \ref{sec:design:cache}. This would reduce the number of atomics required but is outside the scope of this work. \par
The performance impact of lock contention \todo{find a reference for this and use above too} and atomic synchronization \todo{find a reference for this and use above too} is not to be taken lightly, as \texttt{Cache} may be used in performance-critical systems. Reducing their impact is therefore desirable and can be achieved in multiple ways. The easiest is to have one instance of \texttt{Cache} per \gls{numa:node}, which reduces both lock contention, by serving fewer threads, and the cost of atomic synchronization, as the atomics are shared between physically close CPU cores \todo{find a reference that shows that physical distance affects sync cost}. This requires no code modification but does not inherently reduce the amount of synchronization taking place. To achieve this reduction, restrictions must be placed upon the thread safety or access guarantees, which is not sensible for this generic implementation.
With the distributed locking described in \ref{subsec:implementation:cache-state-lock}, lock contention should not have a significant impact, although this remains to be tested. In addition to that, passive waiting at the contended section will benefit other threads and might allow overall progress to continue, if the application utilizing the cache has a threading model supporting this. These two factors lead us to classify lock contention as only a minor performance problem. \par
\section{Accelerator Usage}
Following \ref{subsec:implementation:accel-usage}, the provided implementation of \texttt{Cache} leaves it up to the user to choose a caching and copy method policy, which is accomplished by submitting function pointers at initialization of the \texttt{Cache}. In \ref{sec:state:setup-and-config} we configured our system to have separate \gls{numa:node}s for accessing \gls{hbm}, which are assigned a \gls{numa:node}-ID by adding eight to the ID of the \gls{numa:node} that physically contains the \gls{hbm}. Therefore, given that \gls{numa:node} 3 accesses some datum, the most efficient placement for the copy would be on \gls{numa:node} \(3+8=11\). As the \texttt{Cache} is intended for multithreaded usage, conserving accelerator resources is important, so that concurrent cache requests complete quickly. To get high per-copy performance while maintaining low usage, the smart-copy method described in \ref{sec:perf:datacopy} is selected for larger copies, while small copies, currently those under 64 MiB, are handled exclusively by the current node. This distinction is made due to the overhead of assigning the current thread to the selected nodes, which is required as \gls{intel:dml} is not \gls{numa} aware and assigns submissions only to the \gls{dsa} engine present on the node of the calling thread \cite[Section "NUMA support"]{intel:dmldoc}. No testing has taken place to evaluate this overhead and determine the most effective threshold; due to the affinity-setting and scheduling overhead, the ideal value may well be higher. \todo{Actually test which size makes sense, due to numa-setaffinity and scheduling overhead this will probably be much higher}
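The two policies described above can be sketched as plain functions that are handed to the \texttt{Cache} as function pointers. All names and signatures below (\texttt{HbmPlacementPolicy}, \texttt{CopyPolicy}, \texttt{kSmallCopyThreshold}) are hypothetical. The 64 MiB threshold is an untested assumption, and the branch for large copies is only a placeholder for the smart-copy node selection from \ref{sec:perf:datacopy}, which is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// From the system setup: the HBM node ID is the ID of the containing node plus eight.
constexpr int kHbmNodeOffset = 8;

// Assumption: untested threshold below which only the current node copies.
constexpr std::size_t kSmallCopyThreshold = std::size_t{64} * 1024 * 1024;

// Hypothetical policy signatures, passed to the cache as function pointers.
using PlacementPolicy = int (*)(int accessing_node);
using CopyPolicyFn = std::vector<int> (*)(int current_node, int src_node,
                                          int dst_node, std::size_t size);

// Places the cached copy in the HBM attached to the accessing node.
int HbmPlacementPolicy(int accessing_node) {
    return accessing_node + kHbmNodeOffset;
}

// Selects the nodes whose engines should execute the copy.
std::vector<int> CopyPolicy(int current_node, int src_node, int dst_node,
                            std::size_t size) {
    if (size < kSmallCopyThreshold) {
        // Small copy: avoid the cost of re-pinning the thread to other nodes.
        return {current_node};
    }
    // Placeholder for the smart-copy selection: split between both endpoints.
    return {src_node, dst_node};
}
```

With these definitions, \texttt{HbmPlacementPolicy(3)} evaluates to 11, matching the example in the text, and a copy of 1 KiB issued from node 3 is handled by node 3 alone.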