The usage of locking and atomics proved to be challenging. Their use is performance critical, and mistakes may lead to deadlock. They therefore constitute the most interesting part of the implementation, which is why this chapter focuses extensively on the related implementation details. \par
\subsection{Cache State Lock}\label{subsec:implementation:cache-state-lock}
To keep track of the current cache state, the \texttt{Cache} holds a reference to each currently existing \texttt{CacheData} instance. The reason for this is twofold: in \ref{sec:design:cache} we decided to keep elements in the cache until memory pressure forces their removal, and in \ref{subsec:design:cache-entry-reuse} we decided to reuse one cache entry for multiple consumers. The second decision requires the structure holding these references to be thread safe when the cache state is accessed and extended in \texttt{Cache::Access}, \texttt{Cache::Flush} and \texttt{Cache::Clear}. The latter two both require a unique lock, preventing other calls to \texttt{Cache} from making progress while the operation is being processed. For \texttt{Cache::Access} the use of locking depends upon the cache's state. At first, only a shared lock is acquired to check whether the given address already resides in the cache, allowing other \texttt{Cache::Access} operations to perform this check concurrently. If no entry for the region is present, a unique lock is additionally required when adding the newly created entry to the cache. \par
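The following sketch illustrates this two-step locking pattern in \texttt{Cache::Access}. It is a simplified illustration rather than the actual implementation: the member layout, the \texttt{CacheData} stub and the handling of concurrent insertions are assumptions made for brevity.

\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>
#include <shared_mutex>

// stand-in for the reference counted cache entry discussed below
struct CacheData {
  CacheData(uint8_t* data, size_t size) { (void)data; (void)size; }
};

class Cache {
  std::shared_mutex state_mutex_;
  std::map<uint8_t*, CacheData> state_;

 public:
  CacheData Access(uint8_t* data, size_t size) {
    {
      // shared lock: multiple threads may concurrently check whether
      // an entry for the requested region already exists
      std::shared_lock<std::shared_mutex> lock(state_mutex_);
      auto it = state_.find(data);
      if (it != state_.end()) return it->second;
    }

    // unique lock: required for extending the cache state
    std::unique_lock<std::shared_mutex> lock(state_mutex_);
    auto [it, inserted] = state_.emplace(data, CacheData(data, size));
    CacheData entry = it->second;
    lock.unlock();

    // work submission takes place outside the locked region
    if (inserted) SubmitCopy(entry);
    return entry;
  }

 private:
  void SubmitCopy(CacheData&) { /* copy task submission omitted */ }
};
\end{verbatim}

Note that the check under the shared lock and the insertion under the unique lock form two separate critical sections; the result of \texttt{emplace} is therefore used to detect whether another thread inserted the same entry in between, in which case the existing entry is reused. \par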
A map was chosen to represent the current cache state, with the memory address of the entry as key and the \texttt{CacheData} instance as value. As the caching policy is controlled by the user, one datum may be requested for caching in multiple locations. To accommodate this, one map is allocated for each available node of the system. This can be exploited to reduce lock contention by locking each node's state separately instead of utilizing a global lock, ensuring that \texttt{Cache::Access} and the implicit \texttt{Cache::Flush} it may cause cannot hinder the progress of caching operations on other nodes. Both \texttt{Cache::Clear} and a complete \texttt{Cache::Flush}, as callable by the user, now iteratively perform their respective task on each node's state, also allowing other nodes to progress. \par
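A possible layout of this per-node state is sketched below, under the assumption of a fixed node count and with illustrative member names that do not reproduce the actual implementation.

\begin{verbatim}
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>
#include <shared_mutex>

struct CacheData {};  // stand-in for the reference counted cache entry

// per-node cache state: each node's map is guarded by its own lock
struct NodeCacheState {
  std::shared_mutex mutex;
  std::map<uint8_t*, CacheData> entries;
};

class Cache {
  // assumed fixed number of NUMA nodes; one independently locked state
  // per node keeps operations on different nodes from blocking each other
  static constexpr size_t kNodeCount = 16;
  std::array<NodeCacheState, kNodeCount> node_state_;

 public:
  void Clear() {
    // iterate over the per-node states; while one node is being
    // cleared, caching operations on the other nodes may progress
    for (NodeCacheState& node : node_state_) {
      std::unique_lock<std::shared_mutex> lock(node.mutex);
      node.entries.clear();
    }
  }
};
\end{verbatim}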
Even with this optimization, in scenarios where the \texttt{Cache} is frequently tasked with flushing and re-caching by multiple threads from the same node, lock contention will negatively impact performance by delaying cache access. Due to passive waiting, this impact might be less noticeable when other threads on the system are able to make progress during the wait. \par
\subsection{CacheData Atomicity}
The choice made in \ref{subsec:design:cache-entry-reuse} requires thread safe shared access to the same resource. \texttt{std::shared\_ptr<T>} provides a reference-counted pointer which is thread safe for the required operations, making it a prime candidate for this task. An implementation using it was explored but proved to offer its own set of challenges. As we wish to reduce time spent in a locked region, the task is only added to the node's cache state while the lock is held. Submission takes place outside, which is sensible, as submitting one task should not hinder accessing another. To achieve the safety for \texttt{CacheData::WaitOnCompletion} outlined in \ref{subsec:design:cache-entry-reuse}, the threads would have to coordinate which of them performs the actual waiting, as we assume the handlers of \gls{intel:dml} to not be thread safe. In order to avoid queuing the same copy multiple times, the task must be added to the cache state before submission. This results in a \texttt{CacheData} instance with an invalid cache pointer and no handlers available to wait on, requiring additional synchronization primitives. Using \texttt{std::shared\_ptr<T>} also brings the uncertainty of relying on the implementation to be performant. The standard does not specify whether a lock-free algorithm is to be used, and \cite{shared-ptr-perf} suggests abysmal performance for some implementations, although the full article is only available in Korean. No further research was found on this topic. \par
It was therefore decided to implement atomic reference counting for \texttt{CacheData}. Custom constructors and a destructor were provided, in which a reference counter, shared between all copies of an instance through a standard pointer to an atomic integer, is incremented or decremented using atomic fetch-add and fetch-sub operations \cite{cppreference:atomic-operations}. When the decrement in the destructor signals that it is running for the last remaining reference, actual destruction is performed. The invalid state of \texttt{CacheData} described above is also avoided. To achieve this, the waiting algorithm requires the handlers to be contained in an atomic pointer, and the pointer to the cache memory must be atomic too. Through this we may use the atomic wait operation, which is guaranteed by the standard to be more efficient than simply spinning on Compare-And-Swap \cite{cppreference:atomic-wait}. Some standard library implementations achieve this by yielding after a short spin cycle \cite{atomic-wait-details}. \par
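A reduced sketch of this reference counting scheme is given below. The member names are assumptions, and the handlers, cache pointer and waiting logic are omitted for brevity.

\begin{verbatim}
#include <atomic>
#include <cstddef>
#include <cstdint>

class CacheData {
  uint8_t* src_;                   // pointer to the source memory
  size_t size_;                    // size of the cached region
  std::atomic<int32_t>* active_;   // counter shared by all copies

 public:
  CacheData(uint8_t* data, size_t size)
      : src_(data), size_(size), active_(new std::atomic<int32_t>(1)) {}

  CacheData(const CacheData& other)
      : src_(other.src_), size_(other.size_), active_(other.active_) {
    // atomically increment the shared reference counter
    active_->fetch_add(1);
  }

  ~CacheData() {
    // fetch_sub returns the previous value; a result of 1 signals that
    // the destructor runs for the last remaining reference
    if (active_->fetch_sub(1) == 1) {
      // wait on completion and free the cache memory here (omitted)
      delete active_;
    }
  }
};
\end{verbatim}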
Designing the wait to work from any thread was complicated. In the first implementation, a thread would check whether the handlers are available and, if not, atomically wait \cite{cppreference:atomic-wait} on a value change from \texttt{nullptr}. As the handlers only become available after submission, a situation could arise in which only one copy of \texttt{CacheData} is capable of actually waiting on them. Let us assume that three threads \(T_1\), \(T_2\) and \(T_3\) wish to access the same resource. \(T_1\) is the first to call \texttt{CacheData::Access}, therefore adds the entry to the cache state and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted, and \(T_2\) and \(T_3\) obtain access to the incomplete \texttt{CacheData} on which they wait: they see \texttt{nullptr} for the handlers and an invalid cache pointer, leading them to wait atomically on the cache pointer. Now \(T_1\) submits the work and sets the handlers, while \(T_2\) and \(T_3\) continue to wait. At this point only \(T_1\) can perform the actual waiting and is therefore capable of keeping \(T_2\) and \(T_3\) from progressing. This is undesirable, as it can lead to deadlock if for some reason \(T_1\) does not wait, and at the very least causes unnecessary delay for \(T_2\) and \(T_3\) if \(T_1\) does not wait immediately. \par
To solve this, a different and more complicated order of waiting operations is required. When waiting, a thread now immediately checks whether the cache pointer contains a valid value and returns if it does, as nothing has to be waited for in this case. Let us take the same example as before to illustrate the second part of the waiting procedure. \(T_2\) and \(T_3\) both arrive in this latter section, as the cache was invalid at the point in time when waiting was called. They now atomically wait on the handlers pointer to change, instead of the other way around as before. When \(T_1\) supplies the handlers, it also uses \texttt{std::atomic<T>::notify\_one} \cite{cppreference:atomic-notify-one} to wake at least one thread waiting on a value change of the handlers pointer, if there is any. Through this, the exclusion that was observable in the first implementation is avoided. If nobody is waiting, the handlers are simply set to a valid pointer and a thread may pass the atomic wait instruction later on. Following this wait, the handlers pointer is atomically exchanged \cite{cppreference:atomic-exchange} with \texttt{nullptr}, invalidating it. Each thread then checks whether it received a valid local pointer to the handlers from the exchange; if it did, the atomic operation guarantees that it is now in sole possession of the pointer, and this owning thread is tasked with actually waiting. All other threads regress and call \texttt{CacheData::WaitOnCompletion} again. The solo thread may proceed to wait on the handlers and should then update the cache pointer. \par
Two additional cases must be considered for the latter implementation to be safe. The wait operation first checks for a valid cache pointer and then waits on the handlers becoming valid. After processing the handlers, they are deleted and the pointer is therefore invalidated. Should the cache pointer still be invalid at this point, deadlocks would ensue. Therefore, the thread which exchanged the handlers pointer for a valid local copy must set the cache pointer to a valid value. Should one of the offloaded operations have failed, using the cache pointer is out of the question, as the datum it references might be invalid; the cache pointer is set to the source address in this case. Secondly, after one thread has exchanged the handlers pointer for its local copy, other threads may accumulate waiting on the handlers to become available. This can happen when the wait on the handlers takes a sufficient amount of time, during which both handlers and cache pointer are invalid. After waiting, the responsible thread must therefore signal all \cite{cppreference:atomic-notify-all} threads waiting on the handlers to continue. \par
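The complete waiting procedure is sketched below. This is a strongly simplified illustration and not a reproduction of the actual implementation: the member names, the \texttt{CopyHandler} stand-in for the \gls{intel:dml} handlers, and the handling of threads that find the handlers already taken (here by waiting on the cache pointer) are assumptions of the sketch. It also already exchanges the handlers pointer with an all-bits-set invalid value instead of \texttt{nullptr}, for the reasons discussed in the following paragraph.

\begin{verbatim}
#include <atomic>
#include <cstdint>
#include <vector>

// stand-in for the handlers returned by the work submission; the real
// type is a vector of dml handler instances
struct CopyHandler {
  // placeholder for waiting on a handler and checking its status
  bool WaitAndCheck() { return true; }
};
using HandlerVector = std::vector<CopyHandler>;

// all bits set yields a pointer value that can never refer to a valid
// handler object and serves as the second invalid state
static HandlerVector* const kHandlersClaimed =
    reinterpret_cast<HandlerVector*>(~static_cast<uintptr_t>(0));

class CacheData {
  // construction, initialization and reference counting omitted
  uint8_t* src_;                              // source memory
  std::atomic<uint8_t*>* cache_;              // shared among all copies
  std::atomic<uint8_t*>* incomplete_cache_;   // set by submitting thread
  std::atomic<HandlerVector*>* handlers_;     // shared among all copies

 public:
  void WaitOnCompletion() {
    for (;;) {
      // a valid cache pointer means there is nothing left to wait for
      if (cache_->load() != nullptr) return;

      // block until the submitting thread has published the handlers
      handlers_->wait(nullptr);

      // try to take sole possession of the handlers by exchanging the
      // pointer with the invalid all-bits-set value instead of nullptr
      HandlerVector* local = handlers_->exchange(kHandlersClaimed);

      if (local == kHandlersClaimed) {
        // another thread already owns the handlers: wait until it has
        // published the cache pointer, then re-check from the top
        cache_->wait(nullptr);
        continue;
      }

      // sole owner: wait on all offloaded copy operations
      bool success = true;
      for (CopyHandler& handler : *local)
        if (!handler.WaitAndCheck()) success = false;
      delete local;

      // publish a usable pointer; fall back to the source address if
      // one of the offloaded operations failed
      cache_->store(success ? incomplete_cache_->load() : src_);

      // wake all threads still waiting on either atomic
      cache_->notify_all();
      handlers_->notify_all();
      return;
    }
  }
};
\end{verbatim}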
Two types of deadlocks were encountered during testing and have been accounted for. Firstly, it was found that the guarantee of \texttt{std::atomic<T>::wait} to only wake up when the value has changed \cite{cppreference:atomic-wait} is stronger than the promise of waking up all waiting threads with \texttt{std::atomic<T>::notify\_all} \cite{cppreference:atomic-notify-all}. The handlers pointer may therefore not be exchanged with \texttt{nullptr}, which is the value being waited on. As the highest envisioned address on current 64-bit systems requires only the lower 52 bits \cite[p. 120]{amd:programmers-manual}\cite[p. 4-2]{intel:programmers-manual}, setting all bits of a 64-bit value yields an invalid pointer, which is used as the second possible invalid state. The second type of deadlock was encountered when, after creating a \texttt{CacheData} instance, it was determined that an entry for the same region already exists in the cache and the new instance was dropped. As destruction waits on completion in order to ensure that no further jobs require the held memory, a deadlock would arise from the cache and handlers pointers both being null while no handlers are ever set, due to the instance being deleted immediately. To circumvent this, the constructor of \texttt{CacheData} was modified to point to source memory by default. Only after calling a separate initialization function will \texttt{CacheData} replace this with \texttt{nullptr}, thereby readying the instance for multithreaded usage. \par
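A minimal sketch of this deferred initialization is shown below; the member and function names are again assumptions, and reference counting as well as the handlers are omitted.

\begin{verbatim}
#include <atomic>
#include <cstddef>
#include <cstdint>

class CacheData {
  uint8_t* src_;
  size_t size_;
  std::atomic<uint8_t*>* cache_;

 public:
  CacheData(uint8_t* data, size_t size)
      : src_(data), size_(size),
        // point to the source memory by default, so that an instance
        // which is dropped again immediately can be destructed without
        // waiting on handlers that will never be set
        cache_(new std::atomic<uint8_t*>(data)) {}

  // called only for the instance that was actually added to the cache
  // state; afterwards the instance is ready for multithreaded use
  void Init() { cache_->store(nullptr); }
};
\end{verbatim}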
\subsection{Performance Guideline}
Atomic operations come with an added performance penalty. No recent studies were found on this topic, although we assume that the findings of Hermann Schweizer et al. in ``Evaluating the Cost of Atomic Operations on Modern Architectures''~\cite{atomics-cost-analysis} still hold true for today's processor architectures. Due to the inherent cache synchronization mechanisms \cite[Subsection IV.A.3 Off-Die Access]{atomics-cost-analysis}, they observed a significant increase in access latency depending on whether the atomic variable was located on the local core, on a different core of the same chip, or on another socket \cite[Fig. 4]{atomics-cost-analysis}. Reducing the cost of atomic accesses would require a less generic implementation, relaxing some of the guarantees we give in \ref{sec:design:cache}. This would allow reducing the number of atomics required but is outside the scope of this work. \par
With the distributed locking described in \ref{subsec:implementation:cache-state-lock}, lock contention should not have a significant impact, although this remains to be tested. In addition to that, passive waiting at the contended section will benefit other threads and might allow overall progress to continue, if the application utilizing the cache has a threading model supporting this. These two factors lead us to classify lock contention as only a minor performance problem. \par
\section{Accelerator Usage}
Following \ref{subsec:implementation:accel-usage}, the provided implementation of \texttt{Cache} leaves it up to the user to choose a caching and copy method policy, which is accomplished by submitting function pointers at initialization of the \texttt{Cache}. In \ref{sec:state:setup-and-config} we configured our system to expose separate \gls{numa:node}s for accessing \gls{hbm}, whose \gls{numa:node}-IDs are obtained by adding eight to the ID of the \gls{numa:node} that physically contains the \gls{hbm}. Therefore, if \gls{numa:node} 3 accesses some datum, the most efficient placement for the copy would be on \gls{numa:node} \(3 + 8 = 11\). As the \texttt{Cache} is intended for multithreaded usage, conserving accelerator resources is important so that concurrent cache requests complete quickly. To get high per-copy performance while maintaining low resource usage, the smart-copy method described in \ref{sec:perf:datacopy} is selected for larger copies, while small copies are handled exclusively by the current node. This distinction is made due to the overhead of assigning the current thread to the selected nodes, which is required as \gls{intel:dml} assigns submissions only to the \gls{dsa} engine present on the node of the calling thread \cite[Section "NUMA support"]{intel:dmldoc}. No testing has taken place to evaluate this overhead and determine the most effective threshold.
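The policy functions supplied at initialization could then be sketched as follows. The signatures, the fixed node offset of eight and the threshold value are assumptions made for illustration and do not reflect the actual interface.

\begin{verbatim}
#include <cstddef>
#include <vector>

// place the cached copy on the HBM node belonging to the accessing node
int CachePolicy(int accessing_node, size_t /*size*/) {
  return accessing_node + 8;
}

// select the nodes whose DSA engines perform the copy: small copies are
// handled by the current node only, larger ones are split between the
// involved nodes, standing in for the smart-copy method
std::vector<int> CopyPolicy(int accessing_node, int source_node,
                            size_t size) {
  constexpr size_t kSmartCopyThreshold = 64 * 1024 * 1024;  // assumed
  if (size < kSmartCopyThreshold) return {accessing_node};
  return {accessing_node, source_node};
}
\end{verbatim}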