@ -44,7 +44,7 @@ As caching is performed asynchronously, the user may wish to wait on the operati
When multiple consumers wish to access the same memory block through the \texttt{Cache}, we could either provide each with their own entry, or share one entry for all consumers. The first option may cause high load on the accelerator due to multiple copy operations being submited and also increases the memory footprint of the system. The latter option requires synchronization and more complex design. As the cache size is restrictive, the latter was chosen. The already existing \texttt{CacheData} will be extended in scope to handle this by allowing copies of it to be created which must synchronize with each other for \texttt{CacheData::WaitOnCompletion} and \texttt{CacheData::GetDataLocation}. \par
By allowing multiple references to the same entry, memory management becomes a concern. Freeing the allocated block must only take place when all copies of a \texttt{CacheData} instance are destroyed, therefore tying cache entry lifetime to the lifetime of the longest living copy of the original instance. This makes access to the entry legal during the lifetime of any \texttt{CacheData} instance, while also guaranteeing that \texttt{Cache::Clear} will not have any unforseen side effects, as deallocation only takes place when the last consumer has \texttt{CacheData} go out of scope or manually deletes it. \par
@ -58,10 +58,6 @@ Secondly, invalidation is to be performed manually, requiring the programmer to
Due to its reliance on libnuma for memory allocation and thread pinning, \texttt{Cache} will only work on systems where this library is present, excluding, most notably, Windows from the compatibility list. \par
\subsection{Thread Safety Guarantees}
After initialization, all available operations for \texttt{Cache} and \texttt{CacheData} are fully threadsafe but may use locks internally to achieve this. In \ref{chap:implementation} we will go into more detail on how these guarantees are provided and how to optimize the cache for specific use cases that may warrant less restrictive locking. \par
Compared with the challenges of ensuring correct entry lifetime and thread safety, the application of \gls{dsa} for the task of duplicating data is simple, thanks partly to \gls{intel:dml}\cite{intel:dmldoc}. Upon a call to \texttt{Cache::Access} and determining that the given memory pointer is not present in cache, work will be submitted to the Accelerator. Before, however, the desired location must be determined which the user-defined cache placement policy function handles. With the desired placement obtained, the copy policy function then determines, which nodes should take part in the copy operation which is equivalent to selecting the Accelerators following \ref{subsection:dsa-hwarch}. This causes the work to be split upon the available accelerators to which the work descriptors are submitted at this time. The handlers that \gls{intel:dml}\cite{intel:dmldoc} provides will then be moved to the \texttt{CacheData} instance to permit the callee to wait upon caching completion. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed in \ref{chap:implementation}. \par
As the usage of locking and atomics may have a significant impact on performance, their application will be discussed in detail within this section. \todo{extend introductory paragraph}\par
The usage of locking and atomics proved to be the challenging. Their use is performance critical and mistakes may lead to deadlock. Therefore they also constitute the most interesting part of the implementation which is why this chapter will focus extensively on the details of the implementation in regard to these.\par
\subsection{Cache State Lock}\label{subsec:implementation:cache-state-lock}
@ -34,9 +34,19 @@ A map was chosen to represent the current cache state with the key being the mem
Even with this optimization, in scenarios where the \texttt{Cache} is frequently tasked with flushing and re-caching by multiple threads from the same node lock contention will negatively impact performance by delaying cache access. Due to passive waiting, this impact might be less noticeable when other threads on the system are able to make progress during the wait. \par
\subsection{CacheData Reference Counting}
\subsection{CacheData Atomicity}
\subsection{CacheData WaitOnCompletion}
The choice made in \ref{subsec:design:cache-entry-reuse} requires thread safe shared access to the same resource. \texttt{std::shared_ptr,T>} provides a reference counted pointer, which is thread safe for the required operations, making it a prime candidate for this task. An implementation using it was explored but proved to offer its own set of challenges. As we wish to reduce time spent in a locked region, the task is only added to the nodes cache state when locked. Submission takes place outside, which is sensible, as this submitting one task should not hinder accessing another. To achieve the safety for \texttt{CacheData::WaitOnCompletion} outlined in \ref{subsec:design:cache-entry-reuse} this would require the threads to coordinate which thread performs the actual waiting, as we assume the handlers of \gls{intel:dml} to be non-threadsafe. In order to avoid queuing multiple of the same copies, the task must be added before submission. This results in a \texttt{CacheData} instance with invalid cache pointer and no handlers to wait for being available, requiring additional usage of synchronization primitives. With using \texttt{std::shared_ptr<T>} also comes the uncertainty of relying on the implementation to be performant. The standard does not specify whether a lock-free algorithm is to be used and \cite{shared-ptr-perf} suggests abysmal performance for some implementations, although the full article is in korean. No further research was found on this topic. \par
It was therefore decided to implement atomic reference counting for \texttt{CacheData} which means providing a custom constructor and destructor wherein a shared (through a standard pointer however) atomic integer is either incremented or decremented using atomic fetch sub and add operations \cite{cppreference:atomic-operations} to increase or deacrease the reference counter and, in case of deacrese in the destructor signals that the destructor is called for the last reference, perform actual destruction. The invalid state of \texttt{CacheData} achievable is also avoided. To achieve this, the waiting algorithm requires the handlers to be contained in an atomic pointer and the pointer to the cache memory be atomic too. Through this we may use the atomic wait operation which is guaranteed by the standard to be more efficient than simply spinning on Compare-And-Swap \cite{cppreference:atomic-wait}. Some standard implementations achieve this by yielding after a short spin cycle \cite{atomic-wait-details}. \par
Designing the wait to work from any thread was complicated. In the first implementation, a thread would check if the handlers are available and if not atomically wait \cite{cppreference:atomic-wait} on a value change from nullptr. As the handlers are only available after submission, a situation could arise where only one copy of \texttt{CacheData} is capable of actually waiting on them. Lets assume that three threads T_1, T_2 and T_3 wish to access the same resource. T_1 now is the first to call \texttt{CacheData::Access} and therefore adds it to the cache state and will perform the work submission. Before T_1 may submit the work, it is interrupted and T_2 and T_3 obtain access to the incomplete \texttt{CacheData} on which they wait, causing them to see a nullptr for the handlers but invalid cache pointer, leading to atomic wait on the cache pointer. Now T_1 submits the work and sets the handlers, while T_2 and T_3 continue to wait. Now only T_1 can trigger the waiting and is therefore capable of keeping T_2 and T_3 from progressing. This is undesirable as it can lead to deadlocking if by some reason T_1 does not wait and at least may lead to unneccessary delay for T_2 and T_3 if T_1 does not wait immediately. \par
To solve this, a different and more complicated order of waiting operations is required. When waiting, the threads now immediately check whether the cache pointer contains a valid value and return if it does, as nothing has to be waited for in this case. Lets take the same example as before to illustrate the second part of the waiting procedure. T_2 and T_3 now both arrive in this latter section as the cache was invalid at the point in time when waiting was called for. They now atomically wait on the handlers pointer to change, instead of doing it the other way around as before. Now when T_1 supplies the handlers, it also uses \texttt{std::atomic<T>::notify_one}\cite{cppreference:atomic-notify-one} to wake at least one thread waiting on value change of the handlers pointer, if there are any. Through this the exclusion that was observable in the first implementation is already avoided. If nobody is waiting, then the handlers will be set to a valid pointer and a thread may pass the atomic wait instruction later on. Following this wait, the handlers pointer is atomically exchanged \cite{cppreference:atomic-exchange} with nullptr, invalidating it. Now each thread again checks whether it has received a valid local pointer to the handlers from the exchange, if it has then the atomic operation guarantees that is is now in sole posession of the pointer and therefore tasked with actually waiting. All other threads will now regress and call \texttt{CacheData::WaitOnCompletion} again. The solo thread may proceed to wait on the handlers and should update the cache pointer. \par
Some two additional cases must be considered for the latter implementation to be safe. The wait operation first checks for a valid cache pointer and then waits on the handlers becoming valid. After processing the handlers, they are deleted and the pointer therefore invalidated. Should the cache pointer now be invalid as well, deadlocks would ensue. Therefore the thread which exchanged the handlers pointer for a valid local copy must set the cache pointer to a valid value. Should one of the offloaded operations have failed, using the cache pointer is out of question as the datum it references might be invalid. Therefore the cache is set to the source address in this case. Secondly, after one thread has exchanged the pointer locally, threads may collect waiting on the handlers to become available. This can happen when the wait on the handlers takes sufficient amount of time during which both handlers and cache pointer are invalid. After waiting, the responsible thread must therefore signal all \cite{cppreference:atomic-notify-all} threads waiting on the handler to continue. \par
Two types of deadlocks were encountered during testing and have been accounted for. On one hand, it was found that the guarantee of \texttt{std::atomic<T>::wait} to only wake up when the value has changed \cite{cppreference:atomic-wait} is stronger than the promise of waking up all waiting threads with \texttt{std::atomic<T>::notify_all}\cite{cppreference:atomic-notify-all}. Therefore the value of the handler pointer may not be exchanged with nullptr which is the value we wait on. As the highest envisioned address requires the lower 52-bits of current 64-bit wide systems \cite[p. 120]{amd:programmers-manual}\cite[p. 4-2]{intel:programmers-manual} setting all bits of a 64-bit-value yields an invalid pointer which is used as the second invalid state possible. The second type was encountered when after creating a \texttt{CacheData} instance it was determined this exists in the cache already and dropped. As destruction waits on completion in order to ensure that no further jobs require the memory held, a deadlock would arise from the cache and handler pointers both being null and no handlers ever being set due to the instance being deleted immediately. To circumvent this, the constructor of \texttt{CacheData} was modified to point to source memory by default. Only after calling a separate initalization function will \texttt{CacheData} replace this with nullptr, therefore readying the instance for multithreaded usage. \par