\subsection{CacheData Atomicity}
Throughout this section we will use the term \enquote{handler}, coined by \gls{intel:dml}, to refer to an object associated with an operation on the accelerator. Through it, the state of a task may be queried, making the handler our connection to the asynchronously executed task. As we may split a single copy into multiple distinct tasks for submission to multiple \gls{dsa}s, \texttt{CacheData} internally holds a vector of such handlers. \par
The choice made in \ref{subsec:design:cache-entry-reuse} necessitates thread-safe shared access to the same resource. The C++ standard library provides \texttt{std::shared\_ptr<T>}, a reference-counted pointer that is thread-safe for the required operations \cite{cppreference:shared-ptr}, making it a suitable candidate for this task. Although an implementation using it was explored, it presented its own set of challenges. \par
As we aim to minimize the time spent in a locked region, only the task is added to the \gls{numa:node}'s cache state while the lock is held, with submission taking place outside the locked region. We assume the handlers of \gls{intel:dml} to be unsafe for access from multiple threads. To achieve thread safety for \texttt{CacheData::WaitOnCompletion}, outlined in \ref{subsec:design:cache-entry-reuse}, threads need to coordinate which one performs the actual waiting. To avoid queuing multiple copies of the same task, the task must be added to the cache state before submission. These two aspects necessitate atomic modification of the handlers. We therefore use an atomic pointer in \texttt{CacheData} to allow safe exchange and waiting on modification. \par
Using \texttt{std::shared\_ptr<T>} also introduces uncertainty, as we would rely on the implementation to be performant. The standard does not specify whether a lock-free algorithm is to be used, and \cite{shared-ptr-perf} suggests abysmal performance for some implementations, although the full article is only available in Korean. No further research was found on this topic. \par
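The following minimal sketch illustrates the resulting layout of \texttt{CacheData}. The member names and the stand-in handler type are assumptions for illustration only and do not reflect the actual definitions. \par

\begin{verbatim}
#include <atomic>
#include <cstdint>
#include <vector>

struct Handler;  // illustrative stand-in for the handler type of Intel DML

class CacheData {
    // source and destination address of the copy (assumed members)
    uint8_t* src_;
    uint8_t* dst_;

    // pointer to the cached copy, shared between all instances
    // referring to the same entry; nullptr while the copy is invalid
    std::atomic<uint8_t*>* cache_;

    // handlers of the tasks submitted to one or more DSAs, held
    // behind an atomic pointer to allow safe exchange and waiting
    std::atomic<std::vector<Handler>*> handlers_;
};
\end{verbatim}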
\label{fig:impl-cachedata-waitoncompletion}
\end{figure}
As a solution, a more intricate implementation is required. When waiting, each thread now immediately checks whether the cache pointer contains a valid value and returns if it does, as nothing has to be waited for in this case. We will use the same example as before to illustrate the second part of the waiting procedure. Both \(T_2\) and \(T_3\) arrive in this latter section, as the cache was invalid at the point in time when waiting was called for. They now atomically wait on the handlers-pointer to change, instead of the other way around as before. When \(T_1\) supplies the handlers, it also uses \texttt{std::atomic<T>::notify\_one} \cite{cppreference:atomic-notify-one} to wake at least one thread waiting for the value of the handlers-pointer to change, if there are any. Through this, the mutual exclusion observable in the first implementation is already avoided. If nobody is waiting, the handlers are set to a valid pointer and a thread may pass the atomic wait instruction later on. Following this wait, the handlers-pointer is atomically exchanged \cite{cppreference:atomic-exchange} with nullptr, invalidating it. Each thread then checks whether it has received a valid local pointer to the handlers from the exchange. If it has, the atomic operation guarantees that it is now in sole possession of the pointer. The owning thread is tasked with actually waiting, while all other threads regress and call \texttt{CacheData::WaitOnCompletion} again. The solo thread may proceed to wait on the handlers and should update the cache pointer. \par
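The following sketch condenses this procedure, building on the member layout sketched above. \texttt{WaitOnHandlers} is a hypothetical placeholder for processing the handlers, and the edge cases discussed in the following paragraphs are not yet handled. \par

\begin{verbatim}
void CacheData::WaitOnCompletion() {
    // nothing to wait for if the cache pointer is already valid
    if (cache_->load() != nullptr) return;

    // block while the handlers-pointer is nullptr
    handlers_.wait(nullptr);

    // attempt to take sole ownership of the handlers
    std::vector<Handler>* local = handlers_.exchange(nullptr);

    if (local == nullptr) {
        // another thread won the exchange and waits on the
        // handlers; regress and re-check the cache pointer
        WaitOnCompletion();
    } else {
        WaitOnHandlers(local);  // blocks until all tasks completed
        cache_->store(dst_);    // publish the cached copy's location
    }
}
\end{verbatim}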
Additional cases must be considered for the latter implementation to be safe and free of deadlocks. We will now discuss these edge cases and their resolution. \par
The guarantee of \texttt{std::atomic<T>::wait} to only return when the value has changed \cite{cppreference:atomic-wait} was found to be stronger than the promise of \texttt{std::atomic<T>::notify\_all} \cite{cppreference:atomic-notify-all} to wake up all waiting threads: a notified thread whose observed value is unchanged simply resumes waiting. \par
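A minimal, self-contained demonstration of this semantic: the first \texttt{notify\_all} does not release the waiter, because the value it waits on has not changed. \par

\begin{verbatim}
#include <atomic>
#include <thread>

std::atomic<int> value{0};

int main() {
    std::thread waiter([] {
        // returns only once value no longer equals 0
        value.wait(0);
    });

    value.notify_all();  // waiter may wake, sees 0, blocks again
    value.store(1);      // actual change of the value
    value.notify_all();  // now the waiter is allowed to return

    waiter.join();
    return 0;
}
\end{verbatim}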
As visible in Figure \ref{fig:impl-cachedata-waitoncompletion}, if the cache pointer is invalid, we wait while the handlers-pointer is nullptr. To exemplify, we use the following scenario. Both \(T_1\) and \(T_2\) call \texttt{CacheData::WaitOnCompletion}, with \(T_1\) preceding \(T_2\). \(T_1\) exchanges the global handlers-pointer with nullptr, invalidating it. Before \(T_1\) can check the status of the handlers and update the cache pointer, \(T_2\) sees an invalid cache pointer and then waits for the handlers to become available. \par
This has again caused a state of invalidity, similar to those handled in the previous two sections. As the handlers will not become available again after being cleared by \(T_1\), the second consumer, \(T_2\), will now wait indefinitely. This missed update is commonly referred to as the \enquote{ABA-Problem}, for which multiple solutions exist. \par
One could use double-width atomic operations to pair the handlers-pointer with a flag or version counter in a single struct, which would allow resetting the pointer back to nullptr while simultaneously setting the flag to indicate that the exchange took place. Other threads would then wait on the struct changing from the composite of nullptr and flag-unset, allowing them to pass if either the flag is set or the handlers have become non-null. As standard C++ does not yet reliably support the required double-width operations \cite{dwcas-cpp}, we chose to avoid the missed update differently. \par
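A sketch of this rejected scheme follows; whether \texttt{std::atomic} on such a 16-byte struct is lock-free is implementation-defined, which is precisely the uncertainty we sought to avoid. \par

\begin{verbatim}
#include <atomic>
#include <cstdint>
#include <vector>

struct Handler;  // illustrative stand-in for the DML handler type

// 16-byte composite to be exchanged as a single unit via DWCAS
struct alignas(16) TaggedHandlers {
    std::vector<Handler>* ptr;    // the handlers-pointer
    uint64_t              taken;  // set once a thread claimed the handlers
};

std::atomic<TaggedHandlers> handlers{TaggedHandlers{nullptr, 0}};

// A claiming thread would exchange with {nullptr, 1}; waiters would
// wait on {nullptr, 0} and pass once either the flag is set or the
// pointer has become non-null.
\end{verbatim}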
The chosen solution is therefore to not exchange the handlers-pointer with nullptr but with a second invalid value, for which a suitable secondary invalid pointer must be determined. To this end, we introduce a new attribute to \texttt{Cache}, of the same type as the object pointed to by the handlers-pointer. The \texttt{Cache} then shares its address with each instance of \texttt{CacheData}, where it is used in \texttt{CacheData::WaitOnCompletion}. \par
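Sketched below under the same assumptions as before; the dummy object is never dereferenced, only its address is used as the secondary invalid value, and the accessor name is hypothetical. \par

\begin{verbatim}
class Cache {
    // dummy object whose address serves as the secondary
    // invalid value for the handlers-pointer
    std::vector<Handler> invalid_handlers_;

public:
    // each CacheData created by Cache::Access receives this address
    std::vector<Handler>* GetInvalidHandlersValue() {
        return &invalid_handlers_;
    }
};
\end{verbatim}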
This secondary value allows \(T_2\) to pass the wait and then perform the exchange of the handlers itself. \(T_2\) then checks the local copy of the handlers-pointer for validity, where the invalid state now includes both nullptr and the chosen secondary invalid pointer. With this, the deadlock is avoided and \(T_2\) will wait for \(T_1\) to complete the processing of the handlers. \par
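Combining the previous sketches yields the following condensed version of \texttt{CacheData::WaitOnCompletion}. Here, \texttt{invalid\_handlers\_} holds the address shared by the \texttt{Cache}, and the additional notification after the exchange is an assumption of this sketch, allowing remaining waiters to observe the secondary invalid value. \par

\begin{verbatim}
void CacheData::WaitOnCompletion() {
    // return immediately if the cached copy is already valid
    if (cache_->load() != nullptr) return;

    // block only while the handlers-pointer is nullptr; the
    // secondary invalid value lets later threads pass through
    handlers_.wait(nullptr);

    // claim the handlers, leaving the secondary invalid value
    // behind instead of nullptr to avoid the missed update
    std::vector<Handler>* local = handlers_.exchange(invalid_handlers_);
    handlers_.notify_all();

    if (local == nullptr || local == invalid_handlers_) {
        // not the owner: regress, re-checking the cache pointer
        WaitOnCompletion();
    } else {
        WaitOnHandlers(local);  // blocks until all tasks completed
        cache_->store(dst_);    // publish the cached copy's location
    }
}
\end{verbatim}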
\section{Accelerator Usage}
Applying the \texttt{Cache} to \gls{qdp} is straightforward. We adapted the benchmarking code developed by Anna Bartuschka and André Berthold \cite{dimes-prefetching}, calling \texttt{Cache::Access} for both prefetching and cache access. \par
\todo{write about modifications}
During performance analysis of the developed \texttt{Cache}, we found that \gls{intel:dml} does not utilize interrupt-based completion signalling (Section \ref{subsubsec:state:completion-signal}), but instead busy-waits on the completion descriptor being updated. As this busy waiting costs CPU cycles, blocking on task completion is out of the question, requiring code modifications which we now detail. \par
We extended \texttt{CacheData} and \texttt{Cache} to support weak waiting. Through a flag configurable in \texttt{Cache}, all instances of \texttt{CacheData} created through \texttt{Cache::Access} will only check once whether the \gls{dsa} has completed processing the operation, and otherwise return without updating the cache pointer. Calls to \texttt{CacheData::GetDataLocation} can therefore return nullptr, even after waiting, and the user is then responsible for accessing the data through its source location. For latency-critical applications, \texttt{Cache::Access} additionally provides the option of weak access. When set, the function will only return an existing instance of \texttt{CacheData} and therefore does not cause work submission to the \gls{dsa}. \par
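The resulting access pattern might look as follows; the signature of \texttt{Cache::Access} and its return type are assumptions of this sketch. \par

\begin{verbatim}
uint8_t* AccessData(Cache& cache, uint8_t* src, size_t size) {
    // may return an existing entry or submit a new copy,
    // depending on the cache state and the configured flags
    auto data = cache.Access(src, size);

    // with weak waiting: checks completion once, never blocks
    data->WaitOnCompletion();

    // may still be nullptr; fall back to the source location
    uint8_t* location = data->GetDataLocation();
    return location != nullptr ? location : src;
}
\end{verbatim}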