In this chapter, we concentrate on specific implementation details and show how the design promises outlined in Chapter \ref{chap:design} are realized. First, we discuss the use of locking and atomics to achieve thread safety. Subsequently, we provide an example of the policy functions alluded to in Section \ref{sec:design:accel-usage}. Finally, we apply the cache to \glsentrylong{qdp} and present solutions for the challenges encountered. \par
\section{Locking and Usage of Atomics}
Due to the possibility of access by multiple threads, the implementation of \texttt{CacheData::WaitOnCompletion} proved to be challenging. In the first implementation, a thread would check if the handlers are available and atomically wait \cite{cppreference:atomic-wait} on a value change from \texttt{nullptr}, if they are not. As the handlers are only available after submission, a situation could arise where only one copy of \texttt{CacheData} is capable of actually waiting on them. \par
To illustrate this, consider the scenario shown in the sequence diagram in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}. Assume that three threads \(T_1\), \(T_2\) and \(T_3\) wish to access the same resource. \(T_1\) is the first to call \texttt{Cache::Access}, therefore adds the entry to the cache state and will perform the work submission. Before \(T_1\) can submit the work, it is interrupted, and \(T_2\) and \(T_3\) obtain access to the incomplete \texttt{CacheData}, on which they wait. They see \texttt{nullptr} for the handlers and an invalid cache pointer, which leads them to atomically wait on the cache pointer (blue lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). \(T_1\) then submits the work and sets the handlers (red lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) and \(T_3\) continue to wait. Consequently, only \(T_1\) can wait on the handlers, and \(T_2\) and \(T_3\) cannot progress until it does. This is undesirable: it can lead to a deadlock if, for some reason, \(T_1\) never waits, and at the very least it delays \(T_2\) and \(T_3\) unnecessarily if \(T_1\) does not wait immediately. \par
\begin{figure}[h!tb]
\centering
\label{fig:impl-cachedata-waitoncompletion}
\end{figure}
To solve this, a more intricate implementation is required. When waiting, a thread now first checks whether the cache pointer contains a valid value and returns immediately if it does, as there is nothing to wait for in this case. We use the same example as before to illustrate the second part of the waiting procedure. Both \(T_2\) and \(T_3\) reach this latter part, as the cache pointer was invalid when they started waiting. They now atomically wait for the handlers-pointer to change, instead of waiting on the cache pointer as before. When \(T_1\) supplies the handlers, it also calls \texttt{std::atomic<T>::notify\_one}\cite{cppreference:atomic-notify-one} to wake at least one thread waiting for a change of the handlers-pointer, if there is any. This already avoids the exclusion observed in the first implementation. If no thread is waiting, the handlers are simply set to a valid pointer, and a thread arriving later passes the atomic wait without blocking. Following the wait, each thread atomically exchanges \cite{cppreference:atomic-exchange} the handlers-pointer with \texttt{nullptr}, invalidating it, and then checks whether the exchange returned a valid local pointer to the handlers. If it did, the atomic operation guarantees that this thread is now in sole possession of the pointer. This owning thread is tasked with actually waiting on the handlers and updates the cache pointer afterwards. All other threads start over and call \texttt{CacheData::WaitOnCompletion} again. \par
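The ownership guarantee of the atomic exchange can be illustrated in isolation. The following is a minimal sketch and not the implementation of \texttt{CacheData}: the function \texttt{CountOwners} and the integer payload standing in for the handlers are hypothetical. Several threads exchange the same pointer with \texttt{nullptr}, and exactly one of them receives the valid value.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: lets `num_threads` threads race for one pointer and
// returns how many of them obtained the valid value from the exchange.
int CountOwners(int num_threads) {
    int payload = 42;                      // stands in for the handlers
    std::atomic<int*> handlers{&payload};  // shared handlers-pointer
    std::atomic<int> owners{0};

    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i) {
        threads.emplace_back([&] {
            // Read the old value and invalidate the shared pointer in one step.
            int* local = handlers.exchange(nullptr);
            if (local != nullptr) {
                owners.fetch_add(1);  // this thread is the sole owner
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    return owners.load();
}
```

Regardless of the scheduling, only the first exchange can observe the valid pointer, so the function returns one for any positive number of threads.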
Additional cases must be considered for the latter implementation to be safe and free of deadlocks. We will now discuss these edge cases and their resolution. \par
We previously mentioned the potentially problematic situation in which neither the cache pointer nor the handlers are available yet for an instance of \texttt{CacheData}. The implementation handles this situation explicitly by waiting for the handlers to be atomically updated from \texttt{nullptr} to a valid pointer. As the handlers will eventually be set by the thread that called \texttt{Cache::Access} first, progress is guaranteed. \par
\subsubsection{Invalid State on Immediate Destruction}
\subsubsection{Invalid State on Operation Failure}
\texttt{CacheData::WaitOnCompletion} first checks for a valid cache pointer and then waits for the handlers to become valid. To process the handlers, the global atomic pointer is read into a local copy and set to \texttt{nullptr} in a single step using \texttt{std::atomic<T>::exchange}. While evaluating the handlers' completion states, an unsuccessful operation may be found. In this case, the cache memory region remains invalid and must not be used, so both the handlers and the cache pointer are \texttt{nullptr}. This results in an invalid state like the one discussed in Section \ref{subsubsec:impl:cdatomicity:initial-invalid-state}. \par
In this invalid state, progress is not guaranteed by the measures set forth to handle the initial invalidity. The cache is still \texttt{nullptr} and as the handlers have already been set and processed, they will also be \texttt{nullptr} without the chance of them ever becoming valid. \par
To resolve this, edge-case handling is introduced: when an operation has failed, the cache pointer is set to the source address, which makes it valid again. \par
The guarantee of \texttt{std::atomic<T>::wait} to only wake up when the value has changed \cite{cppreference:atomic-wait} was found to be stronger than the promise of waking up all waiting threads with \texttt{std::atomic<T>::notify\_all}\cite{cppreference:atomic-notify-all}. \par
As shown in Figure \ref{fig:impl-cachedata-waitoncompletion}, a thread waits while the handlers-pointer is \texttt{nullptr} if the cache pointer is invalid. To illustrate the problem, consider the following scenario. Both \(T_1\) and \(T_2\) call \texttt{CacheData::WaitOnCompletion}, with \(T_1\) preceding \(T_2\). \(T_1\) exchanges the global handlers-pointer with \texttt{nullptr}, invalidating it. Before \(T_1\) can check the status of the handlers and update the cache pointer, \(T_2\) sees an invalid cache pointer and then waits for the handlers to become available. \par
This again results in a state of invalidity similar to the ones handled in the previous two sections. As the handlers will not become available again after being cleared by \(T_1\), the second consumer, \(T_2\), now waits indefinitely. This missed update is commonly referred to as the \enquote{ABA problem}, for which multiple solutions exist. \par
One could use double-width atomic operations and introduce a flag, which would allow resetting the pointer to null while indicating that the exchange took place. The handlers-pointer would then be contained in a struct together with this flag, allowing an exchange with a composite of \texttt{nullptr} and a set flag. Other threads would then wait for the struct to change from \texttt{nullptr} with an unset flag, allowing them to pass if either the flag is set or the handlers have become non-null. As standard C++ does not yet support the required operations \cite{dwcas-cpp}, we chose to avoid the missed update differently. \par
The chosen solution for this is to not exchange the handlers-pointer with \texttt{nullptr} but with a second invalid value. We must determine a secondary invalid pointer for use in the exchange. Therefore, we introduce a new attribute, of the same type as the one pointed to by the handlers-pointer, to \texttt{Cache}. The \texttt{Cache} then shares it with each instance of \texttt{CacheData}, where it is then used in \texttt{CacheData::WaitOnCompletion}. \par
This secondary value allows \(T_2\) to pass the wait and then perform the exchange of the handlers itself. \(T_2\) then checks its local copy of the handlers-pointer for validity, where the invalid state now includes both \texttt{nullptr} and the chosen secondary invalid pointer. With this, the deadlock is avoided, and \(T_2\) waits for \(T_1\) to complete the processing of the handlers. \par
\section{Accelerator Usage}
As described in Section \ref{sec:design:accel-usage}, the \texttt{Cache} leaves the choice of caching and copy policy to the user, who supplies function pointers when initializing the \texttt{Cache}. In Section \ref{sec:state:setup-and-config}, we configured our system to have separate \gls{numa:node}s for accessing \gls{hbm}. Each of these is assigned a \gls{numa:node}-ID by adding eight to the ID of the \gls{numa:node} that physically contains the \gls{hbm}. Therefore, if \gls{numa:node} 3 accesses some datum, the most efficient placement for the copy is on \gls{numa:node} \(3+8=11\). As the \texttt{Cache} is intended for multithreaded usage, conserving accelerator resources is important so that concurrent cache requests complete quickly. To achieve high per-copy performance while keeping accelerator usage low, the Push-Pull method described in Section \ref{subsec:perf:datacopy} is selected for larger copies, while small copies are handled exclusively by the current node. We draw this distinction by datum size because, as observed in Section \ref{subsec:perf:submitmethod}, the effect of submission cost is amplified at smaller transfer sizes. \par
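A pair of policy functions along these lines could look as follows. This is a sketch only: the names \texttt{HbmNodeFor} and \texttt{CopyNodesFor}, their signatures and the parameter \texttt{split\_threshold} are assumptions and do not reflect the actual interface of the \texttt{Cache}. We further assume that Push-Pull engages the accelerators associated with both the source and the destination node, and leave the mapping of an \gls{hbm} node to its physical node out of the sketch.

```cpp
#include <cstddef>
#include <vector>

// From the system setup: HBM node ID = ID of the physical node + 8.
constexpr int kHbmNodeOffset = 8;

// Placement policy: cache the copy in the HBM of the accessing node.
int HbmNodeFor(int accessing_node) {
    return accessing_node + kHbmNodeOffset;
}

// Copy policy: returns the nodes whose accelerators take part in the copy.
// `split_threshold` is a hypothetical tuning parameter in bytes.
std::vector<int> CopyNodesFor(int current_node, int src_node, int dst_node,
                              std::size_t size, std::size_t split_threshold) {
    if (size < split_threshold) {
        return {current_node};  // small copy: current node only
    }
    return {src_node, dst_node};  // large copy: assumed Push-Pull pair
}
```

For the example from the text, \texttt{HbmNodeFor(3)} yields node 11. The threshold is deliberately left as a parameter, since a suitable value depends on the measured submission cost.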
\section{Application to \glsentrylong{qdp}}
Applying the \texttt{Cache} to \gls{qdp} is straightforward. We adapted the benchmarking code developed by Anna Bartuschka and André Berthold \cite{dimes-prefetching}, calling \texttt{Cache::Access} for both prefetching and cache access. Due to the high number of small submissions, we decided to forgo splitting tasks across multiple \gls{dsa} and instead distribute the copy tasks of each thread in round-robin fashion across all available \gls{dsa}. This reduces the delay caused by submission cost, whose impact, as shown in Section \ref{subsec:perf:submitmethod}, grows as tasks become smaller. The cache location is fixed to \gls{numa:node} 8, the \gls{hbm} accessor of \gls{numa:node} 0, to which the application is bound and on which it therefore runs exclusively. \par
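The per-thread round-robin distribution can be sketched as a small selector. The class \texttt{RoundRobinSketch} is hypothetical and assumes that the available \gls{dsa} are identified by the ID of their node. Each thread would own one instance, for example as a \texttt{thread\_local} variable, so that no synchronization is required.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-thread selector cycling through the available accelerators.
class RoundRobinSketch {
public:
    // `dsa_nodes` must not be empty.
    explicit RoundRobinSketch(std::vector<int> dsa_nodes)
        : nodes_(std::move(dsa_nodes)) {}

    // Returns the node for the next copy task and advances the position.
    int Next() {
        const int node = nodes_[next_];
        next_ = (next_ + 1) % nodes_.size();
        return node;
    }

private:
    std::vector<int> nodes_;
    std::size_t next_ = 0;
};
```

Every copy task is submitted as a whole to a single accelerator, so each task incurs the submission cost only once.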
During the performance analysis of the developed \texttt{Cache}, we found that \gls{dml} does not use interrupt-based completion signaling (Section \ref{subsubsec:state:completion-signal}), but instead busy-waits for the completion descriptor to be updated. As this busy-waiting consumes CPU cycles, blocking on task completion is out of the question, which required code modifications that we detail below. \par
We extended \texttt{CacheData} and \texttt{Cache} to support weak waiting. When a flag configurable in \texttt{Cache} is set, all instances of \texttt{CacheData} created through \texttt{Cache::Access} check only once whether the \gls{dsa} has completed the operation and, if it has not, return without updating the cache pointer. Consequently, calls to \texttt{CacheData::GetDataLocation} may return \texttt{nullptr} even after waiting, and the user is then responsible for accessing the data through its source location. For latency-critical applications, \texttt{Cache::Access} additionally offers the option of weak access. When set, the function only returns existing instances of \texttt{CacheData} and thereby avoids work submission to the \gls{dsa} if the address has not been cached previously or was flushed since the last access. \par
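On the caller side, weak waiting implies a fallback to the source location. A minimal sketch of this pattern is given below, where \texttt{WeakCacheDataSketch} is a reduced, hypothetical stand-in for \texttt{CacheData} and \texttt{ResolveLocation} is an assumed helper that is not part of the \texttt{Cache} interface.

```cpp
// Hypothetical, reduced stand-in for CacheData under weak waiting.
struct WeakCacheDataSketch {
    const void* cached = nullptr;  // stays nullptr if the copy is unfinished
    const void* GetDataLocation() const { return cached; }
};

// Returns the cached copy if it is available, otherwise the source location.
template <typename T>
const T* ResolveLocation(const WeakCacheDataSketch& data, const T* source) {
    const T* cached = static_cast<const T*>(data.GetDataLocation());
    return cached != nullptr ? cached : source;
}
```

The access itself then never blocks: if the copy is not ready, the caller simply reads from the source location.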
Additionally, we observed inefficiencies stemming from page fault handling. Task execution time increases when page faults are handled by the \gls{dsa}, leading to cache misses. Consequently, our execution time becomes bound to that of \gls{dram}, as misses prompt a fallback to the data's source location. When page faults are handled by the CPU during allocation, these misses are avoided. However, the execution time of the first data access through the \texttt{Cache} significantly increases due to page fault handling. One potential solution entails bypassing the system's memory management by allocating a large memory block and implementing a custom memory management scheme. As memory allocation is a complex topic, we opted to delegate this responsibility to the user by mandating the provision of new- and free-like functions akin to the policy functions utilized for determining placement and task distribution. Consequently, the benchmark can pre-allocate the required memory blocks, trigger page mapping, and subsequently pass these regions to the \texttt{Cache}. \par
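A pre-allocated memory pool of the kind a benchmark could supply is sketched below. The class \texttt{PreallocatedPoolSketch} and its member functions are hypothetical, and the actual signature expected by the \texttt{Cache} is not shown here. The sketch also ignores \gls{numa:node} placement and alignment, which a real allocator would have to handle.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical bump allocator over a block that is mapped before first use.
class PreallocatedPoolSketch {
public:
    explicit PreallocatedPoolSketch(std::size_t bytes) : block_(bytes) {
        // std::vector already zero-fills the block; the explicit write only
        // makes visible that all pages are touched before the first access.
        if (!block_.empty()) {
            std::memset(block_.data(), 0, block_.size());
        }
    }

    // new-like function: hands out the next free region, or nullptr if full.
    void* Allocate(std::size_t size) {
        if (size > block_.size() - offset_) {
            return nullptr;
        }
        void* region = block_.data() + offset_;
        offset_ += size;
        return region;
    }

    // free-like function: a no-op here, the block is released as a whole.
    void Free(void* /*region*/) {}

private:
    std::vector<unsigned char> block_;
    std::size_t offset_ = 0;
};
```

As all pages are mapped when the pool is constructed, neither the first access through the cache nor the accelerator encounters page faults on these regions.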