\subsection{CacheData Atomicity}
Throughout this section we will use the term \enquote{handler}, coined by \gls{intel:dml}, to refer to an object associated with an operation on the accelerator. Through it, the state of a task may be queried, making the handler our connection to the asynchronously executed task. As we may split a single copy into multiple distinct tasks for submission to multiple \gls{dsa}s, \texttt{CacheData} internally holds a vector of such handlers. \par
The choice made in \ref{subsec:design:cache-entry-reuse} necessitates thread-safe shared access to the same resource. The C++ standard library provides \texttt{std::shared\_ptr<T>}, a reference-counted pointer that is thread-safe for the required operations \cite{cppreference:shared-ptr}, making it a suitable candidate for this task. Although an implementation using it was explored, it presented its own set of challenges. \par
As we aim to minimize the time spent in a locked region, only the task is added to the \gls{numa:node}'s cache state while the lock is held, with submission taking place outside the locked region. We assume the handlers of \gls{intel:dml} to be unsafe for access from multiple threads. To achieve thread safety for \texttt{CacheData::WaitOnCompletion}, outlined in \ref{subsec:design:cache-entry-reuse}, threads need to coordinate which one performs the actual waiting. To avoid queuing multiple copies of the same task, the task must be added to the cache state before submission. These two aspects necessitate atomic modification of the handlers. We therefore use an atomic pointer in \texttt{CacheData} to allow safe exchange and waiting on modification. \par
Using \texttt{std::shared\_ptr<T>} also introduces uncertainty, as we would rely on the implementation to be performant. The standard does not specify whether a lock-free algorithm is to be used, and \cite{shared-ptr-perf} suggests abysmal performance for some implementations, although the full article is only available in Korean. No further research was found on this topic. \par
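The following minimal sketch illustrates the resulting layout of \texttt{CacheData}. The member names and the stand-in handler type are assumptions for illustration only and do not reflect the actual definitions. \par

\begin{verbatim}
#include <atomic>
#include <cstdint>
#include <vector>

struct Handler;  // illustrative stand-in for the handler type of Intel DML

class CacheData {
    // source and destination address of the copy (assumed members)
    uint8_t* src_;
    uint8_t* dst_;

    // pointer to the cached copy, shared between all instances
    // referring to the same entry; nullptr while the copy is invalid
    std::atomic<uint8_t*>* cache_;

    // handlers of the tasks submitted to one or more DSAs, held
    // behind an atomic pointer to allow safe exchange and waiting
    std::atomic<std::vector<Handler>*> handlers_;
};
\end{verbatim}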
\label{fig:impl-cachedata-waitoncompletion}
\end{figure}
As a solution, a more intricate implementation is required. When waiting, each thread now immediately checks whether the cache pointer contains a valid value and returns if it does, as nothing has to be waited for in this case. We will use the same example as before to illustrate the second part of the waiting procedure. Both \(T_2\) and \(T_3\) arrive in this latter section, as the cache was invalid at the point in time when waiting was called for. They now atomically wait on the handlers-pointer to change, instead of the other way around as before. When \(T_1\) supplies the handlers, it also uses \texttt{std::atomic<T>::notify\_one} \cite{cppreference:atomic-notify-one} to wake at least one thread waiting for the value of the handlers-pointer to change, if there are any. Through this, the mutual exclusion observable in the first implementation is already avoided. If nobody is waiting, the handlers are set to a valid pointer and a thread may pass the atomic wait instruction later on. Following this wait, the handlers-pointer is atomically exchanged \cite{cppreference:atomic-exchange} with nullptr, invalidating it. Each thread then checks whether it has received a valid local pointer to the handlers from the exchange. If it has, the atomic operation guarantees that it is now in sole possession of the pointer. The owning thread is tasked with actually waiting, while all other threads regress and call \texttt{CacheData::WaitOnCompletion} again. The solo thread may proceed to wait on the handlers and should update the cache pointer. \par
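The following sketch condenses this procedure, building on the member layout sketched above. \texttt{WaitOnHandlers} is a hypothetical placeholder for processing the handlers, and the edge cases discussed in the following paragraphs are not yet handled. \par

\begin{verbatim}
void CacheData::WaitOnCompletion() {
    // nothing to wait for if the cache pointer is already valid
    if (cache_->load() != nullptr) return;

    // block while the handlers-pointer is nullptr
    handlers_.wait(nullptr);

    // attempt to take sole ownership of the handlers
    std::vector<Handler>* local = handlers_.exchange(nullptr);

    if (local == nullptr) {
        // another thread won the exchange and waits on the
        // handlers; regress and re-check the cache pointer
        WaitOnCompletion();
    } else {
        WaitOnHandlers(local);  // blocks until all tasks completed
        cache_->store(dst_);    // publish the cached copy's location
    }
}
\end{verbatim}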
Additional cases must be considered for the latter implementation to be safe and free of deadlocks. We will now discuss these edge cases and their resolution. \par
The guarantee of \texttt{std::atomic<T>::wait} to only return when the value has changed \cite{cppreference:atomic-wait} was found to be stronger than the promise of \texttt{std::atomic<T>::notify\_all} \cite{cppreference:atomic-notify-all} to wake up all waiting threads: a notified thread whose observed value is unchanged simply resumes waiting. \par
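A minimal, self-contained demonstration of this semantic: the first \texttt{notify\_all} does not release the waiter, because the value it waits on has not changed. \par

\begin{verbatim}
#include <atomic>
#include <thread>

std::atomic<int> value{0};

int main() {
    std::thread waiter([] {
        // returns only once value no longer equals 0
        value.wait(0);
    });

    value.notify_all();  // waiter may wake, sees 0, blocks again
    value.store(1);      // actual change of the value
    value.notify_all();  // now the waiter is allowed to return

    waiter.join();
    return 0;
}
\end{verbatim}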
As visible in Figure \ref{fig:impl-cachedata-waitoncompletion}, if the cache pointer is invalid, we wait while the handlers-pointer is nullptr. To exemplify, we use the following scenario. Both \(T_1\) and \(T_2\) call \texttt{CacheData::WaitOnCompletion}, with \(T_1\) preceding \(T_2\). \(T_1\) exchanges the global handlers-pointer with nullptr, invalidating it. Before \(T_1\) can check the status of the handlers and update the cache pointer, \(T_2\) sees an invalid cache pointer and then waits for the handlers to become available. \par
This has again caused a state of invalidity, similar to those handled in the previous two sections. As the handlers will not become available again after being cleared by \(T_1\), the second consumer, \(T_2\), will now wait indefinitely. This missed update is commonly referred to as the \enquote{ABA-Problem}, for which multiple solutions exist. \par
One could use double-width atomic operations to pair the handlers-pointer with a flag or version counter in a single struct, which would allow resetting the pointer back to nullptr while simultaneously setting the flag to indicate that the exchange took place. Other threads would then wait on the struct changing from the composite of nullptr and flag-unset, allowing them to pass if either the flag is set or the handlers have become non-null. As standard C++ does not yet reliably support the required double-width operations \cite{dwcas-cpp}, we chose to avoid the missed update differently. \par
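A sketch of this rejected scheme follows; whether \texttt{std::atomic} on such a 16-byte struct is lock-free is implementation-defined, which is precisely the uncertainty we sought to avoid. \par

\begin{verbatim}
#include <atomic>
#include <cstdint>
#include <vector>

struct Handler;  // illustrative stand-in for the DML handler type

// 16-byte composite to be exchanged as a single unit via DWCAS
struct alignas(16) TaggedHandlers {
    std::vector<Handler>* ptr;    // the handlers-pointer
    uint64_t              taken;  // set once a thread claimed the handlers
};

std::atomic<TaggedHandlers> handlers{TaggedHandlers{nullptr, 0}};

// A claiming thread would exchange with {nullptr, 1}; waiters would
// wait on {nullptr, 0} and pass once either the flag is set or the
// pointer has become non-null.
\end{verbatim}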
The chosen solution is therefore to not exchange the handlers-pointer with nullptr but with a second invalid value, for which a suitable secondary invalid pointer must be determined. To this end, we introduce a new attribute to \texttt{Cache}, of the same type as the object pointed to by the handlers-pointer. The \texttt{Cache} then shares its address with each instance of \texttt{CacheData}, where it is used in \texttt{CacheData::WaitOnCompletion}. \par
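Sketched below under the same assumptions as before; the dummy object is never dereferenced, only its address is used as the secondary invalid value, and the accessor name is hypothetical. \par

\begin{verbatim}
class Cache {
    // dummy object whose address serves as the secondary
    // invalid value for the handlers-pointer
    std::vector<Handler> invalid_handlers_;

public:
    // each CacheData created by Cache::Access receives this address
    std::vector<Handler>* GetInvalidHandlersValue() {
        return &invalid_handlers_;
    }
};
\end{verbatim}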
This secondary value allows \(T_2\) to pass the wait and then perform the exchange of the handlers itself. \(T_2\) then checks the local copy of the handlers-pointer for validity, where the invalid state now includes both nullptr and the chosen secondary invalid pointer. With this, the deadlock is avoided and \(T_2\) will wait for \(T_1\) to complete the processing of the handlers. \par
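Combining the previous sketches yields the following condensed version of \texttt{CacheData::WaitOnCompletion}. Here, \texttt{invalid\_handlers\_} holds the address shared by the \texttt{Cache}, and the additional notification after the exchange is an assumption of this sketch, allowing remaining waiters to observe the secondary invalid value. \par

\begin{verbatim}
void CacheData::WaitOnCompletion() {
    // return immediately if the cached copy is already valid
    if (cache_->load() != nullptr) return;

    // block only while the handlers-pointer is nullptr; the
    // secondary invalid value lets later threads pass through
    handlers_.wait(nullptr);

    // claim the handlers, leaving the secondary invalid value
    // behind instead of nullptr to avoid the missed update
    std::vector<Handler>* local = handlers_.exchange(invalid_handlers_);
    handlers_.notify_all();

    if (local == nullptr || local == invalid_handlers_) {
        // not the owner: regress, re-checking the cache pointer
        WaitOnCompletion();
    } else {
        WaitOnHandlers(local);  // blocks until all tasks completed
        cache_->store(dst_);    // publish the cached copy's location
    }
}
\end{verbatim}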
\section{Accelerator Usage}
Applying the \texttt{Cache} to \gls{qdp} is straightforward. We adapted the benchmarking code developed by Anna Bartuschka and André Berthold \cite{dimes-prefetching}, calling \texttt{Cache::Access} for both prefetching and cache access. \par
\todo{write about modifications}
During performance analysis of the developed \texttt{Cache}, we found that \gls{intel:dml} does not utilize interrupt-based completion signalling (Section \ref{subsubsec:state:completion-signal}), but instead busy-waits on the completion descriptor being updated. As this busy waiting costs CPU cycles, blocking on task completion is out of the question, requiring code modifications which we now detail. \par
We extended \texttt{CacheData} and \texttt{Cache} to support weak waiting. Through a flag configurable in \texttt{Cache}, all instances of \texttt{CacheData} created through \texttt{Cache::Access} will only check once whether the \gls{dsa} has completed processing the operation, and otherwise return without updating the cache pointer. Calls to \texttt{CacheData::GetDataLocation} can therefore return nullptr, even after waiting, and the user is then responsible for accessing the data through its source location. For latency-critical applications, \texttt{Cache::Access} additionally provides the option of weak access. When set, the function will only return an existing instance of \texttt{CacheData} and therefore does not cause work submission to the \gls{dsa}. \par
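The resulting access pattern might look as follows; the signature of \texttt{Cache::Access} and its return type are assumptions of this sketch. \par

\begin{verbatim}
uint8_t* AccessData(Cache& cache, uint8_t* src, size_t size) {
    // may return an existing entry or submit a new copy,
    // depending on the cache state and the configured flags
    auto data = cache.Access(src, size);

    // with weak waiting: checks completion once, never blocks
    data->WaitOnCompletion();

    // may still be nullptr; fall back to the source location
    uint8_t* location = data->GetDataLocation();
    return location != nullptr ? location : src;
}
\end{verbatim}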