apply recommendations of andre to chapter 5

11 months ago · 7572350b28
1 changed files with 36 additions and 11 deletions
--- a/thesis/content/50_implementation.tex
+++ b/thesis/content/50_implementation.tex
@ -38,13 +38,11 @@ Even with this optimization, in scenarios where the \texttt{Cache} is frequently
 The choice made in \ref{subsec:design:cache-entry-reuse} necessitates thread-safe shared access to the same resource. The C++ standard library provides \texttt{std::shared\_ptr<T>}, a reference-counted pointer that is thread-safe for the required operations \cite{cppreference:shared-ptr}, making it a suitable candidate for this task. Although an implementation using it was explored, it presented its own set of challenges. \par
 As we aim to minimize the time spent in a locked region, only the task is added to the \gls{numa:node}'s cache state when locked, with the submission taking place outside the locked region. We assume the handlers of \gls{intel:dml} to be unsafe for access from multiple threads. To achieve the safety for \texttt{CacheData::WaitOnCompletion} outlined in \ref{subsec:design:cache-entry-reuse}, threads need to coordinate which one performs the actual waiting. To avoid queuing multiple copies of the same task, the task must be added before submission. This results in a \texttt{CacheData} instance with an invalid cache pointer and no handlers to wait for, presenting an edge case to be considered. \par
 As we aim to minimize the time spent in a locked region, only the task is added to the \gls{numa:node}'s cache state when locked, with the submission taking place outside the locked region. We assume the handlers of \gls{intel:dml} to be unsafe for access from multiple threads. To achieve the safety for \texttt{CacheData::WaitOnCompletion}, outlined in \ref{subsec:design:cache-entry-reuse}, threads need to coordinate which one performs the actual waiting. To avoid queuing multiple copies of the same task, the task must be added before submission. This results in a \texttt{CacheData} instance with an invalid cache pointer and no handlers to wait for, presenting an edge case to be considered. \par
 Using \texttt{std::shared\_ptr<T>} also introduces uncertainty, relying on the implementation to be performant. The standard does not specify whether a lock-free algorithm is to be used, and \cite{shared-ptr-perf} suggests abysmal performance for some implementations, although the full article is in Korean. No further research was found on this topic. \par
 Therefore, the decision was made to implement atomic reference counting for \texttt{CacheData}. This involves providing a custom constructor and destructor wherein a shared (though a standard pointer) atomic integer is either incremented or decremented using atomic fetch sub and add operations \cite{cppreference:atomic-operations} to modify the reference count. In the case of a decrease to zero, the destructor was called for the last reference and then performs the actual destruction. \par
 Additionally, the invalid state of \texttt{CacheData} is avoided. To achieve this, the waiting algorithm requires the handlers to be contained in an atomic pointer, and the pointer to the cache memory must be atomic too. This enables the use of the atomic wait operation, which is guaranteed by the standard to be more efficient than simply spinning on Compare-And-Swap \cite{cppreference:atomic-wait}. Some standard implementations achieve this by yielding after a short spin cycle \cite{atomic-wait-details}. \par
 Therefore, the decision was made to implement atomic reference counting for \texttt{CacheData}. This involves providing a custom constructor and destructor wherein a shared atomic integer is either incremented or decremented using atomic fetch sub and add operations \cite{cppreference:atomic-operations} to modify the reference count. In the case of a decrease to zero, the destructor was called for the last reference and then performs the actual destruction. \par
 \begin{figure}[H]
    \centering
@ -53,7 +51,9 @@ Additionally, the invalid state of \texttt{CacheData} is avoided. To achieve thi
    \label{fig:impl-cachedata-threadseq-waitoncompletion}
 \end{figure}
 Designing the wait to work from any thread was complicated. In the first implementation, a thread would check if the handlers are available and if not atomically wait \cite{cppreference:atomic-wait} on a value change from nullptr. As the handlers are only available after submission, a situation could arise where only one copy of \texttt{CacheData} is capable of actually waiting on them. Lets assume that three threads \(T_1\), \(T_2\) and \(T_3\) wish to access the same resource. \(T_1\) now is the first to call \texttt{CacheData::Access} and therefore adds it to the cache state and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted and \(T_2\) and \(T_3\) obtain access to the incomplete \texttt{CacheData} on which they wait, causing them to see a nullptr for the handlers but invalid cache pointer, leading to atomic wait on the cache pointer (marked blue lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). Now \(T_1\) submits the work and sets the handlers (marked red lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) and \(T_3\) continue to wait. Now only \(T_1\) can trigger the waiting and is therefore capable of keeping \(T_2\) and \(T_3\) from progressing. This is undesirable as it can lead to deadlocking if by some reason \(T_1\) does not wait and at the very least may lead to unnecessary delay for \(T_2\) and \(T_3\) if \(T_1\) does not wait immediately. \par
 Due to the possibility of access by multiple threads, the implementation of \texttt{CacheData::WaitOnCompletion} proved to be challenging. In the first implementation, a thread would check if the handlers are available and atomically wait \cite{cppreference:atomic-wait} on a value change from nullptr, if they are not. As the handlers are only available after submission, a situation could arise where only one copy of \texttt{CacheData} is capable of actually waiting on them. \par
 To illustrate this, an exemplary scenario is used, as seen in the sequence diagram Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}. Assume that three threads \(T_1\), \(T_2\) and \(T_3\) wish to access the same resource. \(T_1\)  is the first to call \texttt{CacheData::Access} and therefore adds it to the cache state and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted and \(T_2\) and \(T_3\) obtain access to the incomplete \texttt{CacheData} on which they wait, causing them to see a nullptr for the handlers but invalid cache pointer, leading to atomic wait on the cache pointer (marked blue lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). \(T_1\) submits the work and sets the handlers (marked red lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) and \(T_3\) continue to wait. Therefore, only \(T_1\) can trigger the waiting and is therefore capable of keeping \(T_2\) and \(T_3\) from progressing. This is undesirable as it can lead to deadlocking if by some reason \(T_1\) does not wait and at the very least may lead to unnecessary delay for \(T_2\) and \(T_3\) if \(T_1\) does not wait immediately. \par
 \begin{figure}[H]
    \centering
@ -62,17 +62,42 @@ Designing the wait to work from any thread was complicated. In the first impleme
    \label{fig:impl-cachedata-waitoncompletion}
 \end{figure}
 To solve this, a different and more complicated order of waiting operations is required. When waiting, the threads now immediately check whether the cache pointer contains a valid value and return if it does, as nothing has to be waited for in this case. Let's take the same example as before to illustrate the second part of the waiting procedure. \(T_2\) and \(T_3\) now both arrive in this latter section as the cache was invalid at the point in time when waiting was called for. They now atomically wait on the handlers pointer to change, instead of doing it the other way around as before. Now when \(T_1\) supplies the handlers, it also uses \texttt{std::atomic<T>::notify\_one} \cite{cppreference:atomic-notify-one} to wake at least one thread waiting on value change of the handlers pointer, if there are any. Through this the exclusion that was observable in the first implementation is already avoided. If nobody is waiting, then the handlers will be set to a valid pointer and a thread may pass the atomic wait instruction later on. Following this wait, the handlers pointer is atomically exchanged \cite{cppreference:atomic-exchange} with nullptr, invalidating it. Now each thread again checks whether it has received a valid local pointer to the handlers from the exchange, if it has then the atomic operation guarantees that is now in sole possession of the pointer. The owning thread is tasked with actually waiting. All other threads will now regress and call \texttt{CacheData::WaitOnCompletion} again. The solo thread may proceed to wait on the handlers and should update the cache pointer. \par
 As a solution for this, a more intricate implementation is required. When waiting, the threads now immediately check whether the cache pointer contains a valid value and return if it does, as nothing has to be waited for in this case. We will use the same example as before to illustrate the second part of the waiting procedure. Both \(T_2\) and \(T_3\) arrive in this latter section as the cache was invalid at the point in time when waiting was called for. They now atomically wait on the handlers pointer to change, instead of doing it the other way around as before. Now when \(T_1\) supplies the handlers, it also uses \texttt{std::atomic<T>::notify\_one} \cite{cppreference:atomic-notify-one} to wake at least one thread waiting on value change of the handlers pointer, if there are any. Through this the exclusion that was observable in the first implementation is already avoided. If nobody is waiting, then the handlers will be set to a valid pointer and a thread may pass the atomic wait instruction later on. Following this wait, the handlers pointer is atomically exchanged \cite{cppreference:atomic-exchange} with nullptr, invalidating it. Each thread again checks whether it has received a valid local pointer to the handlers from the exchange. If it has then the atomic operation guarantees that is now in sole possession of the pointer. The owning thread is tasked with actually waiting. All other threads will now regress and call \texttt{CacheData::WaitOnCompletion} again. The solo thread may proceed to wait on the handlers and should update the cache pointer. \par
 Additional cases must be considered for the latter implementation to be safe and free of deadlocks. We will now discuss these edge cases and their resolution. 
 \subsubsection{Initial Invalid State}
 \label{subsubsec:impl:cdatomicity:initial-invalid-state}
 We previously mentioned the possibly problematic situation where both the cache pointer and the handlers are not yet available for an instance in \texttt{CacheData}. This situation is avoided explicitly by the implementation due to waiting on the handlers being atomically updated from nullptr to valid. When the handlers will be set in the future by the thread calling \texttt{Cache::Access} first, progress is guaranteed. \par
 \subsubsection{Invalid State on Immediate Destruction}
 The previous Section discussed the initial invalid state and noted that, as long as the handlers will be set in the future, progress is guaranteed. We now discuss the situation where handlers will not be set. This situation is encountered when a memory region is accessed by threads \(T_1\) and \(T_2\) concurrently. One will win the data race to add the entry to the cache state, we choose \(T_1\). \(T_2\) then must follow Section \ref{subsec:design:cache-entry-reuse} and return the entry already present in cache state. Therefore, \(T_2\) has to destroy the \texttt{CacheData} instance it created previously. \par
 The destructor of \texttt{CacheData} waits on operation completion in order to ensure that no running jobs require the cache memory region, before deallocating it. This necessitates usability of \texttt{CacheData::WaitOnCompletion} for the case of immediate destruction. As the instance of \texttt{CacheData} is destroyed immediately, no tasks will be submitted to the \gls{dsa} and therefore handlers never become available, leading to deadlock on destruction. \par
 To circumvent this deadlock, the initial state of \texttt{CacheData} was modified to be safe for deletion. An initialization function was added to \texttt{CacheData}, which is required to be called when the instance is to be used. \par
 \subsubsection{Invalid State on Operation Failure}
 \texttt{CacheData::WaitOnCompletion} first checks for a valid cache pointer and then waits on the handlers becoming valid. To process the handlers, the global atomic pointer is read into a local copy and then set to nullptr using \texttt{std::atomic<T>::exchange}. During evaluation of the handlers completion states, an unsuccessful operation may be found. In this case, the cache memory region remains invalid and may therefore not be used. In this case, both the handlers and the cache pointer will be nullptr. This results in an invalid state, like the one discussed in Section \ref{subsubsec:impl:cdatomicity:initial-invalid-state}. \par
 In this invalid state, progress is not guaranteed by the measures set forth to handle the initial invalidity. The cache is still nullptr and as the handlers have already been set and processed, they will also be nullptr without the chance of them ever becoming valid. \par
 Edge case handling is introduced and the cache pointer is set to the source address, providing validity. \par
 \subsubsection{Locally Invalid State due to Race Condition}
 Some two additional cases must be considered for the latter implementation to be safe. The wait operation first checks for a valid cache pointer and then waits on the handlers becoming valid. After processing the handlers, they are deleted and the pointer therefore invalidated. Should the cache pointer now be invalid as well, deadlocks would ensue. Therefore, the thread which exchanged the handlers pointer for a valid local copy must set the cache pointer to a valid value. Should one of the offloaded operations have failed, using the cache pointer is out of question as the datum it references might be invalid. The cache is set to the source address in this case. Secondly, after one thread has exchanged the pointer locally, threads may collect waiting on the handlers to become available. This can happen when the wait on the handlers takes sufficient amount of time during which both handlers and cache pointer are invalid. After waiting, the responsible thread must therefore signal all \cite{cppreference:atomic-notify-all} threads waiting on the handler to continue. \par
 The guarantee of \texttt{std::atomic<T>::wait} to only wake up when the value has changed \cite{cppreference:atomic-wait} was found to be stronger than the promise of waking up all waiting threads with \texttt{std::atomic<T>::notify\_all} \cite{cppreference:atomic-notify-all}. \par
 Two types of deadlocks were encountered during testing and have been accounted for. On one hand, it was found that the guarantee of \texttt{std::atomic<T>::wait} to only wake up when the value has changed \cite{cppreference:atomic-wait} is stronger than the promise of waking up all waiting threads with \texttt{std::atomic<T>::notify\_all} \cite{cppreference:atomic-notify-all}. The value of the handler pointer may therefore not be exchanged with nullptr which is the value we wait on. As the highest envisioned address requires the lower 52-bits of current 64-bit wide systems \cite[p. 120]{amd:programmers-manual} \cite[p. 4-2]{intel:programmers-manual} setting all bits of a 64-bit-value yields an invalid pointer which is used as the second invalid state possible. The second type was encountered when after creating a \texttt{CacheData} instance it was determined this exists in the cache already and dropped. As destruction waits on completion in order to ensure that no further jobs require the memory held, a deadlock would arise from the cache and handler pointers both being null and no handlers ever being set due to the instance being deleted immediately. To circumvent this, the constructor of \texttt{CacheData} was modified to point to source memory by default. Only after calling a separate initialization function will \texttt{CacheData} replace this with nullptr, therefore readying the instance for multithreaded usage. \par
 As visible in Figure \ref{fig:impl-cachedata-waitoncompletion}, we wait while the handlers-pointer is nullptr, if the cache pointer is invalid. To exemplify we use the following scenario. Both \(T_1\) and \(T_2\) call \texttt{CacheData::WaitOnCompletion}, with \(T_1\) preceding \(T_2\). \(T_1\) exchanges the global handlers pointer with nullptr, invalidating it. Before \(T_1\) can check the status of the handlers and update the cache pointer, \(T_2\) sees an invalid cache pointer and then waits for the handlers becoming available. \par
 \subsection{Performance Guideline}
 This has again caused a similar state of invalidity as the previous two Sections handled. As the handlers will not become available again due to being cleared by \(T_1\), the second consumer, \(T_2\), will now wait indefinitely. A solution for this is to not exchange the handlers pointer with nullptr but with a second invalid value. \par
 Atomic operations come with an added performance penalty. No recent studies were found on this, although we assume that the findings of Hermann Schweizer et al. in “Evaluating the Cost of Atomic Operations on Modern Architectures“ \cite{atomics-cost-analysis} still hold true for today's processor architectures. Due to the inherent cache synchronization mechanisms in place \cite[Subsection IV.A.3 Off-Die Access]{atomics-cost-analysis}, they observed significant access latency increase depending on whether the atomic variable was located on the local core, on a different core on the same chip or on another socket \cite[Fig. 4]{atomics-cost-analysis}. Reducing the cost of atomic accesses would require a less generic implementation, reducing some of the guarantees we give in \ref{sec:design:cache}. This would allow reducing the amount of atomics required but is outside the scope of this work. \par
 We must therefore determine a secondary invalid pointer. As the largest accessible memory location on modern 64-bit-systems requires only the lower 52-bits \cite[p. 120]{amd:programmers-manual} \cite[p. 4-2]{intel:programmers-manual} setting all bits of a 64-bit-value yields an inaccessible address which is therefore used as the second invalid state possible. Figure \ref{fig:impl-cachedata-waitoncompletion} refers to this as \enquote{uint64::max}. \par
 With the distributed locking described in \ref{subsec:implementation:cache-state-lock}, lock contention should not have a significant impact, although this remains to be tested. In addition to that, passive waiting at the contended section will benefit other threads and might allow overall progress to continue, if the application utilizing the cache has a threading model supporting this. These two factors lead us to classify lock contention as only a minor performance problem. \par
 This secondary value allows \(T_2\) to pass the wait, then perform the exchange of handlers itself. \(T_2\) then checks the local copy of the handlers pointer for validity. The invalid state now includes both nullptr and the secondary invalid pointer chosen. With this, the deadlock is avoided and \(T_2\) will wait for \(T_1\) completing the processing of the handlers. \par
 \section{Accelerator Usage}