\chapter{Implementation}
\label{chap:implementation}
% This chapter picks out a few selected, interesting aspects of the
% implementation. It must not be confused with documentation or even
% program comments. It can happen that a great number of aspects has to
% be covered, but that is not very common. One purpose of this chapter
% is to make it credible that the thesis deals with a real, existing
% system rather than a "paper tiger". It is certainly also a very
% important text for anyone who later continues the work. The third
% purpose is to give the reader a somewhat deeper insight into the
% technology dealt with here. Good examples are "war stories", i.e.
% things one particularly struggled with, or a concrete, exemplary
% refinement of one of the ideas presented in Chapter 3. Here too,
% nobody reads more than 20 pages, but in this case that is not a big
% problem, because the reader can simply stop reading without losing
% the thread. Complete source listings have no place in a thesis, not
% even in the appendix; they belong on a computer where they can be
% inspected.

In this chapter, we concentrate on specific implementation details, offering an in-depth view of how the design promises outlined in Chapter \ref{chap:design} are realized. First, we delve into the usage of locking and atomics to achieve thread safety and then apply the cache to \glsentrylong{qdp}, thereby detailing the policies mentioned in Section \ref{subsec:design:policy-functions} and presenting solutions for the challenges encountered.\par

\section{Synchronization for Cache and CacheData}

The use of locking and atomics to achieve safe concurrent access has proven challenging: their use is performance-critical, and mistakes may lead to deadlock. Consequently, these aspects constitute the most interesting part of the implementation, and this section therefore focuses on the synchronization techniques employed. \par
Throughout the following sections we will use the term \enquote{handler}, coined by \gls{intel:dml}, to refer to an object associated with an operation on the accelerator that exposes the completion state of the asynchronously executed task. The use of a handler is also shown in the \texttt{memcpy}-function for the \gls{dsa}, as presented in Figure \ref{fig:dml-memcpy}. As we may split one single copy into multiple distinct tasks for submission to multiple \gls{dsa}s, \texttt{CacheData} internally holds a vector of such handlers. \par
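
To make the structure of \texttt{CacheData} more tangible, the following sketch outlines the members relevant for this chapter. It is a simplified illustration, not the actual class definition; the member names and the placeholder handler type are assumptions made for this example. \par

\begin{verbatim}
// Simplified sketch of the CacheData members discussed in this
// chapter; names and types are illustrative, not the real definition.
// dml_handler stands in for the Intel DML handler type.
class CacheData {
    // source region this entry caches
    uint8_t* src_;
    size_t   size_;

    // pointer to the cached copy, shared between all copies of this
    // instance; nullptr until the copy operation has completed
    std::atomic<uint8_t*>* cache_;

    // handlers of the asynchronous DSA copy tasks, set once by the
    // submitting thread and consumed by the designated master thread
    std::atomic<std::vector<dml_handler>*>* handlers_;

    // shared reference count, see the section on shared references
    std::atomic<int32_t>* ref_count_;

public:
    void WaitOnCompletion();
    uint8_t* GetDataLocation() const;
};
\end{verbatim}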

\subsection{Cache: Locking for Access to State}
\label{subsec:impl:cache-state-lock}

To keep track of the current cache state, the \texttt{Cache} holds a reference to each currently existing \texttt{CacheData} instance. The reason for this is twofold: first, in Section \ref{sec:design:interface} we decided to keep elements in the cache until \gls{mempress} forces their removal; second, in Section \ref{subsec:design:cache-entry-reuse} we decided to reuse one cache entry for multiple consumers. The second design decision requires that the structure holding these references be thread safe when the cache state is accessed and modified in \texttt{Cache::Access}, \texttt{Cache::Flush} and \texttt{Cache::Clear}. Both the flush and clear operations necessitate unique locking, preventing other calls to \texttt{Cache} from making progress while the operation is being processed. For \texttt{Cache::Access}, the use of locking depends on the cache's state. At first, only a shared lock is acquired to check whether the given address already resides in the cache, allowing other \texttt{Cache::Access}-operations to perform this check concurrently. If no entry for the region is present, a unique lock is required to add the newly created entry to the cache. \par
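
The following sketch illustrates this locking pattern for \texttt{Cache::Access} using \texttt{std::shared\_mutex}. It is a simplified model of the described behaviour under assumed member and function names; notably, it omits \gls{mempress} handling and the per-node state discussed below. \par

\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <map>
#include <shared_mutex>

// Simplified sketch of the locking in Cache::Access; member and
// function names are assumptions, not the actual implementation.
class Cache {
    std::shared_mutex state_mutex_;
    std::map<uint8_t*, CacheData> state_;

    void SubmitCopyTasks(CacheData& entry);  // submission to the DSA

public:
    CacheData Access(uint8_t* data, size_t size) {
        {
            // shared lock: concurrent lookups are allowed
            std::shared_lock lock(state_mutex_);
            auto it = state_.find(data);
            if (it != state_.end()) {
                return it->second;  // copy shares the entry's state
            }
        }

        CacheData entry(data, size);
        bool submitter;
        {
            // unique lock: only one thread may modify the cache state
            std::unique_lock lock(state_mutex_);
            auto [it, inserted] = state_.try_emplace(data, entry);
            submitter = inserted;
            entry = it->second;  // adopt the entry actually in the state
        }

        // the lock is released before submission, so other threads may
        // already observe the entry before its handlers are set; only
        // the thread that actually inserted the entry submits the work
        if (submitter) {
            SubmitCopyTasks(entry);
        }
        return entry;
    }
};
\end{verbatim}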
A map was chosen as the data structure representing the current cache state, with the memory address of the entry as key and a \texttt{CacheData} instance as value. This realizes the promise of keeping entries cached whenever possible: with the \texttt{Cache} keeping an internal reference, the last reference to a \texttt{CacheData} instance is only destroyed upon removal from the cache state. This removal can either be triggered implicitly by flushing when \gls{mempress} is encountered during a cache access, or explicitly by calls to \texttt{Cache::Flush} or \texttt{Cache::Clear}. \par
With the caching policy being controlled by the user, one datum may be requested for caching in multiple locations. To accommodate this, one map is allocated for each available \glsentrylong{numa:node} of the system. This also reduces the potential for lock contention, as each \gls{numa:node}'s state is locked separately instead of through a global lock. Even with this optimization, scenarios in which the \texttt{Cache} is frequently tasked with flushing and re-caching by multiple threads on the same node will lead to threads contending for unique access to that node's state. \par

\subsection{CacheData: Shared Reference}
\label{subsec:impl:cachedata-ref}

The choice made in Section \ref{subsec:design:cache-entry-reuse} necessitates thread-safe shared access to one instance of \texttt{CacheData}. The C++ standard library provides \texttt{std::shared\_ptr<T>}, a reference-counted pointer that is thread-safe for the required operations \cite{cppreference:shared-ptr}, making it a suitable candidate for this task. Although an implementation using it was explored, it presented its own set of challenges. Using \texttt{std::shared\_ptr<T>} additionally introduces uncertainty by relying on the implementation to be performant: the standard does not specify whether a lock-free algorithm is to be used, and \cite{shared-ptr-perf} suggests abysmal performance for some implementations. Therefore, the decision was made to implement atomic reference counting for \texttt{CacheData} directly. This involves providing custom constructors and a destructor wherein a shared atomic integer is incremented or decremented using atomic fetch-add and fetch-sub operations to modify the reference count. If a decrement reduces the count to zero, the destructor was called for the last remaining reference and therefore performs the actual destruction. \par
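
A minimal sketch of this reference counting scheme is shown below. It is illustrative only and assumes simplified member names; the actual class additionally manages the handlers and cache pointer discussed in the following sections. \par

\begin{verbatim}
#include <atomic>
#include <cstddef>
#include <cstdint>

// Illustrative sketch of atomic reference counting for CacheData;
// member names are assumptions, not the actual implementation.
class CacheData {
    uint8_t* src_;
    size_t   size_;

    // shared between all copies referring to the same cache entry
    std::atomic<int32_t>* ref_count_;

public:
    CacheData(uint8_t* data, size_t size)
        : src_(data), size_(size),
          ref_count_(new std::atomic<int32_t>(1)) {}

    CacheData(const CacheData& other)
        : src_(other.src_), size_(other.size_),
          ref_count_(other.ref_count_) {
        // atomically increment the shared count for the new copy
        ref_count_->fetch_add(1, std::memory_order_relaxed);
    }

    ~CacheData() {
        // atomically decrement; the thread that brings the count to
        // zero held the last reference and performs the destruction
        if (ref_count_->fetch_sub(1, std::memory_order_acq_rel) == 1) {
            // wait for outstanding operations, free cache memory ...
            delete ref_count_;
        }
    }
};
\end{verbatim}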

\subsection{CacheData: Fair and Threadsafe WaitOnCompletion}

As Section \ref{subsec:design:cache-entry-reuse} details, we intend to share a cache entry between multiple threads, necessitating synchronization between multiple instances of \texttt{CacheData} for waiting on operation completion and determining the cache location. While the latter is easily achieved by making the cache pointer a shared atomic value, the former proved to be challenging. Therefore, we will iteratively develop our implementation for \texttt{CacheData::WaitOnCompletion} in this Section. We present the challenges encountered and describe how these were solved to achieve the fairness and thread safety we desire. \par
We assume the handlers of \gls{intel:dml} to be unsafe for access from multiple threads, as no guarantees were found in the documentation. To achieve the safety for \texttt{CacheData::WaitOnCompletion}, outlined in Section \ref{subsec:design:cache-entry-reuse}, threads need to coordinate on a master thread which performs the actual waiting, while the others wait on the master. \par
Upon a call to \texttt{Cache::Access}, coordination must take place so that only one instance of \texttt{CacheData} is added to the cache state and, most importantly, the work is submitted to the \gls{dsa} only once. This behaviour was chosen in Section \ref{subsec:design:cache-entry-reuse} to reduce the load placed on the accelerator by preventing duplicate submission. To achieve this, \texttt{Cache::Access} adds the instance to the cache state under unique lock (see Section \ref{subsec:impl:cache-state-lock} above); the thread that wins this data race and successfully adds the new entry is then the one to perform the submission. Through this two-step process, the state may temporarily contain a \texttt{CacheData} instance without a valid cache pointer and without handlers available to wait on. As the following paragraphs and sections demonstrate, this resulted in implementation challenges. \par
\begin{figure}[t]
\centering
\includegraphics[width=0.9\textwidth]{images/seq-blocking-wait.pdf}
\caption{Sequence for Blocking Scenario. Observable in the first draft of the implementation. Scenario where \(T_1\) performed the first access to a datum, followed by \(T_2\). \(T_2\) calls \texttt{CacheData::WaitOnCompletion} before \(T_1\) submits the work and adds the handlers to the shared instance. Thereby, \(T_2\) gets stuck waiting on the cache pointer to be updated, making \(T_1\) responsible for permitting \(T_2\) to progress.}
\label{fig:impl-cachedata-threadseq-waitoncompletion}
\end{figure}
In the first implementation, a thread would check whether the handlers are available and, if they are not, wait on a value change from \texttt{nullptr}. As the handlers only become available after submission, a situation could arise in which only one copy of \texttt{CacheData} is capable of actually waiting on them. To illustrate this, an exemplary scenario is used, shown in the sequence diagram in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}. Assume that two threads \(T_1\) and \(T_2\) wish to access the same resource. \(T_1\) is the first to call \texttt{Cache::Access}, therefore adds the new entry to the cache state and will perform the work submission. Before \(T_1\) can submit the work, it is interrupted, and \(T_2\) obtains access to the incomplete \texttt{CacheData} on which it calls \texttt{CacheData::WaitOnCompletion}, observing \texttt{nullptr} for both the cache pointer and the handlers. \(T_2\) then waits on the cache pointer becoming valid, after which \(T_1\) submits the work and sets the handlers. Therefore, only \(T_1\) can trigger the wait on the handlers and is thereby capable of keeping \(T_2\) from progressing. This is undesirable, as it may lead to deadlock if \(T_1\) does not wait, and at the very least causes unnecessary delay for \(T_2\). \par

\pagebreak

\begin{figure}[t]
\centering
\includegraphics[width=0.75\textwidth]{images/nsd-cachedata-waitoncompletion.pdf}
\caption{\texttt{CacheData::WaitOnCompletion} Pseudocode. Final rendition of the implementation for a fair wait function.}
\label{fig:impl-cachedata-waitoncompletion}
\end{figure}
To resolve this, a more intricate implementation is required, for which Figure \ref{fig:impl-cachedata-waitoncompletion} shows pseudocode. When waiting, a thread now immediately checks whether the cache pointer contains a valid value and returns if it does. We use the same example as before to illustrate the second part of the waiting procedure. Thread \(T_2\) arrives in this latter section because the cache pointer was invalid at the time waiting was called for. It now waits on the handlers-pointer to change. When \(T_1\) supplies the handlers after submitting the work, it also uses \texttt{std::atomic<T>::notify\_one} to wake at least one thread waiting on a value change of the handlers-pointer. Through this, the exclusion observable in the first implementation is already avoided: the handlers are atomically set to a valid pointer, and a thread may pass the wait. Following this, the handlers-pointer is atomically exchanged, invalidating it and assigning the previously valid value to a local variable. Each thread checks whether it received a valid pointer to the handlers from this exchange. If it did, the atomic operation guarantees that it is now in sole possession of the pointer and is the designated master thread. The master is tasked with the actual waiting, while the other threads return to the start and call \texttt{CacheData::WaitOnCompletion} again, now waiting on the master thread setting the cache pointer to a valid value. \par

\subsection{CacheData: Edge Cases and Deadlocks}
\label{subsec:impl:cachedata-deadlocks}

With the outlines of a fair implementation of \texttt{CacheData::WaitOnCompletion} drawn, we now move our focus to the safety of \texttt{CacheData}. Specifically, the following Sections discuss possible deadlocks and their resolution. \par

\subsubsection{Initial Invalid State}
\label{subsubsec:impl:cachedata:initial-invalid-state}

We previously mentioned the possibly problematic situation in which both the cache pointer and the handlers are not yet available for an instance of \texttt{CacheData}. This situation is explicitly avoided by the implementation, as threads wait on the handlers being atomically updated from \texttt{nullptr} to a valid pointer, which can be seen in Figure \ref{fig:impl-cachedata-waitoncompletion}. As long as the handlers will be set in the future by the thread that first called \texttt{Cache::Access}, progress is guaranteed. \par

\subsubsection{Invalid State on Immediate Destruction}

The previous Section discussed the initial invalid state and noted that, as long as the handlers will be set in the future, progress is guaranteed. We now discuss the situation in which the handlers will never be set. This situation is encountered when a memory region is accessed by threads \(T_1\) and \(T_2\) concurrently. One of them wins the data race to add the entry to the cache state; we choose \(T_1\). \(T_2\) must then follow Section \ref{subsec:design:cache-entry-reuse} and return the entry already present in the cache state. Therefore, \(T_2\) has to destroy the \texttt{CacheData} instance it created previously. \par
The destructor of \texttt{CacheData} waits on operation completion in order to ensure that no running jobs still require the cache memory region before deallocating it. This necessitates that \texttt{CacheData::WaitOnCompletion} is usable in the case of immediate destruction. As the instance of \texttt{CacheData} is destroyed immediately, no tasks will be submitted to the \gls{dsa}, and the handlers therefore never become available, leading to a deadlock on destruction due to the wait on the handlers shown in Figure \ref{fig:impl-cachedata-waitoncompletion}. To circumvent this, the initial state of \texttt{CacheData} was modified to be safe for deletion, and an initialization function was added to \texttt{CacheData} which must be called before the instance is actually used. \par
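
One conceivable way to realize this deferred initialization is sketched below. It is only an illustration of the idea, not necessarily the mechanism actually used: the constructor leaves the instance in a state in which \texttt{WaitOnCompletion} and the destructor return immediately, and only the initialization function arms the wait logic. \par

\begin{verbatim}
// Hypothetical sketch of a deletion-safe initial state; one possible
// realization of the idea, not the actual implementation.
class CacheData {
    std::atomic<uint8_t*>* cache_;

public:
    CacheData(uint8_t* data, size_t size) {
        // initially the "cache" points at the source data, so both
        // WaitOnCompletion and the destructor return immediately
        cache_ = new std::atomic<uint8_t*>(data);
        // ... store data and size, allocate the reference count ...
    }

    void Init() {
        // called only for the entry that was actually added to the
        // cache state and will receive a submission; resets the cache
        // pointer so that waiting becomes effective
        cache_->store(nullptr);
    }
};
\end{verbatim}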

\subsubsection{Invalid State on Operation Failure}

\texttt{CacheData::WaitOnCompletion} first checks for a valid cache pointer and then waits on the handlers becoming valid. To process the handlers, the global atomic pointer is read into a local copy and then set to \texttt{nullptr} using \texttt{std::atomic<T>::exchange}. During the evaluation of the handlers' completion states, an unsuccessful operation may be encountered. In this case, the cache memory region remains invalid and may not be used, leaving both the handlers and the cache pointer as \texttt{nullptr}. This produces an invalid state like the one discussed in Section \ref{subsubsec:impl:cachedata:initial-invalid-state}. In this state, however, progress is not guaranteed by the measures set forth to handle the initial invalidity: the cache pointer is still \texttt{nullptr}, and as the handlers have already been set and processed, they too will remain \texttt{nullptr} without any chance of becoming valid again. The employed solution can be seen in Figure \ref{fig:impl-cachedata-waitoncompletion}: after processing the handlers, we check their results for errors, and upon encountering one, the cache pointer is set to the data source. Other threads may have accumulated waiting for the cache pointer to become valid, as the handlers were already invalidated before. By setting the cache pointer to a valid value, these threads are released, and any subsequent call to \texttt{CacheData::WaitOnCompletion} will return immediately. As a consequence, the \texttt{Cache} does not guarantee that data will actually be cached; however, as Chapter \ref{chap:design} does not make such a promise, we decided to handle this edge case in this way. \par

\subsubsection{Locally Invalid State due to Race Condition}

The guarantee of \texttt{std::atomic<T>::wait} to only wake up when the value has changed \cite{cppreference:atomic-wait} was found to be stronger than the promise of \texttt{std::atomic<T>::notify\_all} to wake up all waiting threads \cite{cppreference:atomic-notify-all}. As visible in Figure \ref{fig:impl-cachedata-waitoncompletion}, we wait while the handlers-pointer is \texttt{nullptr} if the cache pointer is invalid. To exemplify the problem, we use the following scenario. Both \(T_1\) and \(T_2\) call \texttt{CacheData::WaitOnCompletion}, with \(T_1\) preceding \(T_2\). \(T_1\) exchanges the global handlers-pointer with \texttt{nullptr}, invalidating it. Before \(T_1\) can check the status of the handlers and update the cache pointer, \(T_2\) sees an invalid cache pointer and therefore waits on the handlers becoming available. This again causes a state of invalidity similar to those handled in the previous two Sections. As the handlers will not become available again after being cleared by \(T_1\), the second consumer, \(T_2\), will now wait indefinitely. This missed update is commonly referred to as the \enquote{ABA-Problem}, for which multiple solutions exist. \par
One could use double-width atomic operations, which would allow resetting the pointer back to \texttt{nullptr} while simultaneously setting a flag indicating that the exchange took place. The handlers-pointer would then be contained in a struct together with this flag, allowing an exchange with the composite of \texttt{nullptr} and flag-set. Other threads would then wait on the struct changing from \texttt{nullptr} with flag-unset, allowing them to pass if either the flag is set or the handlers have become non-null. As standard C++ does not yet support the required operations \cite{dwcas-cpp}, we chose to avoid the missed update differently. \par
The chosen solution is to exchange the handlers-pointer not with \texttt{nullptr} but with a second, dedicated invalid pointer. To obtain a value for this exchange, a new attribute is introduced to \texttt{Cache}, whose address serves as this secondary invalid pointer. The address is shared with \texttt{CacheData} upon creation and is then used in \texttt{CacheData::WaitOnCompletion} in place of \texttt{nullptr} for the exchange. This secondary value allows \(T_2\) to pass the wait and perform the exchange of the handlers itself. \(T_2\) then checks its local copy of the handlers-pointer for validity, where the invalid state now includes both \texttt{nullptr} and the chosen secondary invalid pointer. With this, the deadlock is avoided, and \(T_2\) will wait for \(T_1\) to complete the processing of the handlers by entering the false-branch of the validity check for \enquote{local\_handlers}, as shown in Figure \ref{fig:impl-cachedata-waitoncompletion}. \par
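
Putting the previous refinements together, the following C++ sketch models the final waiting logic of Figure \ref{fig:impl-cachedata-waitoncompletion}. It is an illustrative simplification under assumed member names (e.g. \texttt{invalid\_handlers\_} for the secondary invalid pointer and \texttt{incomplete\_cache\_} for the destination buffer), not the actual implementation. \par

\begin{verbatim}
// Illustrative model of the final wait logic; member names such as
// cache_, handlers_, invalid_handlers_, incomplete_cache_ and src_
// are assumptions, not the actual implementation.
void CacheData::WaitOnCompletion() {
    // fast path: the cached copy (or fallback) is already available
    if (cache_->load() != nullptr) return;

    // wait until the submitting thread has set the handlers; the
    // pointer becomes either valid or the secondary invalid pointer
    handlers_->wait(nullptr);

    // attempt to take sole ownership of the handlers by exchanging
    // with the secondary invalid pointer instead of nullptr
    auto* local_handlers = handlers_->exchange(invalid_handlers_);

    if (local_handlers == nullptr ||
        local_handlers == invalid_handlers_) {
        // not the master thread: wait on the master publishing the
        // cache pointer, then return
        cache_->wait(nullptr);
        return;
    }

    // master thread: wait on all copy tasks of this entry and check
    // their completion status; succeeded() stands in for retrieving
    // and checking the operation result through the handler
    bool error = false;
    for (auto& handler : *local_handlers) {
        if (!handler.succeeded()) {
            error = true;
        }
    }
    delete local_handlers;

    // on failure, fall back to the data source so that waiting
    // threads are released and subsequent calls return immediately
    cache_->store(error ? src_ : incomplete_cache_);
    cache_->notify_all();
}
\end{verbatim}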

\section{Application to \glsentrylong{qdp}}
\label{sec:impl:application}

Applying the \texttt{Cache} to \gls{qdp} is a straightforward process. We adapted the benchmarking code developed by Anna Bartuschka and André Berthold \cite{dimes-prefetching}, invoking \texttt{Cache::Access} for both prefetching and cache access. Due to the high number of small submissions, we decided to forgo splitting a single copy onto multiple \gls{dsa}s and instead distribute the copy tasks over the available \gls{dsa}s in a round-robin fashion per thread. This reduces the delay from submission cost which, as shown in Section \ref{subsec:perf:submitmethod}, rises for smaller tasks. The cache location is fixed to \gls{numa:node} 8, the \gls{hbm} node associated with \gls{numa:node} 0, to which the application is bound and on which it therefore exclusively runs. \par
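
The following fragment sketches such a per-thread round-robin assignment of copy tasks to \gls{dsa} nodes. It is a simplified illustration; the function name, variable names and the set of available nodes are assumptions. \par

\begin{verbatim}
#include <array>
#include <cstddef>

// Illustrative per-thread round-robin selection of the DSA node used
// for the next copy task; names and node set are assumptions.
static constexpr std::array<int, 4> dsa_nodes = { 0, 1, 2, 3 };

int NextDsaNode() {
    // one counter per thread: no synchronization is needed, and each
    // thread cycles through the nodes independently
    thread_local std::size_t counter = 0;
    return dsa_nodes[counter++ % dsa_nodes.size()];
}
\end{verbatim}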
During the performance analysis of the developed \texttt{Cache}, we discovered that \gls{intel:dml} does not utilize interrupt-based completion signalling (Section \ref{subsubsec:state:completion-signal}), but instead busy-waits on the completion descriptor being updated. Given that this busy waiting consumes CPU cycles, waiting on task completion is deemed impractical, necessitating code modifications. We extended \texttt{CacheData} and \texttt{Cache} to support the weak waiting and weak access introduced in Section \ref{subsec:design:weakop-and-pf}. Using these two options, we can avoid work submission and busy waiting where access latency is paramount. \par
Additionally, we observed inefficiencies stemming from page fault handling. Task execution time increases when page faults are handled by the \gls{dsa}, which leads to cache misses, and consequently the execution time becomes bound to that of \gls{dram}, as misses prompt a fallback to the data's source location. When page faults are instead handled by the CPU during allocation, these misses are avoided; however, the execution time of the first data access through the \texttt{Cache} then increases significantly due to page fault handling. One potential solution is to bypass the system's memory management by allocating a large memory block and implementing a custom memory management scheme on top of it. As memory allocation is a complex topic, we opted in Section \ref{subsec:design:policy-functions} to delegate this responsibility to the user. Consequently, the benchmark can pre-allocate the required memory blocks, trigger page mapping, and subsequently pass these regions to the \texttt{Cache}. When the memory management functions guarantee that pages are mapped, or when access latency is more important than using the cache, the user can then elect to forgo enabling page fault handling on the \gls{dsa}, as made possible by Section \ref{subsec:design:weakop-and-pf}. \par
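
As an example of this user-side pre-allocation, the following sketch allocates a buffer on a chosen \gls{numa:node} and touches every page once, so that the mapping is established before the region is handed to the \texttt{Cache}. It assumes the availability of \texttt{libnuma}; the page size and node number are illustrative. \par

\begin{verbatim}
// Sketch of user-side pre-allocation with forced page mapping;
// assumes libnuma (numa.h) is available, node number is illustrative.
#include <cstddef>
#include <cstdint>
#include <numa.h>

uint8_t* AllocatePreMapped(std::size_t size, int node) {
    // allocate memory bound to the given NUMA node
    auto* buffer = static_cast<uint8_t*>(numa_alloc_onnode(size, node));
    if (buffer == nullptr) return nullptr;

    // touch one byte per page so the CPU handles the page faults now,
    // not the DSA or the first access through the Cache later
    const std::size_t page_size = 4096;  // assumed; query via sysconf
    for (std::size_t offset = 0; offset < size; offset += page_size) {
        buffer[offset] = 0;
    }
    return buffer;
}
\end{verbatim}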

%%% Local Variables:
%%% TeX-master: "diplom"
%%% End: