4 commits

Changed files:

  1. thesis/bachelor.pdf (binary)
  2. thesis/bachelor.tex (3 changes)
  3. thesis/content/40_design.tex (9 changes)
  4. thesis/content/50_implementation.tex (36 changes)

thesis/bachelor.pdf (binary file, diff not shown)

thesis/bachelor.tex (3 changes)

@@ -66,14 +66,11 @@ plainpages=false,pdfpagelabels=true]{hyperref}
\cleardoublepage
\listoffigures
\todo{ensure figure placement at top and before first reference}
\cleardoublepage
\listoftables
\cleardoublepage
\listoftodos
\todo{remove me for final version}
\cleardoublepage
\pagenumbering{arabic}

thesis/content/40_design.tex (9 changes)

@@ -57,6 +57,15 @@ When multiple consumers wish to access the same memory block through the \texttt
Allowing multiple references to the same entry introduces concerns regarding memory management. The allocated block should only be freed when all copies of a \texttt{CacheData} instance are destroyed, thereby tying the cache entry's lifetime to the longest-living copy of the original instance. This ensures that access to the entry is legal during the lifetime of any \texttt{CacheData} instance, with deallocation occurring only upon destruction of the last remaining copy. \par
\subsection{Weak Behaviour and Page Fault Handling}
\label{subsec:design:weakop-and-pf}
During our testing phase, we discovered that \gls{intel:dml} does not support interrupt-based completion signalling, as discussed in Section \ref{subsubsec:state:completion-signal}. Instead, it resorts to busy-waiting, which consumes CPU cycles; interrupt-based signalling, by contrast, is primarily beneficial for reducing power consumption during copies \cite{intel:analysis}. To mitigate this issue, we extended the functionality of both \texttt{Cache} and \texttt{CacheData}, providing weak versions of \texttt{Cache::Access} and \texttt{CacheData::WaitOnCompletion}. The weak wait function checks only once for operation completion and then returns, relaxing the guarantee that the cache location will be valid after the call. Hence, the user must verify validity even after the wait function returns if they choose this option. Similarly, weak access merely returns a pre-existing instance of \texttt{CacheData}, bypassing caching. This proves beneficial in latency-sensitive scenarios where the overhead of cache operations and of waiting on operation completion is undesirable. \par
Additionally, while optimizing for access latency, we encountered delays caused by page fault handling on the \gls{dsa}, as depicted in Figure \ref{fig:dml-memcpy}. These delays not only affect the current task but also impede the progress of other tasks on the \gls{dsa}. Consequently, the default behaviour of the cache is set to trigger an error on page faults, while still offering the option to let the \gls{dsa} handle the fault. \par
To configure runtime behaviour, we introduced a flag system to both \texttt{Cache} and \texttt{CacheData}, with the latter inheriting any flags set in the former upon creation. This design allows for global settings, such as opting for weak waits or enabling page fault handling. Weak waits can also be selected in specific situations by setting the flag on the \texttt{CacheData} instance. For \texttt{Cache::Access}, the flag must be set for each function call, defaulting to strong access, as exclusively using weak access would result in no cache utilization. \par
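To illustrate the intended use of this flag system, consider the following minimal sketch. The flag names, the setter and the exact signature of \texttt{Cache::Access} shown here are illustrative assumptions and do not necessarily match the actual interface. \par
\begin{verbatim}
// assumed flag names and setter, for illustration only
Cache cache;
cache.SetFlags(FLAG_WAIT_WEAK | FLAG_HANDLE_PAGE_FAULT); // global defaults

// strong access (default): may create an entry and submit a copy task
auto data = cache.Access(src, size);

// weak access: must be requested per call, only returns a
// pre-existing entry and never submits work
auto cached = cache.Access(src, size, FLAG_ACCESS_WEAK);

if (cached) {
    cached->WaitOnCompletion();        // weak wait: checks only once
    uint8_t* loc = cached->GetDataLocation();
    if (loc == nullptr) { /* fall back to the source location */ }
}
\end{verbatim}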
\section{Usage Restrictions}
\label{sec:design:restrictions}

thesis/content/50_implementation.tex (36 changes)

@@ -40,7 +40,7 @@ Even with this optimization, in scenarios where the \texttt{Cache} is frequently
\subsection{CacheData: Shared Reference}
\label{subsec:impl:cachedata-ref}
The choice made in \ref{subsec:design:cache-entry-reuse} necessitates thread-safe shared access to the same resource. The C++ standard library provides \texttt{std::shared\_ptr<T>}, a reference-counted pointer that is thread-safe for the required operations \cite{cppreference:shared-ptr}, making it a suitable candidate for this task. Although an implementation using it was explored, it presented its own set of challenges. Using \texttt{std::shared\_ptr<T>} additionally introduces uncertainty, relying on the implementation to be performant. The standard does not specify whether a lock-free algorithm is to be used, and \cite{shared-ptr-perf} suggests abysmal performance for some implementations. Therefore, the decision was made to implement atomic reference counting for \texttt{CacheData}. This involves providing a custom constructor and destructor wherein a shared atomic integer is either incremented or decremented using atomic fetch sub and add operations to modify the reference count. In the case of a decrease to zero, the destructor was called for the last reference and then performs the actual destruction. \par
The choice made in \ref{subsec:design:cache-entry-reuse} necessitates thread-safe shared access to one instance of \texttt{CacheData}. The C++ standard library provides \texttt{std::shared\_ptr<T>}, a reference-counted pointer that is thread-safe for the required operations \cite{cppreference:shared-ptr}, making it a suitable candidate for this task. Although an implementation using it was explored, it presented its own set of challenges. Using \texttt{std::shared\_ptr<T>} additionally introduces uncertainty, relying on the implementation to be performant. The standard does not specify whether a lock-free algorithm is to be used, and \cite{shared-ptr-perf} suggests abysmal performance for some implementations. Therefore, the decision was made to implement atomic reference counting for \texttt{CacheData}. This involves providing a custom constructor and destructor wherein a shared atomic integer is incremented or decremented using atomic fetch-add and fetch-sub operations to modify the reference count. If the decrement brings the count to zero, the destructor is executing on behalf of the last remaining reference and therefore performs the actual destruction. \par
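A minimal sketch of this scheme follows; the member names and the use of \texttt{free} are illustrative assumptions, and error handling as well as all remaining members are omitted. \par
\begin{verbatim}
#include <atomic>
#include <cstdint>
#include <cstdlib>

class CacheData {
    std::atomic<int32_t>* reference_count_; // shared by all copies
    uint8_t* cache_location_;               // cached memory block
public:
    CacheData(const CacheData& other)
        : reference_count_(other.reference_count_),
          cache_location_(other.cache_location_) {
        // another copy now shares the entry
        reference_count_->fetch_add(1, std::memory_order_relaxed);
    }
    ~CacheData() {
        // fetch_sub returns the previous value: the last copy sees 1
        if (reference_count_->fetch_sub(1,
                std::memory_order_acq_rel) == 1) {
            WaitOnCompletion();  // no running task may use the block
            std::free(cache_location_);
            delete reference_count_;
        }
    }
    void WaitOnCompletion();
};
\end{verbatim}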
\subsection{CacheData: Fair and Threadsafe WaitOnCompletion}
@@ -57,9 +57,7 @@ Upon call to \texttt{Cache::Access}, coordination must again take place to only
\label{fig:impl-cachedata-threadseq-waitoncompletion}
\end{figure}
In the first implementation, a thread would check if the handlers are available and atomically wait on a value change from \texttt{nullptr}, if they are not. As the handlers are only available after submission, a situation could arise where only one copy of \texttt{CacheData} is capable of actually waiting on them. To illustrate this, an exemplary scenario is used, as seen in the sequence diagram Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}. Assume that three threads \(T_1\), \(T_2\) and \(T_3\) wish to access the same resource. \(T_1\) is the first to call \texttt{CacheData::Access} and therefore adds it to the cache state and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted and \(T_2\) and \(T_3\) obtain access to the incomplete \texttt{CacheData} on which they wait, causing them to see a \texttt{nullptr} for the handlers but invalid cache pointer, leading to atomic wait on the cache pointer (marked blue lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). \(T_1\) submits the work and sets the handlers (marked red lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) and \(T_3\) continue to wait. Therefore, only \(T_1\) can trigger the waiting and is therefore capable of keeping \(T_2\) and \(T_3\) from progressing. This is undesirable as it can lead to deadlocking if by some reason \(T_1\) does not wait and at the very least may lead to unnecessary delay for \(T_2\) and \(T_3\) if \(T_1\) does not wait immediately. \par
\todo{paragraph above complicated to read}
In the first implementation, a thread would check if the handlers are available and, if they are not, atomically wait on a value change from \texttt{nullptr}. As the handlers only become available after submission, a situation could arise where only one copy of \texttt{CacheData} is capable of actually waiting on them. To illustrate this, consider the scenario shown in the sequence diagram in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}. Assume that three threads \(T_1\), \(T_2\) and \(T_3\) wish to access the same resource. \(T_1\) is the first to call \texttt{Cache::Access}, therefore adds the entry to the cache state and will perform the work submission. Before \(T_1\) can submit the work, it is interrupted, and \(T_2\) and \(T_3\) obtain access to the incomplete \texttt{CacheData}, seeing \texttt{nullptr} for both the cache pointer and the handlers. \(T_2\) and \(T_3\) therefore wait on the cache pointer becoming valid (marked blue in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). \(T_1\) then submits the work and sets the handlers (marked red in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) and \(T_3\) continue to wait on the cache pointer. Thus, only \(T_1\) can wait on the handlers, making it capable of keeping \(T_2\) and \(T_3\) from progressing. This is undesirable, as it can lead to deadlock if \(T_1\) never waits and, at the very least, to unnecessary delay for \(T_2\) and \(T_3\). \par
\begin{figure}[!t]
\centering
@@ -68,9 +66,7 @@ In the first implementation, a thread would check if the handlers are available
\label{fig:impl-cachedata-waitoncompletion}
\end{figure}
As a solution for this, a more intricate implementation is required. When waiting, the threads now immediately check whether the cache pointer contains a valid value and return if it does, as nothing has to be waited for in this case. We will use the same example as before to illustrate the second part of the waiting procedure. Both \(T_2\) and \(T_3\) arrive in this latter section as the cache was invalid at the point in time when waiting was called for. They now atomically wait on the handlers-pointer to change, instead of doing it the other way around as before. Now when \(T_1\) supplies the handlers, it also uses \texttt{std::atomic<T>::notify\_one} to wake at least one thread waiting on value change of the handlers-pointer, if there are any. Through this the exclusion that was observable in the first implementation is already avoided. If nobody is waiting, then the handlers will be set to a valid pointer and a thread may pass the atomic wait instruction later on. Following this wait, the handlers-pointer is atomically exchanged with \texttt{nullptr}, invalidating it. Each thread again checks whether it has received a valid local pointer to the handlers from the exchange. If it has then the atomic operation guarantees that is now in sole possession of the pointer. The owning thread is tasked with actually waiting. All other threads will now regress and call \texttt{CacheData::WaitOnCompletion} again. The solo thread may proceed to wait on the handlers and should update the cache pointer. \par
\todo{complicated formulation too, write it with more references to the pseudocode}
As a solution, a more intricate implementation is required, for which Figure \ref{fig:impl-cachedata-waitoncompletion} shows pseudocode. When waiting, a thread now immediately checks whether the cache pointer contains a valid value and returns if it does. We use the same example as before to illustrate the second part of the waiting procedure. Both \(T_2\) and \(T_3\) arrive in this latter section, as the cache pointer was invalid at the time they called the wait. They now wait on the handlers-pointer to change. When \(T_1\) supplies the handlers after submitting work, it also uses \texttt{std::atomic<T>::notify\_one} to wake at least one thread waiting on a value change of the handlers-pointer. This already avoids the exclusion observable in the first implementation. The handlers are atomically set to a valid pointer, allowing a thread to pass the wait. Following this, the handlers-pointer is atomically exchanged, invalidating it and assigning the previous value to a local variable. Each thread checks whether it received a valid handlers-pointer from the exchange. If it did, the atomicity of the exchange guarantees that it is now in sole possession of the pointer, making it the designated master thread. The master is tasked with waiting, while all other threads regress and call \texttt{CacheData::WaitOnCompletion} again, where they wait for the master to set the cache pointer to a valid value. \par
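The following sketch gives one possible rendering of this procedure in C++; the member names, the handler type and the helper \texttt{WaitOnHandlers} are assumptions for illustration and do not reproduce the actual implementation. Note that exchanging with \texttt{nullptr}, as shown here, still permits the missed update discussed later in this chapter: a thread arriving after the exchange blocks on handlers that will never be restored. \par
\begin{verbatim}
// assumed members: std::atomic<uint8_t*>* cache_;
//                  std::atomic<handler_list*>* handlers_;
void CacheData::WaitOnCompletion() {
    while (cache_->load() == nullptr) {
        // block while the handlers-pointer is still nullptr
        handlers_->wait(nullptr);
        // try to take sole ownership of the handlers
        handler_list* local = handlers_->exchange(nullptr);
        if (local == nullptr) {
            // not the master: regress, wait for the cache pointer
            cache_->wait(nullptr);
            continue;
        }
        // master thread: performs the actual wait on the DSA handlers
        WaitOnHandlers(local);           // assumed helper
        cache_->store(final_location_);  // publish the valid location
        cache_->notify_all();            // wake the regressed waiters
    }
}
\end{verbatim}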
\subsection{CacheData: Edge Cases and Deadlocks}
\label{subsec:impl:cachedata-deadlocks}
@@ -80,15 +76,13 @@ With the outlines of a fair implementation of \texttt{CacheData::WaitOnCompletio
\subsubsection{Initial Invalid State}
\label{subsubsec:impl:cachedata:initial-invalid-state}
We previously mentioned the possibly problematic situation where both the cache pointer and the handlers are not yet available for an instance in \texttt{CacheData}. This situation is avoided explicitly by the implementation due to waiting on the handlers being atomically updated from \texttt{nullptr} to valid. When the handlers will be set in the future by the thread calling \texttt{Cache::Access} first, progress is guaranteed. \par
We previously mentioned the possibly problematic situation where neither the cache pointer nor the handlers are yet available for an instance of \texttt{CacheData}. The implementation avoids this situation explicitly, as can be seen in Figure \ref{fig:impl-cachedata-waitoncompletion}: threads wait on the handlers being atomically updated from \texttt{nullptr} to a valid pointer. As long as the handlers are eventually set by the thread that first called \texttt{Cache::Access}, progress is guaranteed. \par
\subsubsection{Invalid State on Immediate Destruction}
The previous Section discussed the initial invalid state and noted that, as long as the handlers will be set in the future, progress is guaranteed. We now discuss the situation where handlers will not be set. This situation is encountered when a memory region is accessed by threads \(T_1\) and \(T_2\) concurrently. One will win the data race to add the entry to the cache state, we choose \(T_1\). \(T_2\) then must follow Section \ref{subsec:design:cache-entry-reuse} and return the entry already present in cache state. Therefore, \(T_2\) has to destroy the \texttt{CacheData} instance it created previously. \par
The destructor of \texttt{CacheData} waits on operation completion in order to ensure that no running jobs require the cache memory region, before deallocating it. This necessitates usability of \texttt{CacheData::WaitOnCompletion} for the case of immediate destruction. As the instance of \texttt{CacheData} is destroyed immediately, no tasks will be submitted to the \gls{dsa} and therefore handlers never become available, leading to deadlock on destruction. \par
To circumvent this deadlock, the initial state of \texttt{CacheData} was modified to be safe for deletion. An initialization function was added to \texttt{CacheData}, which is required to be called when the instance is to be used. \par
The destructor of \texttt{CacheData} waits on operation completion in order to ensure that no running jobs require the cache memory region before deallocating it. This necessitates that \texttt{CacheData::WaitOnCompletion} remains usable in the case of immediate destruction. As the instance of \texttt{CacheData} is destroyed immediately, no tasks will be submitted to the \gls{dsa}, and the handlers therefore never become available, leading to deadlock on destruction due to the wait on the handlers shown in Figure \ref{fig:impl-cachedata-waitoncompletion}. To circumvent this, the initial state of \texttt{CacheData} was modified to be safe for deletion, and an initialization function was added to \texttt{CacheData} which must be called before the instance is actually used. \par
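The following sketch illustrates why this two-phase initialization is required inside \texttt{Cache::Access}; the map-based cache state and all names shown are assumptions for illustration, not the verbatim implementation. \par
\begin{verbatim}
// simplified excerpt of Cache::Access; state_ is assumed to be a
// mutex-protected map from source address to CacheData
std::unique_lock lock(state_mutex_);
const auto [it, inserted] = state_.try_emplace(src, src, size);
if (!inserted) {
    // lost the race: reuse the existing entry (see the design
    // section on cache entry reuse); the locally created CacheData
    // is destroyed without any work ever being submitted, which its
    // deletion-safe initial state must survive
    return it->second;
}
it->second.Init();            // arm waiting semantics for actual use
SubmitCopyTasks(it->second);  // handlers become available afterwards
return it->second;
\end{verbatim}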
\subsubsection{Invalid State on Operation Failure}
@@ -96,30 +90,20 @@ To circumvent this deadlock, the initial state of \texttt{CacheData} was modifie
\subsubsection{Locally Invalid State due to Race Condition}
\todo{wording for subsection is complicated}
The guarantee of \texttt{std::atomic<T>::wait} to only wake up when the value has changed \cite{cppreference:atomic-wait} was found to be stronger than the promise of waking up all waiting threads with \texttt{std::atomic<T>::notify\_all} \cite{cppreference:atomic-notify-all}. As visible in Figure \ref{fig:impl-cachedata-waitoncompletion}, we wait while the handlers-pointer is \texttt{nullptr}, if the cache pointer is invalid. To exemplify we use the following scenario. Both \(T_1\) and \(T_2\) call \texttt{CacheData::WaitOnCompletion}, with \(T_1\) preceding \(T_2\). \(T_1\) exchanges the global handlers-pointer with \texttt{nullptr}, invalidating it. Before \(T_1\) can check the status of the handlers and update the cache pointer, \(T_2\) sees an invalid cache pointer and then waits for the handlers becoming available. \par
The guarantee of \texttt{std::atomic<T>::wait} to only wake up when the value has changed \cite{cppreference:atomic-wait} was found to be stronger than the promise of waking up all waiting threads with \texttt{std::atomic<T>::notify\_all} \cite{cppreference:atomic-notify-all}. As visible in Figure \ref{fig:impl-cachedata-waitoncompletion}, we wait while the handlers-pointer is \texttt{nullptr} if the cache pointer is invalid. To exemplify, consider the following scenario. Both \(T_1\) and \(T_2\) call \texttt{CacheData::WaitOnCompletion}, with \(T_1\) preceding \(T_2\). \(T_1\) exchanges the global handlers-pointer with \texttt{nullptr}, invalidating it. Before \(T_1\) can check the status of the handlers and update the cache pointer, \(T_2\) sees an invalid cache pointer and begins waiting for the handlers to become available. This again causes a similar state of invalidity as handled in the previous two Sections. As the handlers will not become available again, having been cleared by \(T_1\), the second consumer \(T_2\) will wait indefinitely. This missed update is commonly referred to as the \enquote{ABA-Problem}, for which multiple solutions exist. \par
This has again caused a similar state of invalidity as the previous two Sections handled. As the handlers will not become available again due to being cleared by \(T_1\), the second consumer, \(T_2\), will now wait indefinitely. This missed update is commonly referred to as \enquote{ABA-Problem} for which multiple solutions exist. \par
One could use double-width atomic operations which would allow resetting the pointer back to \texttt{nullptr} while setting a flag indicating the exchange took place. The handlers-pointer would then be contained in a struct with this flag, allowing exchange with a composite of \texttt{nullptr} and flag-set. Other threads then would then wait on the struct changing from \texttt{nullptr} and flag-unset, allowing them to pass if either the flag is set or the handlers have become non-null. As standard C++ does not yet support the required operations, we chose to avoid the missed update differently. \cite{dwcas-cpp} \par
One could use double-width atomic operations, pairing the handlers-pointer with a flag or version counter in one struct, which would allow resetting the pointer back to \texttt{nullptr} while simultaneously recording that the exchange took place. Other threads would then wait on the struct changing from the composite of \texttt{nullptr} and flag-unset, allowing them to pass if either the flag is set or the handlers have become non-null. As standard C++ does not yet support the required double-width operations \cite{dwcas-cpp}, we chose to avoid the missed update differently. \par
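A sketch of this rejected alternative follows; the type is illustrative, and whether the composite is actually lock-free depends on platform support for double-width atomics. \par
\begin{verbatim}
#include <atomic>
#include <cstdint>

struct handler_list;  // opaque handler type, for illustration

// pointer and flag packed into one double-width atomic; exchange
// writes {nullptr, 1}, so waiters can distinguish "not yet
// submitted" ({nullptr, 0}) from "taken by another thread"
struct alignas(16) TaggedHandlers {
    handler_list* ptr;
    uint64_t exchanged;
};

std::atomic<TaggedHandlers> handlers{TaggedHandlers{nullptr, 0}};
// waiters: handlers.wait(TaggedHandlers{nullptr, 0});
\end{verbatim}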
The chosen solution for this is to not exchange the handlers-pointer with \texttt{nullptr} but with a second invalid value. We must determine a secondary invalid pointer for use in the exchange. Therefore, we introduce a new attribute, of the same type as the one pointed to by the handlers-pointer, to \texttt{Cache}. The \texttt{Cache} then shares it with each instance of \texttt{CacheData}, where it is then used in \texttt{CacheData::WaitOnCompletion}. \par
This secondary value allows \(T_2\) to pass the wait, then perform the exchange of handlers itself. \(T_2\) then checks the local copy of the handlers-pointer for validity. The invalid state now includes both \texttt{nullptr} and the secondary invalid pointer chosen. With this, the deadlock is avoided and \(T_2\) will wait for \(T_1\) completing the processing of the handlers. \par
The solution for this is to exchange the handlers-pointer not with \texttt{nullptr} but with a second invalid pointer. We must determine a value for use in the exchange and therefore introduce a new attribute to \texttt{Cache}, whose address is shared with each \texttt{CacheData} instance upon creation and then used in \texttt{CacheData::WaitOnCompletion} as the exchange value instead of \texttt{nullptr}. This secondary value allows \(T_2\) to pass the wait and then perform the exchange of the handlers itself. \(T_2\) then checks the local copy of the handlers-pointer for validity, where the invalid state now includes both \texttt{nullptr} and the chosen secondary invalid pointer. With this, the deadlock is avoided, and \(T_2\) will wait for \(T_1\) to complete processing the handlers by entering the false-branch of the validity check for \enquote{local\_handlers}, as shown in Figure \ref{fig:impl-cachedata-waitoncompletion}. \par
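Relative to the earlier sketch of \texttt{CacheData::WaitOnCompletion}, only the exchange value and the validity check change; the names remain illustrative assumptions. \par
\begin{verbatim}
// invalid_handlers_ points to the dummy value owned by Cache and
// shared with every CacheData instance on creation
void CacheData::WaitOnCompletion() {
    while (cache_->load() == nullptr) {
        handlers_->wait(nullptr);  // secondary invalid value passes
        handler_list* local = handlers_->exchange(invalid_handlers_);
        if (local == nullptr || local == invalid_handlers_) {
            // false-branch: another thread processes the handlers;
            // wait for it to publish the cache pointer instead
            cache_->wait(nullptr);
            continue;
        }
        WaitOnHandlers(local);           // assumed helper, as before
        cache_->store(final_location_);  // publish the valid location
        cache_->notify_all();
    }
}
\end{verbatim}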
\section{Application to \glsentrylong{qdp}}
\label{sec:impl:application}
Applying the \texttt{Cache} to \gls{qdp} is a straightforward process. We adapted the benchmarking code developed by Anna Bartuschka and André Berthold \cite{dimes-prefetching}, invoking \texttt{Cache::Access} for both prefetching and cache access. Due to the high number of small submissions, we decided to forego splitting tasks onto multiple \gls{dsa} and instead distribute the copy tasks across the engines per thread in round-robin fashion. This reduces the delay from submission cost, which, as shown in Section \ref{subsec:perf:submitmethod}, rises for smaller tasks. The cache location is fixed to \gls{numa:node} 8, the \gls{hbm} accessor of \gls{numa:node} 0, to which the application is bound and on which it therefore exclusively runs. \par
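A minimal sketch of this per-thread round-robin distribution follows; the selection interface is an assumption for illustration, not the actual policy function. \par
\begin{verbatim}
#include <cstddef>
#include <vector>

int SelectDsaRoundRobin(const std::vector<int>& dsa_nodes) {
    // every worker thread cycles through the engines independently,
    // and each task is submitted whole to one DSA (no splitting)
    static thread_local std::size_t next = 0;
    return dsa_nodes[next++ % dsa_nodes.size()];
}
\end{verbatim}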
During the performance analysis of the developed \texttt{Cache}, we discovered that \gls{intel:dml} does not utilize interrupt-based completion signalling (Section \ref{subsubsec:state:completion-signal}), but instead employs busy-waiting on the completion descriptor being updated. Given that this busy waiting incurs CPU cycles, waiting on task completion is deemed impractical, necessitating code modifications. We extended \texttt{CacheData} and Cache to incorporate support for weak waiting. By introducing a flag configurable in \texttt{Cache}, all instances of \texttt{CacheData} created via \texttt{Cache::Access} will check only once whether the \gls{dsa} has completed processing \texttt{Cache} operation, and if not, return without updating the cache-pointer. Consequently, calls to \texttt{CacheData::GetDataLocation} may return \texttt{nullptr} even after waiting, placing the responsibility on the user to access the data through its source location. For applications prioritizing latency, \texttt{Cache::Access} offers the option for weak access. When activated, the function returns only existing instances of \texttt{CacheData}, thereby avoiding work submission to the \gls{dsa} if the address has not been previously cached or was flushed since the last access. Using these two options, we can avoid work submission and busy waiting where access latency is paramount. \par
\todo{write our observations and then link to the to-be-added section describing these updates}
Additionally, we observed inefficiencies stemming from page fault handling. Task execution time increases when page faults are handled by the \gls{dsa}, leading to cache misses. Consequently, our execution time becomes bound to that of \gls{dram}, as misses prompt a fallback to the data's source location. When page faults are handled by the CPU during allocation, these misses are avoided. However, the execution time of the first data access through the \texttt{Cache} significantly increases due to page fault handling. One potential solution entails bypassing the system's memory management by allocating a large memory block and implementing a custom memory management scheme. As memory allocation is a complex topic, we opted to delegate this responsibility to the user by mandating the provision of new- and free-like functions akin to the policy functions utilized for determining placement and task distribution. Consequently, the benchmark can pre-allocate the required memory blocks, trigger page mapping, and subsequently pass these regions to the \texttt{Cache}. With this we also remove the dependency on libnuma, mentioned as a restriction in Section \ref{subsec:design:restrictions}. \par
During the performance analysis of the developed \texttt{Cache}, we discovered that \gls{intel:dml} does not utilize interrupt-based completion signalling (Section \ref{subsubsec:state:completion-signal}), but instead employs busy-waiting on the completion descriptor being updated. Given that this busy waiting incurs CPU cycles, waiting on task completion is deemed impractical, necessitating code modifications. We extended \texttt{CacheData} and \texttt{Cache} to incorporate support for weak waiting and added the possibility of weak access, as described in Section \ref{subsec:design:weakop-and-pf}. Using these two options, we can avoid work submission and busy waiting where access latency is paramount. \par
\todo{dont write about modifying the design but adapt the design and add forward references there to here to explain necessity}
Additionally, we observed inefficiencies stemming from page fault handling. Task execution time increases when page faults are handled by the \gls{dsa}, leading to cache misses. Consequently, our execution time becomes bound to that of \gls{dram}, as misses prompt a fallback to the data's source location. When page faults are handled by the CPU during allocation, these misses are avoided, but the execution time of the first data access through the \texttt{Cache} then increases significantly due to page fault handling. One potential solution entails bypassing the system's memory management by allocating a large memory block and implementing a custom memory management scheme. As memory allocation is a complex topic, we opted to delegate this responsibility to the user through the new- and free-like functions introduced in Section \ref{subsec:design:policy-functions}. Consequently, the benchmark can pre-allocate the required memory blocks, trigger page mapping, and subsequently pass these regions to the \texttt{Cache}. When these memory management functions guarantee that pages will be mapped, or when access latency is more important than using the cache, the user can then elect to forego page fault handling on the \gls{dsa}, as enabled by the design in Section \ref{subsec:design:weakop-and-pf}. \par
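A minimal sketch of such user-supplied functions follows; the signatures and the registration call are assumptions for illustration, and a real benchmark would additionally place the block on the desired \gls{numa:node}. \par
\begin{verbatim}
#include <cstdlib>
#include <cstring>

// assumed signatures for the new- and free-like policy functions;
// size is assumed to be a multiple of the page size
void* PreFaultedAlloc(std::size_t size) {
    void* block = std::aligned_alloc(4096, size);
    // touch every page once, so the CPU handles all page faults
    // here instead of the DSA stalling on them during the first copy
    std::memset(block, 0, size);
    return block;
}

void PreFaultedFree(void* block, std::size_t) {
    std::free(block);
}

// assumed registration with the cache:
// cache.SetMemoryFunctions(PreFaultedAlloc, PreFaultedFree);
\end{verbatim}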
%%% Local Variables:
%%% TeX-master: "diplom"
