\chapter{Design}
\label{chap:design}
% This is the central chapter of the thesis. Here the goal as well as the
% author's own ideas, assessments and design decisions are presented. It can
% be worthwhile to explore several alternatives and then explicitly justify
% why a particular one was chosen. This chapter should be sketched, at least
% in keywords, as soon as the first design decisions are made.
% In a normally progressing thesis, however, it will change continuously.
% The chapter must not become too detailed, otherwise the reader
% gets bored. It is very important to find the right level of abstraction.
% While writing, one should keep the reusability of the text in mind.
% If a publication based on the thesis is planned, parts can be taken from
% this chapter. The chapter will usually have at least 8 pages; more than 20
% can be an indication that the right level of abstraction was missed.
In this chapter, we formulate a class interface for a general-purpose cache. We will outline the requirements and elucidate the solutions employed to address them, culminating in the final architecture. Details pertaining to the implementation of this blueprint will be deferred to Chapter \ref{chap:implementation}, where we delve into a selection of relevant aspects. \par
The code contributed by this work targets the acceleration of \glsentrylong{qdp} by offloading copy operations to the \gls{dsa}. Prefetching is inherently related to cache functionality. Given that an application providing the latter offers a broader scope of utility beyond \gls{qdp}, we opted to implement an offloading \texttt{Cache}. \par
\begin{figure}[!t]
\centering
\includegraphics[width=0.9\textwidth]{images/uml-cache-and-cachedata.pdf}
\caption{Public Interface of the \texttt{CacheData} and \texttt{Cache} Classes. Colour coding denotes thread safety: grey functions are unsafe for concurrent access, green functions are fully thread safe using only atomics, yellow functions may use locking internally but remain safe to call concurrently, and red functions must be called from a single-threaded context.}
\label{fig:impl-design-interface}
\end{figure}
\section{Interface}
\label{sec:design:interface}
The interface of \texttt{Cache} must provide three basic functions: (1) requesting a memory block to be cached, (2) accessing a cached memory block and (3) synchronizing the cache with the source memory. The latter operation comes into play when the cached data may also be modified, necessitating synchronization. Due to the various setups and use cases for this cache, the user should also be responsible for choosing the cache placement and the copy method. As re-caching is resource intensive, data should remain in the cache for as long as possible; we only flush entries when a lack of free cache memory requires it. \par
Given that this work primarily focuses on caching static data, we only provide cache invalidation and not synchronization. The \texttt{Cache::Invalidate} function, given a memory address, will remove all entries for it from the cache. The other two operations, caching and access, are provided by a single function, which we shall henceforth call \texttt{Cache::Access}. This function receives a data pointer and size as parameters and either submits a caching operation, if the received pointer is not yet cached, or returns the existing cache entry. \par
Given the asynchronous nature of caching operations, users may opt to await their completion. This proves particularly beneficial when parallel threads are actively processing, and the current thread strategically pauses until its data becomes available in faster memory, thereby optimizing access speeds for local computations. To facilitate this process, the \texttt{Cache::Access} method returns an instance of an object referred to as \texttt{CacheData}. Figure \ref{fig:impl-design-interface} documents the public interface for \texttt{CacheData} on the left block labelled as such. Invoking \texttt{CacheData::GetDataLocation} provides access to a pointer to the location of the cached data. Additionally, the \texttt{CacheData::WaitOnCompletion} method is available, designed to return only upon the completion of the caching operation. During this period, the current thread will sleep, allowing unimpeded progress for other threads. To ensure that only pointers to valid memory regions are returned, this function must be called in order to update the cache pointer which otherwise has an undefined value. It queries the completion state of the operation, and, on success, updates the cache pointer to the then available memory region. \par
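The following listing sketches the intended usage from a consumer's point of view. It is illustrative only: the parameter and return types shown here, such as \texttt{Cache::Access} accepting a pointer with a size and returning a smart pointer to \texttt{CacheData}, are assumptions made for this sketch rather than a definitive specification. \par
\begin{lstlisting}[language=C++]
// Illustrative consumer-side usage of the Cache; the exact signatures
// (pointer-and-size parameters, smart-pointer return type) are assumed
// for this sketch.
#include <cstdint>
#include <cstddef>
#include <memory>

void ProcessBlock(Cache& cache, uint8_t* data, size_t size) {
    // Either submits a caching operation for the block or returns
    // the already existing entry.
    std::unique_ptr<CacheData> entry = cache.Access(data, size);

    // Blocks until the copy to faster memory has completed; only
    // afterwards does GetDataLocation() return a valid pointer.
    entry->WaitOnCompletion();

    uint8_t* cached = entry->GetDataLocation();

    // ... perform the local computation on "cached" ...
}
\end{lstlisting}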
\subsection{Policy Functions}
\label{subsec:design:policy-functions}
In the introduction to this chapter, we mentioned that cache placement and the selection of \gls{dsa}s participating in a copy are the responsibility of the user. As we will find out in Section \ref{sec:impl:application}, allocating memory inside the cache is not feasible due to the delays this may incur. Therefore, the user is also required to provide dynamic memory management functionality to the \texttt{Cache}. The former is realized by what we will call \enquote{Policy Functions}, which are function pointers passed on initialization, as visible in \texttt{Cache::Init} in Figure \ref{fig:impl-design-interface}. We use the same methodology for the latter, requiring that function pointers performing dynamic memory management be passed as well. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed for the implementation in Chapter \ref{chap:implementation}, while we detail the required behaviour here. \par
The policy functions receive parameters deemed sensible for determining placement and participation selection. Both are informed of the source \glsentryshort{node}, the \glsentryshort{node} requesting caching, and the data size. The cache placement policy then returns a \glsentryshort{node}-ID on which the data is to be cached, while the copy policy will provide the cache with a list of \glsentryshort{node}-IDs, detailing which \gls{dsa}s should participate in the operation. \par
For memory management, two functions are required, providing allocation (\texttt{malloc}) and deallocation (\texttt{free}) capabilities to the \texttt{Cache}. As the naming scheme suggests, these two functions must adhere to the behaviour and thread safety guarantees set forth by the C++ standard for their namesakes. Most notably, \texttt{malloc} must never return the same address for two allocations that are in use at the same time, whether requested subsequently or concurrently. \par
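To make the expected behaviour more tangible, the following sketch shows one possible set of signatures for the policy and memory management hooks. The type aliases and parameter lists are assumptions for illustration; the actual signatures are given by \texttt{Cache::Init} in Figure \ref{fig:impl-design-interface}. \par
\begin{lstlisting}[language=C++]
// Possible signatures for the user-provided hooks; names and parameter
// lists are illustrative assumptions.
#include <cstddef>
#include <vector>

// Cache placement policy: returns the node-ID on which to cache the data.
using CachePolicy = int (*)(int numa_dst_node, int numa_src_node,
                            size_t data_size);

// Copy policy: returns the node-IDs whose DSAs participate in the copy.
using CopyPolicy = std::vector<int> (*)(int numa_dst_node, int numa_src_node,
                                        size_t data_size);

// Thread-safe allocation and deallocation hooks with malloc/free semantics.
using MemoryAllocator   = void* (*)(size_t size, int numa_node);
using MemoryDeallocator = void (*)(void* pointer);
\end{lstlisting}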
\subsection{Cache Entry Reuse}
\label{subsec:design:cache-entry-reuse}
When multiple consumers wish to access the same memory block through the \texttt{Cache}, we face a choice between providing each with their own entry or sharing one for all consumers. The first option may lead to high load on the accelerator due to multiple copy operations being submitted and also increases the memory footprint of the cache. The latter option, although more complex, was chosen to address these concerns. To implement this, the existing \texttt{CacheData} will be extended in scope to handle multiple consumers. Copies of it can be created, and they must synchronize with each other for \texttt{CacheData::WaitOnCompletion} and \texttt{CacheData::GetDataLocation}. This is illustrated by the green markings, indicating thread safety guarantees for access, in Figure \ref{fig:impl-design-interface}. The \texttt{Cache} must therefore also ensure that, on concurrent access to the same resource, only one thread creates \texttt{CacheData} while the others are provided with a copy. \par
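A conceptual sketch of how copies of \texttt{CacheData} could share their state is given below. The member layout, in particular sharing an atomic cache pointer between all copies, is an assumption used for illustration rather than the actual implementation. \par
\begin{lstlisting}[language=C++]
// Conceptual sketch: all copies of a CacheData instance share the same
// cache pointer, so completion observed by one copy is visible to all.
// Member names and types are illustrative assumptions.
#include <atomic>
#include <cstdint>
#include <cstddef>
#include <memory>

class CacheData {
    uint8_t* src_;      // source memory region
    size_t   size_;     // size of the cached block

    // Shared between all copies; set once the caching operation completed.
    std::shared_ptr<std::atomic<uint8_t*>> cache_ptr_;

public:
    // Copying shares the underlying state instead of creating a new entry.
    CacheData(const CacheData& other) = default;

    uint8_t* GetDataLocation() const {
        return cache_ptr_->load(std::memory_order_acquire);
    }
};
\end{lstlisting}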
\subsection{Cache Entry Lifetime}
\label{subsec:design:cache-entry-lifetime}
Allowing multiple references to the same entry introduces concerns regarding memory management. The allocated block should only be freed when all copies of a \texttt{CacheData} instance are destroyed, thereby tying the cache entry's lifetime to the longest living copy of the original instance. This ensures that access to the entry is legal during the lifetime of any \texttt{CacheData} instance. Therefore, deallocation only occurs when the last copy of a \texttt{CacheData} instance is destroyed. \par
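One conceivable way to realize this lifetime coupling is to hand the allocated block to a reference-counted smart pointer whose deleter invokes the user-provided deallocation function, as sketched below under the assumption that such a wrapper is used internally. \par
\begin{lstlisting}[language=C++]
// Sketch: deallocation is tied to the destruction of the last CacheData
// copy by wrapping the cache memory in a shared_ptr with a custom deleter.
// This is one possible realization, not necessarily the implemented one.
#include <cstdint>
#include <memory>

std::shared_ptr<uint8_t> WrapCacheMemory(uint8_t* cache_mem,
                                         void (*user_free)(void*)) {
    // The deleter runs exactly once, when the last shared reference,
    // and therefore the last CacheData copy holding it, is destroyed.
    return std::shared_ptr<uint8_t>(cache_mem, [user_free](uint8_t* ptr) {
        user_free(ptr);
    });
}
\end{lstlisting}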
\subsection{Weak Behaviour and Page Fault Handling}
\label{subsec:design:weakop-and-pf}
During our testing phase, we discovered that \gls{intel:dml} does not support interrupt-based completion signalling, as discussed in Section \ref{subsubsec:state:completion-signal}. Instead, it resorts to busy-waiting, which consumes CPU cycles, whereas interrupt-based signalling would primarily benefit power consumption during copies \cite{intel:analysis}. To mitigate this issue, we extended the functionality of both \texttt{Cache} and \texttt{CacheData}, providing weak versions of \texttt{Cache::Access} and \texttt{CacheData::WaitOnCompletion}. The weak wait function only checks for operation completion once and then returns, relaxing the guarantee that the cache location will be valid after the call. Hence, the user must verify validity even after the wait function if they choose to utilize this option. Similarly, weak access merely returns a pre-existing instance of \texttt{CacheData}, bypassing caching. This feature proves beneficial in latency-sensitive scenarios where the overhead from cache operations and waiting for operation completion is undesirable. \par
Additionally, while optimizing for access latency, we encountered delays caused by page fault handling on the \gls{dsa}, as depicted in Figure \ref{fig:dml-memcpy}. These delays not only affect the current task but also impede the progress of other tasks on the \gls{dsa}. Consequently, the default behaviour of the cache is set to trigger an error on page faults, while still offering the option to let the \gls{dsa} handle the fault. \par
To configure runtime behaviour, we introduced a flag system to both \texttt{Cache} and \texttt{CacheData}, with the latter inheriting any flags set in the former upon creation. This design allows for global settings, such as opting for weak waits or enabling page fault handling. Weak waits can also be selected in specific situations by setting the flag on the \texttt{CacheData} instance. For \texttt{Cache::Access}, the flag must be set for each function call, defaulting to strong access, as exclusively using weak access would result in no cache utilization. \par
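The sketch below illustrates how such a flag system could look. The flag names, their bitmask representation and the setter functions are assumptions for illustration only. \par
\begin{lstlisting}[language=C++]
// Illustrative flag mechanism; flag names, bit positions and the setter
// functions are assumptions.
#include <cstdint>

namespace flags {
  constexpr uint64_t kWeakWait         = 1ull << 0; // check completion once, then return
  constexpr uint64_t kWeakAccess       = 1ull << 1; // only return existing entries, never cache
  constexpr uint64_t kHandlePageFaults = 1ull << 2; // let the DSA resolve faults instead of erroring

  constexpr uint64_t kDefault = 0; // strong access, strong wait, error on page fault
}

// Global setting: CacheData inherits the flags set on the Cache.
//   cache.SetFlags(flags::kHandlePageFaults);
//
// Per-entry override: opt into a weak wait for one entry only.
//   entry->SetFlags(entry->GetFlags() | flags::kWeakWait);
//   entry->WaitOnCompletion(); // returns after a single completion check
\end{lstlisting}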
\section{Usage Restrictions}
\label{sec:design:restrictions}
In the context of this work, the cache primarily manages static data, leading to two restrictions placed on the invalidation operation, allowing significant reductions in design complexity. Firstly, due to the cache's design, overlapping areas in the cache will lead to undefined behaviour during the invalidation of any one of them. Only entries with equivalent source pointers will be invalidated, while other entries with differing source pointers, still covering the now-invalidated region due to their size, will remain unaffected. Consequently, the cache may or may not continue to contain invalid elements. Secondly, invalidation is a manual process, necessitating the programmer to recall which data points are currently cached and to invalidate them upon modification. In this scenario, no ordering guarantees are provided, potentially resulting in threads still holding pointers to now-outdated entries and continuing their progress with this data. \par
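The following hypothetical access sequence illustrates the first restriction, using the illustrative interface sketched in Section \ref{sec:design:interface}. \par
\begin{lstlisting}[language=C++]
// Hypothetical sequence illustrating the overlap restriction on
// invalidation; sizes and offsets are arbitrary.
#include <cstdint>

void OverlapExample(Cache& cache, uint8_t* data) {
    auto a = cache.Access(data, 2048);       // entry keyed on "data"
    auto b = cache.Access(data + 512, 2048); // overlapping entry with a
                                             // different source pointer
    a->WaitOnCompletion();
    b->WaitOnCompletion();

    // ... the region behind "data" is modified ...

    cache.Invalidate(data); // removes only the entry keyed on "data";
                            // the entry behind "b" still covers part of
                            // the modified region and now serves stale data
}
\end{lstlisting}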
Additionally, the cache inherits some restrictions due to its utilization of the \gls{dsa}. As mentioned in Section \ref{subsubsec:state:ordering-guarantees}, only write-ordering is guaranteed, and only under specific circumstances. Although we configured the necessary parameters for this guarantee in Section \ref{sec:state:setup-and-config}, load balancing over multiple \gls{dsa}s, as described in Section \ref{sec:impl:application}, can introduce scenarios where writes to the same address are submitted on different accelerators. As the ordering guarantee only holds within a single \gls{dsa}, undefined behaviour can occur in \enquote{multiple-writers} scenarios, in addition to the \enquote{read-after-write} scenarios. However, due to the constraints outlined in Section \ref{sec:design:interface}, the \enquote{multiple-writers} scenario is prevented by ensuring that only one thread can perform the caching task for a given datum. Moreover, the requirement that user-provided memory management functions be thread-safe (Section \ref{subsec:design:policy-functions}) ensures that two concurrent cache accesses will never receive the same memory region for their task. These two guarantees in conjunction secure the cache's integrity. Hence, the only relevant scenario is \enquote{read-after-write}, which is also accounted for, since the cache pointer is updated by \texttt{CacheData::WaitOnCompletion} only when all operations have concluded. Situations where a caching task (read) depends on the results of another task (write) are thereby prevented. \par
Despite accounting for the complications in operation ordering, one potential situation may still lead to undefined behaviour. Since the \gls{dsa} operates asynchronously, modifying data for a region present in the cache before ensuring that all caching operations have completed through a call to \texttt{CacheData::WaitOnCompletion} will result in an undefined state for the cached region. Therefore, it is imperative to explicitly wait for data present in the cache to avoid such scenarios. \par
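Under the assumption of the interface sketched earlier, the required ordering when modifying the source of a cached region therefore looks as follows. \par
\begin{lstlisting}[language=C++]
// Sketch of the required ordering when the source of a cached region is
// about to be modified; signatures follow the earlier illustrative sketches.
#include <cstdint>
#include <memory>

void ModifySource(Cache& cache, std::unique_ptr<CacheData>& entry,
                  uint8_t* data) {
    // Ensure the asynchronous caching operation has fully completed
    // before the source region is touched.
    entry->WaitOnCompletion();

    // ... modify the region behind "data" ...

    // The cached copy is now stale and must be invalidated manually.
    cache.Invalidate(data);
}
\end{lstlisting}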
%%% Local Variables:
%%% TeX-master: "diplom"
%%% End: