\chapter{Design}
\label{chap:design}
% This is the central chapter of the thesis. It presents the goal as well as
% the author's own ideas, assessments, and design decisions. It can be
% worthwhile to play through different alternatives and then explicitly
% justify why a particular one was chosen. This chapter should be sketched
% - at least in keywords - as soon as the first design decisions are made.
% In a normally progressing thesis, however, something about it will change
% continuously. The chapter must not become too detailed, otherwise the
% reader gets bored. It is very important to find the right level of
% abstraction. While writing, one should keep the reusability of the text
% in mind. If a publication is planned based on the thesis, parts can be
% taken from this chapter. The chapter will usually have at least 8 pages;
% more than 20 can be an indication that the level of abstraction was missed.
In this chapter, we design a class interface for a general-purpose cache. We will outline the requirements and elucidate the solutions employed to address them, culminating in the final architecture. Details pertaining to the implementation of this blueprint will be deferred to Chapter \ref{chap:implementation}, where we delve into a selection of relevant aspects. \par
The target application of the code contributed by this work is to accelerate \glsentrylong{qdp} by offloading copy operations to the \gls{dsa}. Prefetching is inherently related to cache functionality. Given that an application providing the latter offers a broader scope of utility beyond \gls{qdp}, we opted to implement an offloading \texttt{Cache}. \par
\begin{figure}[!b]
\centering
\includegraphics[width=0.9\textwidth]{images/uml-cache-and-cachedata.pdf}
\caption{Public Interface of the \texttt{CacheData} and \texttt{Cache} Classes. Colour coding indicates thread safety: grey denotes that threaded access is not possible, green indicates full safety guarantees relying only on atomics, yellow may use locking but is still safe for use, and red must be called from a single-threaded context.}
\label{fig:impl-design-interface}
\end{figure}
\pagebreak
\section{Interface}
\label{sec:design:interface}
The interface of \texttt{Cache} must provide three basic functions: (1) requesting a memory block to be cached, (2) accessing a cached memory block and (3) synchronizing the cache with the source memory. Operation (3) is required when the cached data may also be modified, necessitating synchronization between cache and source memory. Due to the various setups and use cases for the \texttt{Cache}, the user should also be responsible for choosing cache placement and the copy method. As re-caching is resource intensive, data should remain in the cache for as long as possible. We only flush unused entries when memory pressure during the access of a new memory block demands it. \par
Given that this work primarily focuses on caching static data, we only provide cache invalidation and not synchronization. The \texttt{Cache::Invalidate} function, given a memory address, will remove all entries for it from the cache. Operations (1) and (2) are provided by a single function, which we call \texttt{Cache::Access}. This function receives a data pointer and size as parameters. Its behaviour depends on whether the cache already contains an entry for the requested block: a caching operation is only submitted if the received pointer is not yet cached, and the cache entry is returned in both cases. The operations \texttt{Cache::Access} and \texttt{Cache::Invalidate}, along with additional methods of the \texttt{Cache} class deemed useful but not covered by the behavioural requirements, can be viewed in Figure \ref{fig:impl-design-interface}. \par
Given the asynchronous nature of caching operations, users may opt to await their completion. This proves particularly beneficial when parallel threads are actively processing, and the current thread strategically pauses until its data becomes available in faster memory, thereby optimizing access speeds for local computations and allowing work to continue in parallel. To facilitate this process, the \texttt{Cache::Access} method returns an instance of the class \texttt{CacheData}. Figure \ref{fig:impl-design-interface} also documents the public interface for \texttt{CacheData} in the left block. Invoking \texttt{CacheData::GetDataLocation} provides access to a pointer to the location of the cached data. Additionally, the \texttt{CacheData::WaitOnCompletion} method is available, designed to return only upon the completion of the caching operation. During this period, the current thread will sleep. A call to \texttt{CacheData::WaitOnCompletion} is required should the user desire to use the cache, as only through it is the cache pointer updated. \par
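To illustrate the intended usage of this interface, the following sketch shows a possible access sequence on a previously initialized \texttt{Cache} instance \texttt{cache}. The exact signatures and return types are defined by Figure \ref{fig:impl-design-interface}; the pointer-like handling of the returned entry and the helper \texttt{compute} are illustrative assumptions. \par
\begin{verbatim}
// Illustrative usage sketch; exact signatures and return types are
// assumptions and may differ from the interface shown in the figure.
void process(uint8_t* data, size_t size) {
    // (1)+(2): returns the cache entry, submitting a copy operation
    // only if the pointer is not yet cached
    auto entry = cache.Access(data, size);

    // block until the asynchronous caching operation has completed,
    // which also updates the cache pointer held by the entry
    entry->WaitOnCompletion();

    // obtain the location of the cached copy and work on it
    uint8_t* cached = entry->GetDataLocation();
    compute(cached, size);
}
\end{verbatim}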
\subsection{Policy Functions}
\label{subsec:design:policy-functions}
In the introduction of this chapter, we mentioned that cache placement and the selection of copy-participating \gls{dsa}s are the responsibility of the user. As we will find out in Section \ref{sec:impl:application}, allocating memory inside the cache is not feasible due to the delays possibly encountered. Therefore, the user is also required to provide functionality for dynamic memory management to the \texttt{Cache}. The former is realized by what we will call \enquote{Policy Functions}, which are function pointers passed on initialization, as visible in \texttt{Cache::Init} in Figure \ref{fig:impl-design-interface}. We use the same methodology for the latter, requiring functions performing dynamic memory management. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed for the implementation in Chapter \ref{chap:implementation}, while we detail their required behaviour here. \par
The policy functions receive arguments deemed sensible for determining placement and participation selection. Both are informed of the source \glsentryshort{node}, the \glsentryshort{node} requesting caching, and the data size. The cache placement policy then returns a \glsentryshort{node}-ID on which the data is to be cached, while the copy policy will provide the cache with a list of \glsentryshort{node}-IDs, detailing which \gls{dsa}s should participate in the operation. \par
For memory management, two functions are required, providing allocation (\texttt{malloc}) and deallocation (\texttt{free}) capabilities to the \texttt{Cache}. Following the naming scheme, these two functions must adhere to the thread safety guarantees and behaviour set forth by the C++ standard. Most notably, \texttt{malloc} must never return the same address for subsequent or concurrent calls. \par
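To make these requirements more concrete, the following sketch shows one possible shape of the policy and memory-management functions passed to \texttt{Cache::Init}. The parameter lists, types and the initialization call are illustrative assumptions derived from the behaviour described above; the authoritative signatures are given in Figure \ref{fig:impl-design-interface}. \par
\begin{verbatim}
// Illustrative signatures only; parameter types are assumptions.

// Cache placement policy: returns the node-ID on which to cache the data.
int CachePlacementPolicy(int source_node, int requesting_node, size_t size) {
    return requesting_node;  // e.g. place the copy on the requesting node
}

// Copy policy: returns the node-IDs whose DSAs participate in the copy.
std::vector<int> CopyPolicy(int source_node, int requesting_node, size_t size) {
    return { source_node, requesting_node };
}

// Memory management hooks with malloc/free semantics; both must satisfy
// the thread safety guarantees set forth by the C++ standard.
void* CacheAlloc(size_t size) { return std::malloc(size); }
void  CacheFree(void* ptr)    { std::free(ptr); }

// Hypothetical initialization call (parameter order is an assumption):
// cache.Init(CachePlacementPolicy, CopyPolicy, CacheAlloc, CacheFree);
\end{verbatim}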
\subsection{Cache Entry Reuse}
\label{subsec:design:cache-entry-reuse}
When multiple consumers wish to access the same memory block through the \texttt{Cache}, we face a choice between providing each with their own entry or sharing one for all consumers. The first option may lead to high load on the accelerator due to multiple copy operations being submitted and also increases the memory footprint of the cache. The latter option, although more complex, was chosen to address these concerns. To implement entry reuse, the existing \texttt{CacheData} will be extended in scope to handle multiple consumers. Copies of it can be created, and must synchronize with each other for \texttt{CacheData::WaitOnCompletion} and \texttt{CacheData::GetDataLocation}. This is illustrated by the green markings, indicating thread safety guarantees for access, in Figure \ref{fig:impl-design-interface}. The \texttt{Cache} must then ensure that, on concurrent access to the same resource, only one thread creates \texttt{CacheData} while the others are provided with a copy. \par
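Under the simplified interface sketched in Section \ref{sec:design:interface}, the intended sharing behaviour may be illustrated as follows: two threads requesting the same block obtain \texttt{CacheData} instances referring to one shared entry, so only one copy operation is submitted. The pointer-like handling of the returned entries and the helper \texttt{consume} are again assumptions. \par
\begin{verbatim}
// Both threads request the same block; only one copy is submitted.
std::thread t1([&] {
    auto entry = cache.Access(data, size);  // may submit the caching operation
    entry->WaitOnCompletion();
    consume(entry->GetDataLocation(), size);
});
std::thread t2([&] {
    auto entry = cache.Access(data, size);  // receives a copy referring to
                                            // the same shared entry
    entry->WaitOnCompletion();              // synchronizes on the same operation
    consume(entry->GetDataLocation(), size);
});
t1.join();
t2.join();
\end{verbatim}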
\subsection{Cache Entry Lifetime}
\label{subsec:design:cache-entry-lifetime}
Allowing multiple references to the same entry introduces concerns regarding memory management. The allocated block should only be freed when all copies of a \texttt{CacheData} instance are destroyed, thereby tying the cache entry's lifetime to the longest living copy of the original instance. This ensures that access to the entry is legal during the lifetime of any \texttt{CacheData} instance. Therefore, deallocation only occurs when the last copy of a \texttt{CacheData} instance is destroyed. \par
\subsection{Weak Behaviour and Page Fault Handling}
\label{subsec:design:weakop-and-pf}
During our testing phase, we discovered that \gls{intel:dml} does not support interrupt-based completion signaling, as discussed in Section \ref{subsubsec:state:completion-signal}. Instead, it resorts to busy-waiting, which consumes CPU cycles, thereby impacting concurrent operations. To mitigate this issue, we extended the functionality of both \texttt{Cache} and \texttt{CacheData}, providing weak versions of \texttt{Cache::Access} and \texttt{CacheData::WaitOnCompletion}. The weak wait function only checks for operation completion once and then returns, irrespective of the completion state, thereby relaxing the guarantee that the cache location will be valid after the call. Hence, the user must verify validity after the wait if they choose to utilize this option. Similarly, weak access merely returns a pre-existing instance of \texttt{CacheData}, bypassing work submission when no entry is present for the requested memory block. These features prove beneficial in latency-sensitive scenarios where the overhead from cache operations and waiting on completion is undesirable. As the weak functions constitute optional behaviour, not integral to the \texttt{Cache}'s operation, Figure \ref{fig:impl-design-interface} does not cover them. \par
Additionally, while optimizing for access latency, we encountered delays caused by page fault handling on the \gls{dsa}. These not only affect the current task but also block, and thereby impede, the progress of other tasks on the \gls{dsa}. Consequently, the \texttt{Cache} defaults to not handling page faults. \par
To configure runtime behaviour for page fault handling and access type, we introduced a flag system to both \texttt{Cache} and \texttt{CacheData}, with the latter inheriting any flags set in the former upon creation. This design allows for global settings, such as opting for weak waits or enabling page fault handling. Weak waits can also be selected in specific situations by setting the flag on the \texttt{CacheData} instance. For \texttt{Cache::Access}, the flag must be set for each function call, defaulting to strong access, as exclusively using weak access would result in no cache utilization. \par
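The flag system itself is an implementation detail; the following sketch merely illustrates the three configuration levels described above. All flag and setter names used here are hypothetical. \par
\begin{verbatim}
// Hypothetical flag and method names; the concrete flag system is an
// implementation detail and not fixed by this design.
cache.SetFlags(FLAG_HANDLE_PAGE_FAULTS);   // global default, inherited by
                                           // CacheData created afterwards

auto entry = cache.Access(data, size, FLAG_WEAK_ACCESS);  // per-call choice,
                                                          // default is strong
entry->SetFlags(FLAG_WAIT_WEAK);           // weak wait for this entry only

entry->WaitOnCompletion();  // checks completion once and returns immediately;
                            // the user must verify validity of the cache
                            // location before relying on the cached copy
\end{verbatim}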
\section{Usage Restrictions}
\label{sec:design:restrictions}
\begin{figure}[t]
\centering
\includegraphics[width=0.4\textwidth]{images/overlapping-pointer-access.pdf}
\caption{Visualization of a memory region and pointers \(A\) and \(B\). Due to the size of the data pointed to, the two memory regions overlap. The overlapping region is highlighted by the dashes.}
\label{fig:overlapping-access}
\end{figure}
In the context of this work, the cache primarily manages static data, leading to two restrictions placed on the invalidation operation, allowing significant reductions in design complexity. Firstly, due to the cache's design, accessing overlapping areas will lead to undefined behaviour during the invalidation of any one of them. Only entries with equivalent source pointers will be invalidated, while other entries with differing source pointers, still covering the now-invalidated region due to their size, will remain unaffected. Consequently, the cache may or may not continue to contain invalid elements. This scenario is depicted in Figure \ref{fig:overlapping-access}, where \(A\) points to a memory region which, due to its size, overlaps with the region pointed to by \(B\). Secondly, invalidation is a manual process, necessitating the programmer to recall which data points are currently cached and to invalidate them upon modification. In this scenario, no ordering guarantees are provided, potentially resulting in threads still holding pointers to now-outdated entries and continuing their progress with this data. \par
Additionally, the cache inherits some restrictions of the \gls{dsa}. Due to the asynchronous operation and internal workings, multiple operation ordering hazards (Section \ref{subsubsec:state:ordering-guarantees}) may arise, either internally within one \gls{dsa}, or externally when coming in conflict with other \gls{dsa}s. However, these hazards do not impact the user of \texttt{Cache}, as the design implicitly prevents hazardous situations. Specifically, \texttt{CacheData::WaitOnCompletion} ensures that memory regions written to by the \gls{dsa} are only accessible after operations have completed. This guards against both intra- and inter-accelerator ordering hazards. Memory regions that a \gls{dsa} may access are internally allocated in the \texttt{Cache} and become retrievable only after operation completion is signalled, thus preventing submissions for the same destination. Additionally, multiple threads submitting tasks for the same source location do not pose a problem, as the access is read-only; duplicate submission is also prevented by \texttt{Cache::Access} only performing work submission on one thread. \par
Despite accounting for the hazards in operation ordering, one potential situation may still lead to undefined behaviour. Since the \gls{dsa} operates asynchronously, modifying data for a region present in the cache before ensuring that all caching operations have completed, through a call to \texttt{CacheData::WaitOnCompletion}, will result in an undefined state for the cached region. Therefore, it is crucial to explicitly wait for caching operations to complete before modification of the source memory to avoid such scenarios. \par
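Using the simplified interface from Section \ref{sec:design:interface}, the required ordering when the source region of a cached entry must be modified can be sketched as follows; the helper \texttt{modify} is an assumption. \par
\begin{verbatim}
// Correct ordering when the source region of a cached entry is modified.
auto entry = cache.Access(data, size);

entry->WaitOnCompletion();   // ensure the DSA no longer accesses the region
modify(data, size);          // the source may now be modified safely
cache.Invalidate(data);      // remove the now-outdated entries for this pointer
\end{verbatim}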
Conclusively, the \texttt{Cache} should not be used with overlapping buffers. Caching mutable data presents design challenges that could be solved by implementing a specific templated data type, which we will call \texttt{CacheMutWrapper}. This type internally stores its data in a struct which can then be cached and is tagged with a timestamp. \texttt{CacheMutWrapper} then ensures that all \gls{mutmet} also invalidate the cache, perform waiting on operation completion to avoid modification while the \gls{dsa} accesses the cached region, and update the timestamp. This stamp is then used by further processing steps to verify that a value presented to them by a thread was calculated using the latest version of the data contained in \texttt{CacheMutWrapper}, guarding against the lack of ordering guarantees for invalidation, as mentioned in the first paragraph of this section. \par
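As \texttt{CacheMutWrapper} is only proposed here and not implemented as part of this work, the following sketch merely outlines one possible shape of such a wrapper; all member and method names are illustrative assumptions. \par
\begin{verbatim}
// Sketch of the proposed CacheMutWrapper; not implemented in this work,
// all names are illustrative assumptions.
template <typename T>
class CacheMutWrapper {
    struct Entry {
        T data;
        uint64_t timestamp;  // version tag checked by further processing steps
    } entry_;
    Cache* cache_;

public:
    // every mutating method follows this pattern
    template <typename Fn>
    void Modify(Fn&& mutate) {
        // wait until no caching operation accesses the region any more
        auto cached = cache_->Access(&entry_, sizeof(Entry));
        cached->WaitOnCompletion();

        mutate(entry_.data);          // perform the modification
        entry_.timestamp++;           // tag the new version
        cache_->Invalidate(&entry_);  // drop the now-outdated cache entries
    }

    uint64_t Timestamp() const { return entry_.timestamp; }
};
\end{verbatim}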
%%% Local Variables:
%%% TeX-master: "diplom"
%%% End: