@ -57,19 +57,19 @@ To be able to optimally utilize the Hardware, knowledge of its workings is requi
\label{fig:dsa-internal-block}
\end{figure}
The accelerator is directly integrated into the Processor and attaches via the I/O fabric interface over which all communication is conducted. Over this interface, it is accessible as a PCIe device. Configuration therefore is done through memory-mapped registers set in the devices \gls{bar}. Through these, the devices layout is defined and memory pages for work submission are set. In a system with multiple processing nodes, there may also be one \gls{dsa} per node.\par
The accelerator is directly integrated into the Processor and attaches via the I/O fabric interface over which all communication is conducted. Over this interface, it is accessible as a PCIe device. Configuration therefore is done through memory-mapped registers set in the devices \gls{bar}. Through these, the devices' layout is defined and memory pages for work submission are set. In a system with multiple processing nodes, there may also be one \gls{dsa} per node.\par
To satisfy different use cases, as already mentioned, the layout of the \gls{dsa} may be software-defined. The structure is made up of three components, namely \gls{dsa:wq}s, \gls{dsa:engine}s and \gls{dsa:group}s. \gls{dsa:wq}s provide the means to submit tasks to the device and will be described in more detail shortly. An \gls{dsa:engine} is the processing-block that connects to memory and performs the described task. Using \gls{dsa:group}s, \gls{dsa:engine}s and \gls{dsa:wq}s are tied together. This means, that tasks from one \gls{dsa:wq} may be processed from multiple \gls{dsa:engine}s and that vice-versa, depending on the configuration. This flexibility is achieved through the Group Arbiter which connects the two components and acts according to the setup. \par
A \gls{dsa:wq} is accessible through so-called portals, which are mapped memory regions. Submission of work is done by writing a descriptor to one of these portals. A descriptor is 64 Byte in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue of which there are two types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes and each group may only have one attached. A \gls{pcie-dmr}, which guarantees implicit synchronization, is generated via \gls{x86:enqcmd} and communicates with the device before writing. This results in higher submission cost, compared to the \gls{dsa:dwq} to which a descriptor is submitted via \gls{x86:movdir64b}. The \gls{dsa:dwq} is therefore more performant but may require access control mechanisms and may only be accessed by one process at a time. \par
A \gls{dsa:wq} is accessible through so-called portals, which are mapped memory regions. Submission of work is done by writing a descriptor to one of these portals. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue of which there are two types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes and each group may only have one attached. A \gls{pcie-dmr}, which guarantees implicit synchronization, is generated via \gls{x86:enqcmd} and communicates with the device before writing. This results in higher submission cost, compared to the \gls{dsa:dwq} to which a descriptor is submitted via \gls{x86:movdir64b}. The \gls{dsa:dwq} is therefore more performant but may require access control mechanisms and may only be accessed by one process at a time. \par
To handle the different descriptors, each \gls{dsa:engine} has two internal execution paths. One for a task and the other for a batch descriptor. Processing a task descriptor is straightforward, as all information required to complete the operation are contained within. For a batch, the \gls{dsa} first reads the batch descriptor, then fetches all task descriptors for the batch from memory and processes them. An \gls{dsa:engine} can also trigger a page fault when trying to access an unloaded page and wait on its completion, if configured to do so. Otherwise, an error will be generated in this scenario. \par
Ordering of operations is only guaranteed for a configuration with one \gls{dsa:wq} and one \gls{dsa:engine} in a \gls{dsa:group} when submitting exclusively batch or task descriptors but no mixture. Even then, only write-ordering is guaranteed, meaning that \enquote{reads by a subsequent descriptor can pass writes from a previous descriptor}\cite[30]{intel:dsaspec}. A different issue arises, should an operation fail: the \gls{dsa} will continue to process the following descriptors. Care must therefore be taken with read-after-write scenarios, either by waiting for a successfull completion before submitting the dependant, inserting a drain descriptor for tasks or setting the fence flag for a batch. The latter two methods tell the processing engine that all writes must be commited and, in case of the fence in a batch, abort on previous error. \par
Ordering of operations is only guaranteed for a configuration with one \gls{dsa:wq} and one \gls{dsa:engine} in a \gls{dsa:group} when submitting exclusively batch or task descriptors but no mixture. Even then, only write-ordering is guaranteed, meaning that \enquote{reads by a subsequent descriptor can pass writes from a previous descriptor}\cite[30]{intel:dsaspec}. A different issue arises, should an operation fail: the \gls{dsa} will continue to process the following descriptors. Care must therefore be taken with read-after-write scenarios, either by waiting for a successful completion before submitting the dependant, inserting a drain descriptor for tasks or setting the fence flag for a batch. The latter two methods tell the processing engine that all writes must be committed and, in case of the fence in a batch, abort on previous error. \par
An important aspect of modern computer systems is the separation of address spaces through virtual memory. The \gls{dsa} must therefore handle address translation, as a process submitting a task will not know the physical location in memory which causes the descriptor to contain virtual values. For this, the \gls{dsa:engine} communicates with the \gls{iommu} and \gls{atc} to perform this operation. For this, knowledge about the submitting processes is required, and therefore each task descriptor has a field for the \gls{x86:pasid} which is filled by the \gls{x86:enqcmd} instruction for a \gls{dsa:swq} or set statically after a process is attached to a \gls{dsa:dwq}. \par
The completion of a descriptor may be signaled through a completion record and interrupt, if configured so. For this, the \gls{dsa}\enquote{provides two types of interrupt message storage: (1) an MSI-X table, enumerated through the MSI-X capability; and (2) a device-specific Interrupt Message Storage (IMS) table}\cite[27]{intel:dsaspec}. \par
The completion of a descriptor may be signalled through a completion record and interrupt, if configured so. For this, the \gls{dsa}\enquote{provides two types of interrupt message storage: (1) an MSI-X table, enumerated through the MSI-X capability; and (2) a device-specific Interrupt Message Storage (IMS) table}\cite[27]{intel:dsaspec}. \par
\subsection{Software View}
@ -80,9 +80,9 @@ The completion of a descriptor may be signaled through a completion record and i
\label{fig:dsa-software-arch}
\end{figure}
Due to efforts by intel programmers, since Linux Kernel 5.10 \cite[Installation Instructinos]{intel:dmldoc}, there exists a driver for the \gls{dsa}\cite{intel:idxd-driver-repo} which has no counterpart in the Windows OS-Family \cite[Installation Instructinos]{intel:dmldoc}, meaning code developed without an alternative path will not work there. To interface with the driver and perform configuration operations, intels libaccel-conf \cite{intel:libaccel-config-repo} user space toolset may be used which provides a command-line interface and can read configuration files to set up the device as described previously. After successful configuration, each \gls{dsa:wq} is exposed as a character device by \texttt{mmap} of the associated portal \cite[3]{intel:analysis}. \par
Due to efforts by intel programmers, since Linux Kernel 5.10 \cite[Installation Instructinos]{intel:dmldoc}, there exists a driver for the \gls{dsa}\cite{intel:idxd-driver-repo} which has no counterpart in the Windows OS-Family \cite[Installation Instructinos]{intel:dmldoc}, meaning code developed without an alternative path will not work there. To interface with the driver and perform configuration operations, Intel's libaccel-conf \cite{intel:libaccel-config-repo} user space toolset may be used which provides a command-line interface and can read configuration files to set up the device as described previously. After successful configuration, each \gls{dsa:wq} is exposed as a character device by \texttt{mmap} of the associated portal \cite[3]{intel:analysis}. \par
Given the file permissions, it would now be possible for a process to submit work to the \gls{dsa} via either \gls{x86:movdir64b} or \gls{x86:enqcmd} instructions, providing the descriptors by manually configuring them. This, however, is quite cumbersome, which is why Intels Data Mover Library \cite{intel:dmldoc} exists. With some limitations (like lacking support for \gls{dsa:dwq}s) this library presents a high-level interface that takes care of creation and submission of descriptors, some error handling and reporting. Thanks to the high-level-view the code may choose a different execution path at runtime which allows the memory operations to either be executed in hardware (on a \gls{dsa}) or in software (using equivalent instructions provided by the library) which makes code based upon it automatically compatible with systems that do not provide hardware or software support. \par
Given the file permissions, it would now be possible for a process to submit work to the \gls{dsa} via either \gls{x86:movdir64b} or \gls{x86:enqcmd} instructions, providing the descriptors by manually configuring them. This, however, is quite cumbersome, which is why Intel's Data Mover Library \cite{intel:dmldoc} exists. With some limitations (like lacking support for \gls{dsa:dwq}s) this library presents a high-level interface that takes care of creation and submission of descriptors, some error handling and reporting. Thanks to the high-level-view the code may choose a different execution path at runtime which allows the memory operations to either be executed in hardware or software. The former on an accelerator or the latter using equivalent instructions provided by the library. This makes code based upon it automatically compatible with systems that do not provide hardware support. \par
The task of prefetching is somewhat aligned with that of a cache. As a cache is more generic and allows use beyond Query Driven Prefetching, the choice was made to solve the prefetching offload by implementing an offloading \texttt{Cache}. When refering to the provided implementation, \texttt{Cache} will be used from now on. The interface with \texttt{Cache} must provide three basic functions: requesting a memory block to be cached, accessing a cached memory block and synchronizing cache with the source memory. The latter operation comes in to play when the data that is cached may also be modified, requiring the entry to be updated with the source or the other way around. Due to the many possible setups and use cases, the user should also be responsible for choosing cache placement and the copy method. As re-caching is resource intensive, data should remain in the cache for as long as possible while being removed when system memory pressure due to restrictive memory size drives the \texttt{Cache} to flush unused entries. \par
The task of prefetching is somewhat aligned with that of a cache. As a cache is more generic and allows use beyond Query Driven Prefetching, the choice was made to solve the prefetching offload by implementing an offloading \texttt{Cache}. When referring to the provided implementation, \texttt{Cache} will be used from now on. The interface with \texttt{Cache} must provide three basic functions: requesting a memory block to be cached, accessing a cached memory block and synchronizing cache with the source memory. The latter operation comes in to play when the data that is cached may also be modified, requiring the entry to be updated with the source or the other way around. Due to the many possible setups and use cases, the user should also be responsible for choosing cache placement and the copy method. As re-caching is resource intensive, data should remain in the cache for as long as possible while being removed when memory pressure due to restrictive memory size drives the \texttt{Cache} to flush unused entries. \par
\subsection{Interface}
To allow rapid integration and ease developer workload, a simple interface was chosen. As this work primarily focuses on caching static data, the choice was made only to provide cache invalidation and not synchronization. Given a memory address, \texttt{Cache::Invalidate} will remove all entries for it. The other two operations are provided in one single function, which we shall call \texttt{Cache::Access} henceforth, receiving a data pointer and size it takes care of either submitting a caching operation if the pointer received is not yet cached or returning the cache entry if it is. The cache placement and assignment of the task to accelerators are controlled by the user. In addition to the two basic operations outlined before, the user also is given the option to flush the cache using \texttt{Cache::Flush} of unused elements manually or to clear it completely with \texttt{Cache::Clear}. \par
As caching is performed asynchronously, the user may wish to wait on the operation. This would be beneficial if there are other threads making progress in parallel while the current thread waits on its data becoming available in the faster cache, speeding up local computation. To achieve this, the \texttt{Cache::Access} will return an instance of an object which from hereinafter will be refered to as \texttt{CacheData}. Through \texttt{CacheData::GetDataLocation} a pointer to the cached data will be retrieved, while also providing \texttt{CacheData::WaitOnCompletion} which must only return when the caching operation has completed and during which the current thread is put to sleep, allowing other threads to progress. \par
As caching is performed asynchronously, the user may wish to wait on the operation. This would be beneficial if there are other threads making progress in parallel while the current thread waits on its data becoming available in the faster cache, speeding up local computation. To achieve this, the \texttt{Cache::Access} will return an instance of an object which from hereinafter will be referred to as \texttt{CacheData}. Through \texttt{CacheData::GetDataLocation} a pointer to the cached data will be retrieved, while also providing \texttt{CacheData::WaitOnCompletion} which must only return when the caching operation has completed and during which the current thread is put to sleep, allowing other threads to progress. \par
When multiple consumers wish to access the same memory block through the \texttt{Cache}, we could either provide each with their own entry, or share one entry for all consumers. The first option may cause high load on the accelerator due to multiple copy operations being submited and also increases the memory footprint of the system. The latter option requires synchronization and more complex design. As the cache size is restrictive, the latter was chosen. The already existing \texttt{CacheData} will be extended in scope to handle this by allowing copies of it to be created which must synchronize with each other for \texttt{CacheData::WaitOnCompletion} and \texttt{CacheData::GetDataLocation}. \par
When multiple consumers wish to access the same memory block through the \texttt{Cache}, we could either provide each with their own entry, or share one entry for all consumers. The first option may cause high load on the accelerator due to multiple copy operations being submitted and also increases the memory footprint of the system. The latter option requires synchronization and more complex design. As the cache size is restrictive, the latter was chosen. The already existing \texttt{CacheData} will be extended in scope to handle this by allowing copies of it to be created which must synchronize with each other for \texttt{CacheData::WaitOnCompletion} and \texttt{CacheData::GetDataLocation}. \par
By allowing multiple references to the same entry, memory management becomes a concern. Freeing the allocated block must only take place when all copies of a \texttt{CacheData} instance are destroyed, therefore tying cache entry lifetime to the lifetime of the longest living copy of the original instance. This makes access to the entry legal during the lifetime of any \texttt{CacheData} instance, while also guaranteeing that \texttt{Cache::Clear} will not have any unforseen side effects, as deallocation only takes place when the last consumer has \texttt{CacheData} go out of scope or manually deletes it. \par
By allowing multiple references to the same entry, memory management becomes a concern. Freeing the allocated block must only take place when all copies of a \texttt{CacheData} instance are destroyed, therefore tying cache entry lifetime to the lifetime of the longest living copy of the original instance. This makes access to the entry legal during the lifetime of any \texttt{CacheData} instance, while also guaranteeing that \texttt{Cache::Clear} will not have any unforeseen side effects, as deallocation only takes place when the last consumer has \texttt{CacheData} go out of scope or manually deletes it. \par
\subsection{Usage Restrictions}
As cache invalidation applies mainly to non-static data which this work does not focus on, two restrictions are placed on the invalidation operation. This permits drastically simpler cache design, as a fully coherent cache would require developing a thread safe coherence scheme which is outside our scope. \par
Firstly, overlapping areas in the cache will cause undefined behaviour during invalidation of any one of them. Only the entries with the equivalent source data pointer will be invalidated, while other entries with differing source pointers which, due to their size, still cover the now invalidated region, will not be invalidated and therefore the cache may and may continue to contain invalid elements at this point. \par
Firstly, overlapping areas in the cache will cause undefined behaviour during invalidation of any one of them. Only the entries with the equivalent source pointer will be invalidated, while other entries with differing source pointers which, due to their size, still cover the now invalidated region, will not be invalidated. At this point, the cache may and may continue to contain invalid elements. \par
Secondly, invalidation is to be performed manually, requiring the programmer to remember which points of data are at any given point in time cached and invalidating them upon modification. No ordering guarantees will be given for this situation, possibly leading to threads still having a pointer to now-outdated entries and continuing their progress with this. \par
@ -60,7 +60,7 @@ Due to its reliance on libnuma for memory allocation and thread pinning, \texttt
Compared with the challenges of ensuring correct entry lifetime and thread safety, the application of \gls{dsa} for the task of duplicating data is simple, thanks partly to \gls{intel:dml}\cite{intel:dmldoc}. Upon a call to \texttt{Cache::Access} and determining that the given memory pointer is not present in cache, work will be submitted to the Accelerator. Before, however, the desired location must be determined which the user-defined cache placement policy function handles. With the desired placement obtained, the copy policy function then determines, which nodes should take part in the copy operation which is equivalent to selecting the Accelerators following \ref{subsection:dsa-hwarch}. This causes the work to be split upon the available accelerators to which the work descriptors are submitted at this time. The handlers that \gls{intel:dml}\cite{intel:dmldoc} provides will then be moved to the \texttt{CacheData} instance to permit the callee to wait upon caching completion. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed in \ref{chap:implementation}. \par
Compared with the challenges of ensuring correct entry lifetime and thread safety, the application of \gls{dsa} for the task of duplicating data is simple, thanks partly to \gls{intel:dml}\cite{intel:dmldoc}. Upon a call to \texttt{Cache::Access} and determining that the given memory pointer is not present in cache, work will be submitted to the Accelerator. Before, however, the desired location must be determined which the user-defined cache placement policy function handles. With the desired placement obtained, the copy policy then determines, which nodes should take part in the copy operation which is equivalent to selecting the Accelerators following \ref{subsection:dsa-hwarch}. This causes the work to be split upon the available accelerators to which the work descriptors are submitted at this time. The handlers that \gls{intel:dml}\cite{intel:dmldoc} provides will then be moved to the \texttt{CacheData} instance to permit the callee to wait upon caching completion. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed in \ref{chap:implementation}. \par