6 Commits

  1. BIN
      benchmarks/benchmark-plots/plot-dsa-throughput-scaling.pdf
  2. BIN
      benchmarks/benchmark-plotters/__pycache__/common.cpython-39.pyc
  3. 5
      benchmarks/benchmark-plotters/plot-perf-peakthroughput-bar.py
  4. BIN
      thesis/bachelor.pdf
  5. 3
      thesis/bachelor.tex
  6. 6
      thesis/content/20_state.tex
  7. 35
      thesis/content/30_performance.tex
  8. 53
      thesis/content/40_design.tex
  9. 10
      thesis/content/50_implementation.tex
  10. 2
      thesis/content/70_conclusion.tex
  11. 100
      thesis/images/image-source/design-classdiagram.xml
  12. BIN
      thesis/images/plot-dsa-throughput-scaling.pdf
  13. BIN
      thesis/images/uml-cache-and-cachedata.pdf
  14. 44
      thesis/own.gls

BIN
benchmarks/benchmark-plots/plot-dsa-throughput-scaling.pdf

BIN
benchmarks/benchmark-plotters/__pycache__/common.cpython-39.pyc

5
benchmarks/benchmark-plotters/plot-perf-peakthroughput-bar.py

@@ -113,7 +113,7 @@ def main(node_config):
def get_scaling_factor(baseline,topline,utilfactor):
return (topline / baseline) * (1 / utilfactor)
return (topline / baseline)
if __name__ == "__main__":
@@ -151,8 +151,7 @@ if __name__ == "__main__":
plt.figure(figsize=(2, 3))
fig = sns.lineplot(x=x_dsacount, y=y_scaling, data=scaling_df, marker='o', linestyle='-', color='b', markersize=8)
plt.xticks([1,2,4,8])
plt.yticks([0.25,0.5,0.75,1.0])
plt.xlim(0,9)
plt.ylim(0.2,1.05)
plt.ylim(0.9,2.1)
plt.savefig(os.path.join(output_path, f"plot-dsa-throughput-scaling.pdf"), bbox_inches='tight')
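A note on the change to get_scaling_factor above: the removed variant normalized the speedup by the utilization factor, \(S(n) = (T_n / T_1) \times (1 / n)\), where \(T_n\) denotes the throughput measured with \(n\) participating DSAs, so that perfect linear scaling maps to 1.0. The retained variant, \(S(n) = T_n / T_1\), plots the plain speedup over the single-DSA baseline, which presumably also motivates changing the y-axis limits from [0.2, 1.05] to [0.9, 2.1].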

BIN
thesis/bachelor.pdf

3
thesis/bachelor.tex

@@ -97,9 +97,8 @@ plainpages=false,pdfpagelabels=true]{hyperref}
\appendix
\setglossarystyle{altlistgroup}
\setglossarystyle{altlist}
\printglossaries
\todo{write glossary entries}
\cleardoublepage
\printbibliography

6
thesis/content/20_state.tex

@@ -62,7 +62,8 @@ Applying pipelining, \gls{qdp} processes tasks in parallel and in chunks. Theref
Introduced with the \(4^{th}\) generation of Intel Xeon Scalable Processors, the \gls{dsa} aims to relieve the CPU from \enquote{common storage functions and operations such as data integrity checks and deduplication} \cite[p. 4]{intel:xeonbrief}. To fully utilize the hardware, a thorough understanding of its workings is essential. Therefore, we present an overview of the architecture, software, and the interaction of these two components, delving into the architectural details of the \gls{dsa} itself. All statements are based on Chapter 3 of the Architecture Specification by Intel. \par
\subsection{Hardware Architecture} \label{subsection:dsa-hwarch}
\subsection{Hardware Architecture}
\label{subsection:dsa-hwarch}
\begin{figure}[!t]
\centering
@@ -74,6 +75,7 @@ Introduced with the \(4^{th}\) generation of Intel Xeon Scalable Processors, the
The \gls{dsa} chip is directly integrated into the processor and attaches via the I/O fabric interface, serving as the conduit for all communication. Through this interface, the \gls{dsa} is accessible as a PCIe device. Consequently, configuration utilizes memory-mapped registers set in the device's \gls{bar}. In a system with multiple processing nodes, there may also be one \gls{dsa} per node, resulting in up to four DSA devices per socket in \(4^{th}\) generation Intel Xeon Processors \cite[Sec. 3.1.1]{intel:dsaguide}. To accommodate various use cases, the layout of the \gls{dsa} is software-defined. The structure comprises three components, which we will describe in detail. We also briefly explain how the \gls{dsa} resolves virtual addresses and signals operation completion. Lastly, we will detail operation execution ordering. \par
\subsubsection{Architectural Components}
\label{subsec:state:dsa-arch-comp}
\textsc{Component \rom{1}, \glsentrylong{dsa:wq}:} \glsentryshort{dsa:wq}s provide the means to submit tasks to the device and will be described in more detail shortly. They are marked yellow in Figure \ref{fig:dsa-internal-block}. A \gls{dsa:wq} is accessible through so-called portals, light blue in Figure \ref{fig:dsa-internal-block}, which are mapped memory regions. Submission of work is done by writing a descriptor to one of these. A descriptor is 64 bytes in size and may contain one specific task (task descriptor) or the location of a task array in memory (batch descriptor). Through these portals, the submitted descriptor reaches a queue. There are two possible queue types with different submission methods and use cases. The \gls{dsa:swq} is intended to provide synchronized access to multiple processes and each group may only have one attached. A \gls{pcie-dmr}, which guarantees implicit synchronization, is generated via \gls{x86:enqcmd} and communicates with the device before writing \cite[Sec. 3.3.1]{intel:dsaspec}. This may result in higher submission cost, compared to the \gls{dsa:dwq} to which a descriptor is submitted via \gls{x86:movdir64b} \cite[Sec. 3.3.2]{intel:dsaspec}. \par
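To make the two submission paths concrete, a minimal sketch follows (not from the thesis code). It assumes a 64-byte descriptor and an already mapped portal address, normally set up through the kernel driver; the descriptor layout, the portal handling and the acceptance semantics of the ENQCMD intrinsic are simplified assumptions, not the interface later used through Intel DML.

#include <immintrin.h>
#include <cstdint>

// Hypothetical 64-byte task descriptor; the real layout is defined in the
// DSA Architecture Specification.
struct alignas(64) Descriptor {
    uint8_t bytes[64];
};

// Dedicated Work Queue path: MOVDIR64B writes the descriptor to the portal
// as a posted 64-byte store, giving no feedback on acceptance.
void submit_dwq(void* portal, const Descriptor* desc) {
    _movdir64b(portal, desc);
}

// Shared Work Queue path: ENQCMD is non-posted; a zero return is assumed to
// mean the device accepted the descriptor, otherwise the queue was full and
// the submission must be retried.
bool submit_swq(void* portal, const Descriptor* desc) {
    return _enqcmd(portal, desc) == 0;
}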
@@ -84,6 +86,7 @@ The \gls{dsa} chip is directly integrated into the processor and attaches via th
\textsc{Component \rom{3}, Groups:} Groups tie Engines and \glsentrylong{dsa:wq}s together, indicated by the dotted blue line around the components of Group 0 in Figure \ref{fig:dsa-internal-block}. This means that tasks from one \gls{dsa:wq} may be processed by multiple Engines and vice-versa, depending on the configuration. This flexibility is achieved through the Group Arbiter, represented by the orange block in Figure \ref{fig:dsa-internal-block}, which connects the two components according to the user-defined configuration. \par
\subsubsection{Virtual Address Resolution}
\label{subsubsec:state:dsa-vaddr}
An important aspect of computer systems is the abstraction of physical memory addresses through virtual memory \cite{virtual-memory}. Therefore, the \gls{dsa} must handle address translation because a process submitting a task will not know the physical location in memory of its data, causing the descriptor to contain virtual addresses. To resolve these to physical addresses, the Engine communicates with the \gls{iommu} to perform this operation, as visible in the outward connections at the top of Figure \ref{fig:dsa-internal-block}. Knowledge about the submitting processes is required for this resolution. Therefore, each task descriptor has a field for the \gls{x86:pasid} which is filled by the \gls{x86:enqcmd} instruction for a \gls{dsa:swq} \cite[Sec. 3.3.1]{intel:dsaspec} or set statically after a process is attached to a \gls{dsa:dwq} \cite[Sec. 3.3.2]{intel:dsaspec}. \par
@@ -93,6 +96,7 @@ An important aspect of computer systems is the abstraction of physical memory ad
The status of an operation on the \gls{dsa} is available in the form of a record, which is written to a memory location specified in the task descriptor. Applications can check for a change in value in this record to determine completion. Additionally, completion may be signalled by an interrupt. To facilitate this, the \gls{dsa} \enquote{provides two types of interrupt message storage: (1) an MSI-X table, enumerated through the MSI-X capability; and (2) a device-specific Interrupt Message Storage (IMS) table} \cite[Sec. 3.7]{intel:dsaspec}. \par
\subsubsection{Ordering Guarantees}
\label{subsubsec:state:ordering-guarantees}
Ordering of operations is only guaranteed for a configuration with one \gls{dsa:wq} and one Engine in a Group when exclusively submitting batch or task descriptors but no mixture. Even in such cases, only write-ordering is guaranteed, implying that \enquote{reads by a subsequent descriptor can pass writes from a previous descriptor}. Challenges arise when an operation fails, as the \gls{dsa} will continue to process the following descriptors from the queue. Consequently, caution is necessary in read-after-write scenarios. This can be addressed by either waiting for successful completion before submitting the dependent descriptor, inserting a drain descriptor for tasks, or setting the fence flag for a batch. The latter two methods inform the processing engine that all writes must be committed, and in case of the fence in a batch, to abort on previous error. \cite[Sec. 3.9]{intel:dsaspec} \par

35
thesis/content/30_performance.tex

@@ -42,9 +42,7 @@ For all \gls{dsa}s used in the benchmark, a submission thread executing the inne
\section{Benchmarks}
\label{sec:perf:bench}
In this Section, we will present three benchmarks, each accompanied by setup information and a preview. We will then provide plots displaying the results, followed by a detailed analysis. We will formulate expectations and compare them with the observations from our measurements. \par
\todo{reformulate}
In this section, we will introduce three benchmarks, providing setup information and a brief overview of each. Subsequently, we will present plots illustrating the results, followed by a comprehensive analysis. Our approach involves formulating expectations and juxtaposing them with the observations derived from our measurements. \par
\subsection{Submission Method}
\label{subsec:perf:submitmethod}
@@ -87,7 +85,7 @@ Moving data from \glsentryshort{dram} to \gls{hbm} is most relevant to the rest
\[8800\ MT/s \times 8B/T = 70400 \times 10^6 B/s = 65.56\ GiB/s\]
From the observed bandwidth limitation of a single \gls{dsa} situated at about \(30\ GiB/s\) (see Section \ref{subsec:perf:submitmethod}) and the available memory bandwidth of \(65.56 GiB/s\), we conclude that a copy task has to be split across multiple \gls{dsa}s to achieve peak throughput. Different methods of splitting will be evaluated. Given that our system consists of multiple sockets, communication crossing between sockets could introduce latency and bandwidth disadvantages \cite{bench:heterogeneous-communication}, which we will also evaluate. Beyond two \gls{dsa}, marginal gains are to be expected, due to the throughput limitation of the available memory. \par
From the observed bandwidth limitation of a single \gls{dsa} situated at about \(30\ GiB/s\) (see Section \ref{subsec:perf:submitmethod}) and the available memory bandwidth of \(65.56\ GiB/s\), we conclude that a copy task has to be split across multiple \gls{dsa}s to achieve peak throughput. Different methods of splitting will be evaluated. Given that our system consists of multiple sockets, communication crossing between sockets could introduce latency and bandwidth disadvantages \cite{bench:heterogeneous-communication}, which we will also evaluate. Beyond two \gls{dsa}s, marginal gains are to be expected, due to the throughput limitation of the available memory. \par
To determine the optimal amount of \gls{dsa}s, we will measure throughput for one, two, four, and eight participating in the copy operations. We name the utilization of two \gls{dsa}s \enquote{Push-Pull}, as with two accelerators, we utilize the ones found on data source and destination \gls{numa:node}. As eight \gls{dsa}s is the maximum available on our system, this configuration will be referred to as \enquote{brute-force}. \par
@@ -144,22 +142,18 @@ For the results of the Brute-Force approach illustrated in Figure \ref{fig:perf-
\begin{subfigure}[t]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-dsa-throughput-scaling.pdf}
\caption{Scaling Factor for different amounts of participating \gls{dsa}. Determined by formula \(Throughput\ /\ Baseline \times 1\ /\ Utilization\) with the Baseline being Throughput for one \gls{dsa} and the Utilization representing the factor of the amount of \gls{dsa}s being used over the baseline.}
\caption{Scaling Factor for different amounts of participating \gls{dsa}. Calculated by dividing the throughput for a configuration by the throughput achieved with one \gls{dsa}. Linear scaling is desirable to not waste resources, therefore only the configurations with one and two \gls{dsa}s are determined to be effective.}
\label{fig:perf-dsa-analysis:scaling}
\end{subfigure}
\caption{Scalability Analysis for different amounts of participating \gls{dsa}s. Displays the average throughput and the derived scaling factor. Shows that, although the throughput does increase with adding more accelerators, beyond two, the gained speed drops significantly. Calculated over the results from Figure \ref{fig:perf-dsa} and therefore applies to copies from \glsentryshort{dram} to \glsentryshort{hbm}.}
\label{fig:perf-dsa-analysis}
\end{figure}
\todo{tp scaling shows assumption: limited benefit of adding more than 2 dsa}
When comparing the Brute-Force approach with Push-Pull in Figure \ref{fig:perf-dsa:2}, performance decreases by utilizing four times more resources over a longer duration. As shown in Figure \ref{fig:perf-dsa-analysis:scaling}, using Brute-Force still leads to a slight increase in overall throughput, although far from scaling linearly. Therefore, we conclude that, although data movement across the interconnect incurs additional cost, no hard bandwidth limit is observable. \todo{we state that it decreases but then that it increases} \par
When comparing the Brute-Force approach with Push-Pull in Figure \ref{fig:perf-dsa-analysis:scaling}, average performance decreases by utilizing four times more resources over a longer duration. As shown in Figure \ref{fig:perf-dsa:2}, using Brute-Force still leads to a slight increase in throughput for inter-socket operations, although far from scaling linearly. Therefore, we conclude that, although data movement across the interconnect incurs additional cost, no hard bandwidth limit is observable, reaching the same peak speed also observed for intra-socket with four \gls{dsa}s. This might point to an architectural advantage, as we will encounter the expected speed reduction for copies crossing the socket boundary when executed on the CPU in Section \ref{subsec:perf:cpu-datacopy}. \par
From the average throughput and scaling factors in Figure \ref{fig:perf-dsa-analysis}, it becomes evident that splitting tasks over more than two \gls{dsa}s yields only marginal gains. This could be due to increased congestion of the overall interconnect; however, as no hard limit is encountered, this is not a definitive answer. \par
The choice of a load balancing method is not trivial. If peak throughput of one task is of relevance, Brute-Force for inter-socket and four \gls{dsa}s for intra-socket operation result in the fastest transfers. At the same time, these cause high system utilization, making them unsuitable for situations where multiple tasks may be submitted. For these cases, Push-Pull achieves performance close to the real-world peak while also not wasting resources due to poor scaling — see Figure \ref{fig:perf-dsa-analysis:scaling}. \par
\todo{dont choose brute here, 4dsa is best}
The choice of a load balancing method is not trivial. Consulting Figure \ref{fig:perf-dsa-analysis:average}, the highest throughput is achieved by using four \gls{dsa}s. At the same time, this causes high system utilization, making it unsuitable for situations where resources are to be distributed among multiple control flows. For this case, Push-Pull achieves performance close to the real-world peak while also not wasting resources due to poor scaling (see Figure \ref{fig:perf-dsa-analysis:scaling}). \par
\subsection{Data Movement using CPU}
\label{subsec:perf:cpu-datacopy}
@@ -185,30 +179,23 @@ For evaluating CPU copy performance we use the benchmark code from the previous
\label{fig:perf-cpu}
\end{figure}
As evident from Figure \ref{fig:perf-cpu:swpath}, the observed throughput of software path is less than half of the theoretical bandwidth. Therefore, software path is to be treated as a compatibility measure, and not for providing high performance data copy operations. As the sole benchmark, the software path however performs as expected for transfers to \gls{numa:node}s 12 and 15, with the latter performing worse \todo{false statement}. Taking the layout from Figure \ref{fig:perf-xeonmaxnuma}, back in the previous Section, we assumed that \gls{numa:node} 12 would outperform \gls{numa:node} 15 due to lower physical distance. This assumption was invalidated, making the result for CPU in this case unexpected. \par
In Figure \ref{fig:perf-cpu:andrepeak}, peak throughput is achieved for intra-node operation. This validates the assumption that there is a cost for communicating across sockets, which was not as directly observable with the \gls{dsa}. The same disadvantage for \gls{numa:node} 12, as seen in Section \ref{subsec:perf:datacopy} can be observed in Figure \ref{fig:perf-cpu:andrepeak}. As the results from software path do not exhibit this, the anomaly seems to only occur in bandwidth-saturating scenarios \todo{false they exhibit it, but not as pronounced}. \par
\todo{ensure correctness}
As evident from Figure \ref{fig:perf-cpu:swpath}, the observed throughput of software path is less than half of the theoretical bandwidth. Therefore, software path is to be treated as a compatibility measure, and not for providing high performance data copy operations. In Figure \ref{fig:perf-cpu:andrepeak}, peak throughput is achieved for intra-node operation, validating the assumption that there is a cost for communicating across sockets, which was not as directly observable with the \gls{dsa}. The same disadvantage for \gls{numa:node} 12, as observed in Section \ref{subsec:perf:datacopy}, can be seen in Figure \ref{fig:perf-cpu}. This points to an architectural anomaly which we could not explain with our knowledge or benchmarks. Further benchmarks were conducted for this, not yielding conclusive results, as the source and copy-thread location did not seem to affect the observed speed delta. \par
\section{Analysis}
\label{sec:perf:analysis}
In this section we summarize the conclusions drawn from the three benchmarks performed in the sections above and outline a utilization guideline \todo{we dont do this, either write it or leave this out}. We also compare CPU and \gls{dsa} for the task of copying data from \gls{dram} to \gls{hbm} \todo{weird wording}. \par
In this section we summarize the conclusions from the performed benchmarks, outlining a utilization guideline. \par
\begin{itemize}
\item From \ref{subsec:perf:submitmethod} we conclude that small copies under \(1\ MiB\) in size require batching and still do not reach peak performance. Task size should therefore be at or above \(1\ MiB\) and otherwise use the CPU \todo{why otherwise cpu, no direct link given}.
\item From \ref{subsec:perf:submitmethod} we conclude that small copies under \(1\ MiB\) in size require batching and still do not reach peak performance. Task size should therefore be at or above \(1\ MiB\). Otherwise, offloading might prove more expensive than performing the copy on CPU.
\item Section \ref{subsec:perf:mtsubmit} assures that access from multiple threads does not negatively affect the performance when using \glsentrylong{dsa:swq} for work submission. Due to the lack of \glsentrylong{dsa:dwq} support, we have no data to determine the cost of submission to the \gls{dsa:swq}.
\item In \ref{subsec:perf:datacopy}, we found that using more than two \gls{dsa}s results in only marginal gains. The choice of a load balancer therefore is the Push-Pull configuration, as it achieves fair throughput with low utilization.
\item Combining the results from Sections \ref{subsec:perf:datacopy} and \ref{subsec:perf:submitmethod}, we posit that for situations with smaller transfer sizes and a high number of tasks, splitting a copy might prove disadvantageous. Due to incurring more delay from submission and overall throughput still remaining high without the split due to queue filling (see Section \ref{subsec:perf:mtsubmit}), the split might reduce overall effectiveness. To utilize the available resources effectively, distributing tasks across the available \gls{dsa}s is still desirable. This finding led us to implement round-robin balancing in Section \ref{sec:impl:application}.
\end{itemize}
Once again, we refer to Figures \ref{fig:perf-dsa} and \ref{fig:perf-cpu}, both representing the maximum throughput achieved with the utilization of either \gls{dsa} for the former and CPU for the latter. Noticeably, the \gls{dsa} does not seem to suffer from inter-socket overhead like the CPU. In any case, \gls{dsa} performs similar to the CPU, demonstrating potential for faster data movement while simultaneously freeing up cycles for other tasks. \todo{ensure correctness and reformulate} \par
We discovered an anomaly for \gls{numa:node} 12 for which we did not find an explanation. As the behaviour is also exhibited by the CPU, discovering the root issue falls outside the scope of this work. \par
\todo{mention measures undertaken to find an explanation here}
Once again, we refer to Figures \ref{fig:perf-dsa} and \ref{fig:perf-cpu}, both representing the maximum throughput achieved with the utilization of either \gls{dsa} for the former and CPU for the latter. Noticeably, the \gls{dsa} does not seem to suffer from inter-socket overhead like the CPU. The \gls{dsa} performs similarly to the CPU for intra-node data movement, while outperforming it in inter-node scenarios. The latter, as mentioned in Section \ref{subsec:perf:datacopy}, might point to an architectural advantage of the \gls{dsa}. The performance observed in the above benchmarks demonstrates potential for rapid data movement while simultaneously relieving the CPU of this task and thereby freeing up capacity for computation. \par
Even though we could not find an explanation for all measurements, this chapter still gives insight into the performance of the \gls{dsa}, its strengths and its weaknesses. It provides data-driven guidance on a complex architecture, helping to find the optimum for applying the \gls{dsa} to our expected and possibly different workloads. \par
We encountered an anomaly on \gls{numa:node} 12 for which we were unable to find an explanation. Since this behaviour is also observed on the CPU, identifying the root cause falls beyond the scope of this work. Despite being unable to account for all measurements, this chapter still offers valuable insights into the performance of the \gls{dsa}, highlighting both its strengths and weaknesses. It provides data-driven guidance for a complex architecture, aiding in determining the optimal approach for utilizing the \gls{dsa}. \par
%%% Local Variables:
%%% TeX-master: "diplom"

53
thesis/content/40_design.tex

@@ -18,13 +18,9 @@
% wohl mindestens 8 Seiten haben, mehr als 20 können ein Hinweis darauf
% sein, daß das Abstraktionsniveau verfehlt wurde.
In this chapter we design a class interface for use as a general purpose cache. We will present challenges and then detail the solutions employed to face them, finalizing the architecture with each step. Details on the implementation of this blueprint will be omitted, as we discuss a selection of relevant aspects in Chapter \ref{chap:implementation}. \par
In this chapter, we formulate a class interface for a general-purpose cache. We will outline the requirements and elucidate the solutions employed to address them, culminating in the final architecture. Details pertaining to the implementation of this blueprint will be deferred to Chapter \ref{chap:implementation}, where we delve into a selection of relevant aspects. \par
\section{Cache Design} \label{sec:design:cache}
The task of prefetching is somewhat aligned with that of a cache. As a cache is more generic and allows use beyond \gls{qdp}, the decision was made to address the prefetching in \gls{qdp} by implementing an offloading \texttt{Cache}. Henceforth, when referring to the provided implementation, we will use \texttt{Cache}. \todo{reformulate} \par
The interface of \texttt{Cache} must provide three basic functions: (1) requesting a memory block to be cached, (2) accessing a cached memory block and (3) synchronizing cache with the source memory. The latter operation comes in to play when the data that is cached may also be modified, necessitating synchronization. Due various setups and use cases for this cache, the user should also be responsible for choosing cache placement and the copy method. As re-caching is resource intensive, data should remain in the cache for as long as possible. We only flush entries, when lack of free cache memory requires it. \par
The target application of code contributed by this work is to accelerate \glsentrylong{qdp} by offloading copy operations to the \gls{dsa}. Prefetching is inherently related with cache functionality. Given that an application providing the latter offers a broader scope of utility beyond \gls{qdp}, we opted to implement an offloading \texttt{Cache}. \par
\begin{figure}[!t]
\centering
@@ -33,45 +29,42 @@ The interface of \texttt{Cache} must provide three basic functions: (1) requesti
\label{fig:impl-design-interface}
\end{figure}
\subsection{Interface}
\todo{mention custom malloc for outsourcing allocation to user}
\todo{give better intro for this subsection}
\section{Interface}
\label{sec:design:interface}
Given that this work primarily focuses on caching static data, we only provide cache invalidation and not synchronization. The \texttt{Cache::Invalidate} function, given a memory address, will remove all entries for it from the cache. The other two operations, caching and access, are provided in one single function, which we shall henceforth call \texttt{Cache::Access}. This function receives a data pointer and size as parameters and takes care of either submitting a caching operation if the pointer received is not yet cached or returning the cache entry if it is. The user retains control over cache placement and the assignment of tasks to accelerators through mechanisms outlined in \ref{sec:design:accel-usage}. This interface is represented on the right block of Figure \ref{fig:impl-design-interface} labelled \enquote{Cache} and includes some additional operations beyond the basic requirements. \par
The interface of \texttt{Cache} must provide three basic functions: (1) requesting a memory block to be cached, (2) accessing a cached memory block and (3) synchronizing the cache with the source memory. The latter operation comes into play when the data that is cached may also be modified, necessitating synchronization. Due to various setups and use cases for this cache, the user should also be responsible for choosing cache placement and the copy method. As re-caching is resource intensive, data should remain in the cache for as long as possible. We only flush entries when a lack of free cache memory requires it. \par
Given the asynchronous nature of caching operations, users may opt to await their completion. This proves particularly beneficial when parallel threads are actively processing, and the current thread strategically pauses until its data becomes available in faster memory, thereby optimizing access speeds for local computations. \par
Given that this work primarily focuses on caching static data, we only provide cache invalidation and not synchronization. The \texttt{Cache::Invalidate} function, given a memory address, will remove all entries for it from the cache. The other two operations, caching and access, are provided in one single function, which we shall henceforth call \texttt{Cache::Access}. This function receives a data pointer and size as parameters and takes care of either submitting a caching operation if the pointer received is not yet cached or returning the cache entry if it is. \par
To facilitate this process \todo{not only for this, cachedata is also needed to allow async operation}, the \texttt{Cache::Access} method returns an instance of an object referred to as \texttt{CacheData}. Figure \ref{fig:impl-design-interface} documents the public interface for \texttt{CacheData} on the left block labelled as such. Invoking \texttt{CacheData::GetDataLocation} provides access to a pointer to the location of the cached data. Additionally, the \texttt{CacheData::WaitOnCompletion} method is available, designed to return only upon the completion of the caching operation. During this period, the current thread will sleep, allowing unimpeded progress for other threads. To ensure that only pointers to valid memory regions are returned, this function must be called in order to update the cache pointer. It queries the completion state of the operation, and, on success, updates the cache pointer to the then available memory region. \par
Given the asynchronous nature of caching operations, users may opt to await their completion. This proves particularly beneficial when parallel threads are actively processing, and the current thread strategically pauses until its data becomes available in faster memory, thereby optimizing access speeds for local computations. To facilitate this process, the \texttt{Cache::Access} method returns an instance of an object referred to as \texttt{CacheData}. Figure \ref{fig:impl-design-interface} documents the public interface for \texttt{CacheData} on the left block labelled as such. Invoking \texttt{CacheData::GetDataLocation} provides access to a pointer to the location of the cached data. Additionally, the \texttt{CacheData::WaitOnCompletion} method is available, designed to return only upon the completion of the caching operation. During this period, the current thread will sleep, allowing unimpeded progress for other threads. To ensure that only pointers to valid memory regions are returned, this function must be called in order to update the cache pointer which otherwise has an undefined value. It queries the completion state of the operation, and, on success, updates the cache pointer to the then available memory region. \par
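As an illustration of the access pattern just described, a short sketch follows (not from the thesis code). It assumes the Cache and CacheData declarations from the class diagram in Figure fig:impl-design-interface, in particular std::unique_ptr<CacheData> Cache::Access(uint8_t*, size_t); initialization of the cache is omitted.

#include <cstddef>
#include <cstdint>
#include <memory>

// Sketch only: Cache / CacheData are the interfaces designed in this chapter
// and assumed to be declared elsewhere.
void process_column(Cache& cache, uint8_t* column, size_t size) {
    // Either submits a caching operation or returns the already existing entry.
    std::unique_ptr<CacheData> entry = cache.Access(column, size);

    // Block until the asynchronous copy has finished; without this call the
    // cache pointer below would still be undefined.
    entry->WaitOnCompletion();

    uint8_t* cached = entry->GetDataLocation();
    // ... operate on 'cached', falling back to 'column' if caching failed ...
}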
\subsection{Cache Entry Reuse} \label{subsec:design:cache-entry-reuse}
\subsection{Policy Functions}
\label{subsec:design:policy-functions}
When multiple consumers wish to access the same memory block through the \texttt{Cache}, we face a choice between providing each with their own entry or sharing one for all consumers. The first option may lead to high load on the accelerator due to multiple copy operations being submitted and also increases the memory footprint of the cache. The latter option, although more complex, was chosen to address these concerns. To implement this, the existing \texttt{CacheData} will be extended in scope to handle multiple consumers. Copies of it can be created, and they must synchronize with each other for \texttt{CacheData::WaitOnCompletion} and \texttt{CacheData::GetDataLocation}. This is illustrated by the green markings, indicating thread safety guarantees for access, in Figure \ref{fig:impl-design-interface}. \par
In the introduction of this chapter, we mentioned that cache placement and the selection of copy-participating \gls{dsa}s are the responsibility of the user. As we will find out in Section \ref{sec:impl:application}, allocating memory inside the cache is not feasible due to possible delays encountered. Therefore, the user is also required to provide functionality for dynamic memory management to the \texttt{Cache}. The former is realized by what we will call \enquote{Policy Functions}, which are function pointers passed on initialization, as visible in \texttt{Cache::Init} in Figure \ref{fig:impl-design-interface}. We use the same methodology for the latter, requiring that function pointers performing dynamic memory management be passed. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed for the implementation in Chapter \ref{chap:implementation}, while we detail their required behaviour here. \par
\subsection{Cache Entry Lifetime} \label{subsec:design:cache-entry-lifetime}
The policy functions receive parameters deemed sensible for determining placement and participation selection. Both are informed of the source \glsentryshort{node}, the \glsentryshort{node} requesting caching, and the data size. The cache placement policy then returns a \glsentryshort{node}-ID on which the data is to be cached, while the copy policy will provide the cache with a list of \glsentryshort{node}-IDs, detailing which \gls{dsa}s should participate in the operation. \par
Allowing multiple references to the same entry introduces concerns regarding memory management. The allocated block should only be freed when all copies of a \texttt{CacheData} instance are destroyed, thereby tying the cache entry's lifetime to the longest living copy of the original instance. This ensures that access to the entry is legal during the lifetime of any \texttt{CacheData} instance. Therefore, deallocation only occurs when the last copy of a \texttt{CacheData} instance is destroyed. \par
\subsection{Usage Restrictions} \label{subsec:design:restrictions}
For memory management, two functions are required, providing allocation (\texttt{malloc}) and deallocation (\texttt{free}) capabilities to the \texttt{Cache}. Following the naming scheme, these two functions must adhere to the thread safety guarantees and behaviour set forth by the C++ standard. Most notably, \texttt{malloc} must never return the same address for subsequent or concurrent calls. \par
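One possible shape for these user-provided function pointers, using the type names visible in Cache::Init in the class diagram (CachePolicy, CopyPolicy, MemAlloc, MemFree); the signatures themselves are assumptions derived from the description above, the concrete definitions belong to the implementation chapter.

#include <cstddef>
#include <vector>

// Cache placement policy: given the data's source node, the node requesting
// caching and the data size, return the node the data should be cached on.
using CachePolicy = int (*)(int source_node, int requesting_node, size_t size);

// Copy policy: return the list of nodes whose DSA should participate in the copy.
using CopyPolicy = std::vector<int> (*)(int source_node, int requesting_node, size_t size);

// Memory management: malloc/free-like and thread safe; a NUMA-aware allocator
// could be wrapped behind this pair.
using MemAlloc = void* (*)(size_t size);
using MemFree = void (*)(void* pointer);

// Trivial example policies: cache on the requesting node and let only the
// DSA of the source node perform the copy.
int place_on_requester(int, int requesting_node, size_t) { return requesting_node; }
std::vector<int> copy_with_source(int source_node, int, size_t) { return { source_node }; }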
The cache, in the context of this work, primarily handles static data. Therefore, two restrictions are placed on the invalidation operation. This decision results in a drastically simpler cache design, as implementing a fully coherent cache would require developing a thread-safe coherence scheme, which is beyond the scope of our work. \par
\subsection{Cache Entry Reuse}
\label{subsec:design:cache-entry-reuse}
Firstly, overlapping areas in the cache will result in undefined behaviour during the invalidation of any one of them. Only the entries with the equivalent source pointer will be invalidated, while other entries with differing source pointers, which due to their size, still cover the now invalidated region, will remain unaffected. At this point, the cache may or may not continue to contain invalid elements. \par
When multiple consumers wish to access the same memory block through the \texttt{Cache}, we face a choice between providing each with their own entry or sharing one for all consumers. The first option may lead to high load on the accelerator due to multiple copy operations being submitted and also increases the memory footprint of the cache. The latter option, although more complex, was chosen to address these concerns. To implement this, the existing \texttt{CacheData} will be extended in scope to handle multiple consumers. Copies of it can be created, and they must synchronize with each other for \texttt{CacheData::WaitOnCompletion} and \texttt{CacheData::GetDataLocation}. This is illustrated by the green markings, indicating thread safety guarantees for access, in Figure \ref{fig:impl-design-interface}. The \texttt{Cache} must therefore also ensure that, on concurrent access to the same resource, only one Thread creates \texttt{CacheData} while the others are provided with a copy. \par
Secondly, invalidation is a manual process, requiring the programmer to remember which points of data are currently cached and to invalidate them upon modification. No ordering guarantees are provided in this situation, potentially leading to threads still holding pointers to now-outdated entries and continuing their progress with this data. \par
\subsection{Cache Entry Lifetime}
\label{subsec:design:cache-entry-lifetime}
Due to its reliance on libnuma for memory allocation, \texttt{Cache} is exclusively compatible with systems where this library is available. It is important to note that Windows platforms use their own API for this purpose, which is incompatible with libnuma, rendering the code non-executable on such systems \cite{microsoft:numa-malloc}. \par
Allowing multiple references to the same entry introduces concerns regarding memory management. The allocated block should only be freed when all copies of a \texttt{CacheData} instance are destroyed, thereby tying the cache entry's lifetime to the longest living copy of the original instance. This ensures that access to the entry is legal during the lifetime of any \texttt{CacheData} instance. Therefore, deallocation only occurs when the last copy of a \texttt{CacheData} instance is destroyed. \par
\todo{remove the libnuma restriction}
\todo{mention undefined behaviour when writing to cache source before having waited for caching operation completion}
\todo{mention undefined behaviour through read-after-write pass according to ordering guarantees by dsa}
\section{Usage Restrictions}
\label{sec:design:restrictions}
\section{Accelerator Usage}
\label{sec:design:accel-usage}
In the context of this work, the cache primarily manages static data, leading to two restrictions placed on the invalidation operation, allowing significant reductions in design complexity. Firstly, due to the cache's design, overlapping areas in the cache will lead to undefined behaviour during the invalidation of any one of them. Only entries with equivalent source pointers will be invalidated, while other entries with differing source pointers, still covering the now-invalidated region due to their size, will remain unaffected. Consequently, the cache may or may not continue to contain invalid elements. Secondly, invalidation is a manual process, necessitating the programmer to recall which data points are currently cached and to invalidate them upon modification. In this scenario, no ordering guarantees are provided, potentially resulting in threads still holding pointers to now-outdated entries and continuing their progress with this data. \par
Compared with the challenges of ensuring correct entry lifetime and thread safety, the application of \gls{dsa} for the task of duplicating data is relatively straightforward, thanks in part to \gls{intel:dml} \cite{intel:dmldoc}. Upon a call to \texttt{Cache::Access} and determining that the given memory pointer is not present in the cache, work is submitted to the accelerator. However, before proceeding, the desired location for the cache entry must be determined which the user-defined cache placement policy function handles. Once the desired placement is obtained, the copy policy then determines, which nodes should participate in the copy operation. Following Section \ref{subsection:dsa-hwarch}, this is equivalent to selecting the accelerators. The copy tasks are distributed across the participating nodes. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed in Chapter \ref{chap:implementation}. \par
Additionally, the cache inherits some restrictions due to its utilization of the \gls{dsa}. As mentioned in Section \ref{subsubsec:state:ordering-guarantees}, only write-ordering is guaranteed under specific circumstances. Although we configured the necessary parameters for this guarantee in Section \ref{sec:state:setup-and-config}, load balancing over multiple \gls{dsa}s, as described in Section \ref{sec:impl:application}, can introduce scenarios where writes to the same address may be submitted on different accelerators. As the ordering guarantee is provided on only one \gls{dsa}, undefined behaviour can occur in \enquote{multiple-writers}, in addition to the \enquote{read-after-write} scenarios. However, due to the constraints outlined in Section \ref{sec:design:interface}, the \enquote{multiple-writers} scenario is prevented by ensuring that only one thread can perform the caching task for a given datum. Moreover, the requirement for user-provided memory management functions to be thread-safe (Section \ref{subsec:design:policy-functions}) ensures that two concurrent cache accesses will never receive the same memory region for their task. These two guarantees in conjunction secure the cache's integrity. Hence, the only relevant scenario is \enquote{read-after-write}, which is also accounted for since the cache pointer is updated by \texttt{CacheData::WaitOnCompletion} only when all operations have concluded. Situations where a caching task (read) depends on the results of another task (write) are thereby prevented. \par
\todo{modify naming, possible outsource policy functions, mention memory policy}
Despite accounting for the complications in operation ordering, one potential situation may still lead to undefined behaviour. Since the \gls{dsa} operates asynchronously, modifying data for a region present in the cache before ensuring that all caching operations have completed through a call to \texttt{CacheData::WaitOnCompletion} will result in an undefined state for the cached region. Therefore, it is imperative to explicitly wait for data present in the cache to avoid such scenarios. \par
%%% Local Variables:
%%% TeX-master: "diplom"

10
thesis/content/50_implementation.tex

@@ -20,19 +20,17 @@
% nichts zu suchen, auch nicht im Anhang, sondern gehören auf Rechner,
% auf denen man sie sich ansehen kann.
In this chapter, we concentrate on specific implementation details, offering an in-depth view of how the design promises outlined in Chapter \ref{chap:design} are realized. Firstly, we delve into the usage of locking and atomics to achieve thread safety. Finally, we apply the cache to \glsentrylong{qdp}, detailing the policies mentioned in Section \ref{sec:design:accel-usage} and presenting solutions for the challenges encountered. \par
\todo{potentially mention that most of what this chapter discusses only is internal and therefore might affect but not necessitate action by the user}
In this chapter, we concentrate on specific implementation details, offering an in-depth view of how the design promises outlined in Chapter \ref{chap:design} are realized. Firstly, we delve into the usage of locking and atomics to achieve thread safety. Finally, we apply the cache to \glsentrylong{qdp}, detailing the policies mentioned in Section \ref{subsec:design:policy-functions} and presenting solutions for the challenges encountered.\par
\section{Synchronization for Cache and CacheData}
The usage of locking and atomics to achieve safe concurrent access has proven to be challenging. Their use is performance-critical, and mistakes may lead to deadlock. Consequently, these aspects constitute the most interesting part of the implementation, which is why this chapter will extensively focus on the details of their implementation. \par
Throughout the following sections we will use the term \enquote{handler} \todo{add backref to state 2.4 where handler is also mentioned}, which was coined by \gls{intel:dml}, referring to an object associated with an operation on the accelerator. Through it, the state of a task may be queried, making the handler our connection to the asynchronously executed task. As we may split up one single copy into multiple distinct tasks for submission to multiple \gls{dsa}s, \texttt{CacheData} internally contains a vector of multiple of these handlers. \par
Throughout the following sections we will use the term \enquote{handler}, which was coined by \gls{intel:dml}, referring to an object associated with an operation on the accelerator. Through it, the state of a task may be queried, making the handler our connection to the asynchronously executed task. Use of a handler is also displayed in the \texttt{memcpy}-function for the \gls{dsa} as shown in Figure \ref{fig:dml-memcpy}. As we may split up one single copy into multiple distinct tasks for submission to multiple \gls{dsa}s, \texttt{CacheData} internally contains a vector of multiple of these handlers. \par
\subsection{Cache: Locking for Access to State} \label{subsec:implementation:cache-state-lock}
To keep track of the current cache state the \texttt{Cache} will hold a reference to each currently existing \texttt{CacheData} instance. The reason for this is twofold: In Section \ref{sec:design:cache} we decided to keep elements in the cache until forced by \gls{mempress} to remove them. Secondly in Section \ref{subsec:design:cache-entry-reuse} we decided to reuse one cache entry for multiple consumers. The second part requires access to the structure holding this reference to be thread safe when accessing and modifying the cache state in \texttt{Cache::Access}, \texttt{Cache::Flush} and \texttt{Cache::Clear}. The latter two both require unique locking, preventing other calls to \texttt{Cache} from making progress while the operation is being processed. For \texttt{Cache::Access} the use of locking depends upon the caches state. At first, only a shared lock is acquired for checking whether the given address already resides in cache, allowing other \texttt{Cache::Access}-operations to also perform this check. If no entry for the region is present, a unique lock is required as well when adding the newly created entry to cache. \par
To keep track of the current cache state, the \texttt{Cache} will hold a reference to each currently existing \texttt{CacheData} instance. The reason for this is twofold: firstly, in Section \ref{sec:design:interface} we decided to keep elements in the cache until forced by \gls{mempress} to remove them; secondly, in Section \ref{subsec:design:cache-entry-reuse} we decided to reuse one cache entry for multiple consumers. The second part requires access to the structure holding this reference to be thread safe when accessing and modifying the cache state in \texttt{Cache::Access}, \texttt{Cache::Flush} and \texttt{Cache::Clear}. The latter two both require unique locking, preventing other calls to \texttt{Cache} from making progress while the operation is being processed. For \texttt{Cache::Access} the use of locking depends upon the cache's state. At first, only a shared lock is acquired for checking whether the given address already resides in the cache, allowing other \texttt{Cache::Access}-operations to also perform this check. If no entry for the region is present, a unique lock is required as well when adding the newly created entry to the cache. \par
A map data structure was chosen to represent the current cache state, with the memory address of the entry as key and the \texttt{CacheData} instance as value. As the caching policy is controlled by the user, one datum may be requested for caching in multiple locations. To accommodate this, one map is allocated for each available \glsentrylong{numa:node} of the system. This can be exploited to reduce lock contention by separately locking each \gls{numa:node}'s state instead of utilizing a global lock. This ensures that \texttt{Cache::Access} and the implicit \texttt{Cache::Flush} it may cause cannot hinder progress of caching operations on other \gls{numa:node}s. Both \texttt{Cache::Clear} and a complete \texttt{Cache::Flush} as callable by the user will now iteratively perform their respective task per \gls{numa:node} state, also allowing other \gls{numa:node}s to progress. \par
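A condensed sketch of this lookup path (not from the thesis code), under the assumption that the per-node state is a map guarded by a std::shared_mutex; in the actual design the CacheData entry itself carries the shared state, for which a std::shared_ptr stands in here.

#include <cstddef>
#include <cstdint>
#include <memory>
#include <shared_mutex>
#include <unordered_map>

// Assumed per-NUMA-node cache state: an independently locked map from the
// source address to the associated CacheData instance (CacheData is the
// class designed in Chapter 4 and assumed to be declared elsewhere).
struct NodeCacheState {
    std::shared_mutex mutex;
    std::unordered_map<uint8_t*, std::shared_ptr<CacheData>> entries;
};

std::shared_ptr<CacheData> lookup_or_insert(NodeCacheState& state, uint8_t* source, size_t size) {
    {
        // Shared lock: concurrent Cache::Access calls may perform the check in parallel.
        std::shared_lock<std::shared_mutex> shared(state.mutex);
        auto it = state.entries.find(source);
        if (it != state.entries.end()) return it->second;
    }
    // Unique lock: insert the newly created entry; re-check the map because
    // another thread may have added the same region in the meantime.
    std::unique_lock<std::shared_mutex> unique(state.mutex);
    auto it = state.entries.find(source);
    if (it == state.entries.end()) {
        it = state.entries.emplace(source, std::make_shared<CacheData>(source, size)).first;
    }
    return it->second;
}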
@@ -130,6 +128,8 @@ This secondary value allows \(T_2\) to pass the wait, then perform the exchange
Applying the \texttt{Cache} to \gls{qdp} is a straightforward process. We adapted the benchmarking code developed by Anna Bartuschka and André Berthold \cite{dimes-prefetching}, invoking \texttt{Cache::Access} for both prefetching and cache access. Due to the high number of smaller submissions, we decided to forgo splitting tasks onto multiple \gls{dsa}s and instead distribute the copy tasks per thread in round-robin fashion. This incurs less delay from submission cost which, as shown in Section \ref{subsec:perf:submitmethod}, rises with smaller tasks. The cache location is fixed to \gls{numa:node} 8, the \gls{hbm} accessor of \gls{numa:node} 0 to which the application will be bound and therefore exclusively run on. \par
\todo{not instead here but due to analysis}
During the performance analysis of the developed \texttt{Cache}, we discovered that \gls{intel:dml} does not utilize interrupt-based completion signalling (Section \ref{subsubsec:state:completion-signal}), but instead employs busy-waiting on the completion descriptor being updated. Given that this busy waiting incurs CPU cycles, waiting on task completion is deemed impractical, necessitating code modifications. We extended \texttt{CacheData} and \texttt{Cache} to incorporate support for weak waiting. By introducing a flag configurable in \texttt{Cache}, all instances of \texttt{CacheData} created via \texttt{Cache::Access} will check only once whether the \gls{dsa} has completed processing the caching operation, and if not, return without updating the cache pointer. Consequently, calls to \texttt{CacheData::GetDataLocation} may return \texttt{nullptr} even after waiting, placing the responsibility on the user to access the data through its source location. For applications prioritizing latency, \texttt{Cache::Access} offers the option for weak access. When activated, the function returns only existing instances of \texttt{CacheData}, thereby avoiding work submission to the \gls{dsa} if the address has not been previously cached or was flushed since the last access. Using these two options, we can avoid work submission and busy waiting where access latency is paramount. \par
\todo{write our observations and then link to the to-be-added section describing these updates}
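A sketch of the latency-focused access pattern these two options enable (not from the thesis code); whether weak access is selected via a parameter or a separate method is not specified here, so the call below is only indicative.

#include <cstddef>
#include <cstdint>
#include <memory>

// Indicative only: assumes weak access returns an empty pointer when the
// address is not yet cached, and that weak waiting lets GetDataLocation
// return nullptr while the DSA has not finished the copy.
uint8_t* resolve(Cache& cache, uint8_t* source, size_t size) {
    std::unique_ptr<CacheData> entry = cache.Access(source, size /* weak access assumed enabled */);
    if (!entry) {
        return source;  // nothing cached yet and no work submitted
    }
    entry->WaitOnCompletion();  // with weak waiting: checks once, no busy-wait
    uint8_t* cached = entry->GetDataLocation();
    return cached != nullptr ? cached : source;  // fall back to the source location
}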

2
thesis/content/70_conclusion.tex

@@ -18,7 +18,7 @@
In this work, our aim was to analyse the architecture and performance of the \glsentrylong{dsa} and integrate it into \glsentrylong{qdp}. We characterized the hardware and software architecture of the \gls{dsa} in Section \ref{sec:state:dsa} and provided an overview of the available programming interface, \glsentrylong{intel:dml}, in Section \ref{sec:state:dml}. Our benchmarks were tailored to the planned application and included evaluations such as copy performance from \glsentryshort{dram} to \gls{hbm} (Section \ref{subsec:perf:datacopy}), the cost of multithreaded work submission (Section \ref{subsec:perf:mtsubmit}), and an analysis of different submission methods and sizes (Section \ref{subsec:perf:submitmethod}). Notably, we observed an anomaly in inter-socket copy speeds and found that the scaling of throughput was distinctly below linear (see Figure \ref{fig:perf-dsa-analysis:scaling}). Although not all observations were explainable, the results provided important insights into the behaviour of the \gls{dsa} and its potential application in multi-socket systems and \gls{hbm}, complementing existing analyses \cite{intel:analysis}. \par
Upon applying the cache developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}, we encountered challenges related to available memory bandwidth and the lack of feature support in the \glsentryshort{api} used to interact with the \gls{dsa}. While the \texttt{Cache} represents a substantial contribution to the field, its applicability is constrained to data that is infrequently mutated. Although support exists for entry invalidation, it is rather rudimentary, requiring manual invalidation and the developer to keep track of cached blocks and ensure they are not overlapping (see Section \ref{subsec:design:restrictions}). To address this, a custom container data type could be developed to automatically trigger invalidation through the cache upon modification and adding age tags to the data, which consumer threads can pass on. This tagging can then be used to verify that work was completed with the most current version. \par
Upon applying the cache developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}, we encountered challenges related to available memory bandwidth and the lack of feature support in the \glsentryshort{api} used to interact with the \gls{dsa}. While the \texttt{Cache} represents a substantial contribution to the field, its applicability is constrained to data that is infrequently mutated. Although support exists for entry invalidation, it is rather rudimentary, requiring manual invalidation and the developer to keep track of cached blocks and ensure they are not overlapping (see Section \ref{sec:design:restrictions}). To address this, a custom container data type could be developed to automatically trigger invalidation through the cache upon modification and adding age tags to the data, which consumer threads can pass on. This tagging can then be used to verify that work was completed with the most current version. \par
In Section \ref{sec:eval:observations}, we observed adverse effects when prefetching with the cache during the parallel execution of memory-bound operations. This necessitated data distribution across multiple \glsentrylong{numa:node}s to circumvent bandwidth competition caused by parallel caching operations. Despite this limitation, we do not consider it a major fault of the \texttt{Cache}, as existing applications designed for \gls{numa} systems are likely already optimized in this regard. \par

100
thesis/images/image-source/design-classdiagram.xml

@@ -1,80 +1,80 @@
<mxfile host="app.diagrams.net" modified="2024-01-22T13:20:17.652Z" agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36" etag="97LZP_lnqzgC0hqZ7VOA" version="22.1.21" type="device">
<mxfile host="app.diagrams.net" modified="2024-02-15T12:40:37.148Z" agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36" etag="bDsGiENRX6gaS5lY3tf3" version="22.1.21" type="device">
<diagram name="Page-1" id="xBjmK5o3fU9FCVY3KYoo">
<mxGraphModel dx="1434" dy="803" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="827" pageHeight="369" math="0" shadow="0">
<mxGraphModel dx="276" dy="154" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="827" pageHeight="369" math="0" shadow="0">
<root>
<mxCell id="0" />
<mxCell id="1" parent="0" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-23" value="" style="group;fillColor=#d5e8d4;strokeColor=#82b366;" parent="1" vertex="1" connectable="0">
<mxGeometry x="170" y="40" width="230" height="190" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-23" value="" style="group;fillColor=#d5e8d4;strokeColor=none;strokeWidth=9;container=0;" parent="1" vertex="1" connectable="0">
<mxGeometry x="130" y="40" width="230" height="190" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-1" value="" style="rounded=0;whiteSpace=wrap;html=1;" parent="tB5LUjmhD6zCi5oJ_og_-23" vertex="1">
<mxGeometry width="230" height="190" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-26" value="" style="group;fillColor=#fff2cc;strokeColor=#d6b656;container=0;" parent="1" vertex="1" connectable="0">
<mxGeometry x="440" y="40" width="230" height="270" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-14" value="" style="rounded=0;whiteSpace=wrap;html=1;" parent="1" vertex="1">
<mxGeometry x="440" y="40" width="230" height="190" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-2" value="" style="rounded=0;whiteSpace=wrap;html=1;strokeWidth=4;fillColor=none;" parent="tB5LUjmhD6zCi5oJ_og_-23" vertex="1">
<mxGeometry width="230" height="190" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-15" value="" style="rounded=0;whiteSpace=wrap;html=1;strokeWidth=3;fillColor=none;" parent="1" vertex="1">
<mxGeometry x="440" y="40" width="230" height="290" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-3" value="&lt;b&gt;CacheData&lt;/b&gt;" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;fillColor=default;strokeColor=default;strokeWidth=2;" parent="tB5LUjmhD6zCi5oJ_og_-23" vertex="1">
<mxGeometry width="230" height="40" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-16" value="&lt;b&gt;Cache&lt;br&gt;&lt;/b&gt;" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;fillColor=default;strokeColor=default;strokeWidth=1;" parent="1" vertex="1">
<mxGeometry x="440" y="40" width="230" height="40" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-5" value="CacheData(uint8_t* data, size_t size)" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;fillColor=#f5f5f5;fontColor=#333333;strokeColor=#666666;" parent="tB5LUjmhD6zCi5oJ_og_-23" vertex="1">
<mxGeometry y="40" width="230" height="30" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-17" value="Cache() = default" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;fillColor=#f5f5f5;fontColor=#333333;strokeColor=#666666;" parent="1" vertex="1">
<mxGeometry x="440" y="80" width="230" height="30" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-6" value="CacheData(const CacheData&amp;amp; other)" style="text;html=1;strokeColor=#82b366;fillColor=#d5e8d4;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="tB5LUjmhD6zCi5oJ_og_-23" vertex="1">
<mxGeometry y="70" width="230" height="30" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-18" value="void Flush(int node = -1)" style="text;html=1;strokeColor=#d6b656;fillColor=#fff2cc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="440" y="245" width="230" height="30" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-7" value="~CacheData()" style="text;html=1;strokeColor=#82b366;fillColor=#d5e8d4;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="tB5LUjmhD6zCi5oJ_og_-23" vertex="1">
<mxGeometry y="100" width="230" height="30" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-19" value="~Cache()" style="text;html=1;strokeColor=#b85450;fillColor=#f8cecc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="440" y="110" width="230" height="30" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-8" value="void WaitOnCompletion()" style="text;html=1;strokeColor=#82b366;fillColor=#d5e8d4;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="tB5LUjmhD6zCi5oJ_og_-23" vertex="1">
<mxGeometry y="130" width="230" height="30" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-20" value="void Init(CachePolicy*, CopyPolicy*, MemFree*, MemAlloc*)" style="text;html=1;strokeColor=#b85450;fillColor=#f8cecc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="440" y="140" width="230" height="60" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-9" value="" style="endArrow=none;html=1;rounded=0;exitX=1;exitY=0;exitDx=0;exitDy=0;entryX=0;entryY=0;entryDx=0;entryDy=0;strokeWidth=2;strokeColor=none;" parent="tB5LUjmhD6zCi5oJ_og_-23" source="tB5LUjmhD6zCi5oJ_og_-8" target="tB5LUjmhD6zCi5oJ_og_-8" edge="1">
<mxCell id="tB5LUjmhD6zCi5oJ_og_-21" value="" style="endArrow=none;html=1;rounded=0;exitX=1;exitY=0;exitDx=0;exitDy=0;entryX=0;entryY=0;entryDx=0;entryDy=0;strokeWidth=2;strokeColor=none;" parent="1" source="tB5LUjmhD6zCi5oJ_og_-20" target="tB5LUjmhD6zCi5oJ_og_-20" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="530" y="370" as="sourcePoint" />
<mxPoint x="580" y="320" as="targetPoint" />
<mxPoint x="970" y="410" as="sourcePoint" />
<mxPoint x="1020" y="360" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-11" value="uint8_t* GetDataLocation() const" style="text;html=1;strokeColor=#82b366;fillColor=#d5e8d4;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="tB5LUjmhD6zCi5oJ_og_-23" vertex="1">
<mxGeometry y="160" width="230" height="30" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-22" value="std::unique_ptr&amp;lt;CacheData&amp;gt; Access(uint8_t* data, size_t size)" style="text;html=1;strokeColor=#d6b656;fillColor=#fff2cc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="440" y="195" width="230" height="50" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-26" value="" style="group;fillColor=#fff2cc;strokeColor=#d6b656;" parent="1" vertex="1" connectable="0">
<mxGeometry x="440" y="40" width="230" height="270" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-24" value="void Clear()" style="text;html=1;strokeColor=#d6b656;fillColor=#fff2cc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="440" y="275" width="230" height="30" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-14" value="" style="rounded=0;whiteSpace=wrap;html=1;" parent="tB5LUjmhD6zCi5oJ_og_-26" vertex="1">
<mxGeometry width="230" height="190" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-25" value="void Invalidate()" style="text;html=1;strokeColor=#d6b656;fillColor=#fff2cc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="440" y="305" width="230" height="25" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-15" value="" style="rounded=0;whiteSpace=wrap;html=1;strokeWidth=4;fillColor=none;" parent="tB5LUjmhD6zCi5oJ_og_-26" vertex="1">
<mxGeometry width="230" height="270" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-1" value="" style="rounded=0;whiteSpace=wrap;html=1;" parent="1" vertex="1">
<mxGeometry x="130" y="40" width="230" height="190" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-16" value="&lt;b&gt;Cache&lt;br&gt;&lt;/b&gt;" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;fillColor=default;strokeColor=default;strokeWidth=2;" parent="tB5LUjmhD6zCi5oJ_og_-26" vertex="1">
<mxGeometry width="230" height="40" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-2" value="" style="rounded=0;whiteSpace=wrap;html=1;strokeWidth=3;fillColor=none;" parent="1" vertex="1">
<mxGeometry x="130" y="40" width="230" height="190" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-17" value="Cache() = default" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;fillColor=#f5f5f5;fontColor=#333333;strokeColor=#666666;" parent="tB5LUjmhD6zCi5oJ_og_-26" vertex="1">
<mxGeometry y="40" width="230" height="30" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-3" value="&lt;b&gt;CacheData&lt;/b&gt;" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;fillColor=default;strokeColor=default;strokeWidth=1;" parent="1" vertex="1">
<mxGeometry x="130" y="40" width="230" height="40" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-18" value="void Flush(int node = -1)" style="text;html=1;strokeColor=#d6b656;fillColor=#fff2cc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="tB5LUjmhD6zCi5oJ_og_-26" vertex="1">
<mxGeometry y="180" width="230" height="30" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-5" value="CacheData(uint8_t* data, size_t size)" style="text;html=1;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;fillColor=#f5f5f5;fontColor=#333333;strokeColor=#666666;" parent="1" vertex="1">
<mxGeometry x="130" y="80" width="230" height="30" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-19" value="~Cache()" style="text;html=1;strokeColor=#b85450;fillColor=#f8cecc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="tB5LUjmhD6zCi5oJ_og_-26" vertex="1">
<mxGeometry y="70" width="230" height="30" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-6" value="CacheData(const CacheData&amp;amp; other)" style="text;html=1;strokeColor=#82b366;fillColor=#d5e8d4;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="130" y="110" width="230" height="30" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-20" value="void Init(CachePolicy*, CopyPolicy*)" style="text;html=1;strokeColor=#b85450;fillColor=#f8cecc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="tB5LUjmhD6zCi5oJ_og_-26" vertex="1">
<mxGeometry y="100" width="230" height="30" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-7" value="~CacheData()" style="text;html=1;strokeColor=#82b366;fillColor=#d5e8d4;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="130" y="140" width="230" height="30" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-21" value="" style="endArrow=none;html=1;rounded=0;exitX=1;exitY=0;exitDx=0;exitDy=0;entryX=0;entryY=0;entryDx=0;entryDy=0;strokeWidth=2;strokeColor=none;" parent="tB5LUjmhD6zCi5oJ_og_-26" source="tB5LUjmhD6zCi5oJ_og_-20" target="tB5LUjmhD6zCi5oJ_og_-20" edge="1">
<mxCell id="tB5LUjmhD6zCi5oJ_og_-8" value="void WaitOnCompletion()" style="text;html=1;strokeColor=#82b366;fillColor=#d5e8d4;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="130" y="170" width="230" height="30" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-9" value="" style="endArrow=none;html=1;rounded=0;exitX=1;exitY=0;exitDx=0;exitDy=0;entryX=0;entryY=0;entryDx=0;entryDy=0;strokeWidth=2;strokeColor=none;" parent="1" source="tB5LUjmhD6zCi5oJ_og_-8" target="tB5LUjmhD6zCi5oJ_og_-8" edge="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="530" y="370" as="sourcePoint" />
<mxPoint x="580" y="320" as="targetPoint" />
<mxPoint x="660" y="410" as="sourcePoint" />
<mxPoint x="710" y="360" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-22" value="std::unique_ptr&amp;lt;CacheData&amp;gt; Access(uint8_t* data, size_t size)" style="text;html=1;strokeColor=#d6b656;fillColor=#fff2cc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="tB5LUjmhD6zCi5oJ_og_-26" vertex="1">
<mxGeometry y="130" width="230" height="50" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-24" value="void Clear()" style="text;html=1;strokeColor=#d6b656;fillColor=#fff2cc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="tB5LUjmhD6zCi5oJ_og_-26" vertex="1">
<mxGeometry y="210" width="230" height="30" as="geometry" />
</mxCell>
<mxCell id="tB5LUjmhD6zCi5oJ_og_-25" value="void Invalidate()" style="text;html=1;strokeColor=#d6b656;fillColor=#fff2cc;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="tB5LUjmhD6zCi5oJ_og_-26" vertex="1">
<mxGeometry y="240" width="230" height="30" as="geometry" />
<mxCell id="tB5LUjmhD6zCi5oJ_og_-11" value="uint8_t* GetDataLocation() const" style="text;html=1;strokeColor=#82b366;fillColor=#d5e8d4;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" parent="1" vertex="1">
<mxGeometry x="130" y="200" width="230" height="30" as="geometry" />
</mxCell>
</root>
</mxGraphModel>
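
The class diagram source above documents the public interface of Cache and CacheData. To make that interface concrete, here is a minimal C++ usage sketch derived purely from the method names visible in the diagram (Cache::Access, CacheData::WaitOnCompletion, CacheData::GetDataLocation, Cache::Flush). The helper function, the policy setup left out here, and the behavioural comments are illustrative assumptions, not code taken from the repository.

#include <cstdint>
#include <cstddef>
#include <memory>
// #include "cache.hpp"   // assumption: project header declaring Cache and CacheData

// Hypothetical helper showing the access pattern implied by the diagram:
// request a cached copy, wait for the asynchronous transfer, then operate
// on whatever location the cache hands back. Assumes cache.Init(...) was
// already called with suitable CachePolicy / CopyPolicy implementations.
void ProcessWithCache(Cache& cache, uint8_t* src, std::size_t size) {
    std::unique_ptr<CacheData> entry = cache.Access(src, size);  // may trigger an asynchronous copy
    entry->WaitOnCompletion();                                    // block until the copy has finished
    uint8_t* location = entry->GetDataLocation();                 // cached copy (assumption: falls back to
                                                                  // the original pointer if caching failed)
    // ... operate on "location" ...
    cache.Flush();                                                // assumption: drops unused cache entries
}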

BIN
thesis/images/plot-dsa-throughput-scaling.pdf

BIN
thesis/images/uml-cache-and-cachedata.pdf

44
thesis/own.gls

@ -3,7 +3,7 @@
name={IOMMU},
long={Input/Output Memory Management Unit},
first={Input/Output Memory Management Unit (IOMMU)},
description={... desc ...}
description={\textsc{\glsentrylong{iommu}:} \todo{write iommu description}}
}
\newglossaryentry{atc}{
@ -11,7 +11,7 @@
name={ATC},
long={Address Translation Cache},
first={Address Translation Cache (ATC)},
description={... desc ...}
description={\textsc{\glsentrylong{atc}:} \todo{write atc description}}
}
\newglossaryentry{bar}{
@ -19,7 +19,7 @@
name={BAR},
long={Base Address Register},
first={Base Address Register (BAR)},
description={... desc ...}
description={\textsc{\glsentrylong{bar}:} \todo{write bar description}}
}
\newglossaryentry{dsa}{
@ -27,7 +27,7 @@
name={DSA},
long={Intel Data Streaming Accelerator},
first={Intel Data Streaming Accelerator (DSA)},
description={... desc ...}
description={\textsc{\glsentrylong{dsa}:} A component of modern Intel server processors, capable of executing common data operations asynchronously and thereby offloading them from the CPU.}
}
\newglossaryentry{dsa:wq}{
@ -35,7 +35,7 @@
name={WQ},
long={Work Queue},
first={Work Queue (WQ)},
description={... desc ...}
description={\textsc{\glsentrylong{dsa:wq}:} Architectural component of the \gls{dsa} to which work descriptors are submitted by the user. See Section \ref{subsec:state:dsa-arch-comp} for more detail on its function.}
}
\newglossaryentry{dsa:swq}{
@ -43,7 +43,7 @@
name={SWQ},
long={Shared Work Queue},
first={Shared Work Queue (SWQ)},
description={... desc ...}
description={\textsc{\glsentrylong{dsa:swq}:} A type of Work Queue to which submissions are implicitly synchronized, allowing safe usage from multiple processes. See Section \ref{subsec:state:dsa-arch-comp} for more detail.}
}
\newglossaryentry{dsa:dwq}{
@ -51,7 +51,7 @@
name={DWQ},
long={Dedicated Work Queue},
first={Dedicated Work Queue (DWQ)},
description={... desc ...}
description={\textsc{\glsentrylong{dsa:dwq}:} A type of Work Queue only usable by one process, and therefore with potentially lower submission overhead. See Section \ref{subsec:state:dsa-arch-comp} for more detail.}
}
\newglossaryentry{pcie-dmr}{
@ -59,7 +59,7 @@
name={DMR},
long={PCIe Deferrable Memory Write Request},
first={PCIe Deferrable Memory Write Request (DMR)},
description={... desc ...}
description={\textsc{\glsentrylong{pcie-dmr}:} \todo{write pcie-dmr description}}
}
\newglossaryentry{x86:enqcmd}{
@ -67,7 +67,7 @@
name={ENQCMD},
long={x86 Instruction ENQCMD},
first={x86 Instruction ENQCMD},
description={... desc ...}
description={\textsc{\glsentrylong{x86:enqcmd}:} \todo{write enqcmd description}}
}
\newglossaryentry{x86:movdir64b}{
@ -75,7 +75,7 @@
name={MOVDIR64B},
long={x86 Instruction MOVDIR64B},
first={x86 Instruction MOVDIR64B},
description={... desc ...}
description={\textsc{\glsentrylong{x86:movdir64b}:} \todo{write movdir64b description}}
}
\newglossaryentry{x86:pasid}{
@ -83,7 +83,7 @@
name={PASID},
long={Process Address Space ID},
first={Process Address Space ID (PASID)},
description={... desc ...}
description={\textsc{\glsentrylong{x86:pasid}:} Identifier used by the \glsentryshort{dsa} in conjunction with the \glsentryshort{iommu} to resolve virtual addresses. See Section \ref{subsubsec:state:dsa-vaddr}.}
}
\newglossaryentry{intel:dml}{
@ -91,7 +91,7 @@
name={Intel DML},
long={Intel Data Mover Library},
first={Intel Data Mover Library (Intel DML)},
description={... desc ...}
description={\textsc{\glsentrylong{intel:dml}:} A library providing a high-level interface to the \glsentryshort{dsa}. See the usage example in Section \ref{sec:state:dml} or the library documentation \cite{intel:dmldoc} for further information.}
}
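
For context on the entry above: a minimal sketch of the high-level interface exposed by the Intel Data Mover Library, based on its public documentation. The buffer contents, sizes, and the choice of the hardware execution path are illustrative only.

#include <dml/dml.hpp>
#include <cstdint>
#include <vector>

// Copy one buffer into another through the DSA using Intel DML's
// high-level C++ API; returns non-zero if the operation did not succeed.
int main() {
    std::vector<uint8_t> src(4096, 0xAB);
    std::vector<uint8_t> dst(4096, 0x00);

    auto result = dml::execute<dml::hardware>(
        dml::mem_copy,
        dml::make_view(src.data(), src.size()),
        dml::make_view(dst.data(), dst.size()));

    return result.status == dml::status_code::ok ? 0 : 1;
}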
\newglossaryentry{numa}{
@ -99,7 +99,7 @@
name={NUMA},
long={Non-Uniform Memory Architecture},
first={Non-Uniform Memory Architecture (NUMA)},
description={... desc ...}
description={\textsc{\glsentrylong{numa}:} Describes a system architecture organized into multiple nodes, where each node observes different memory access characteristics (latency and bandwidth) across the available address range.}
}
\newglossaryentry{numa:node}{
@ -107,7 +107,7 @@
name={Node},
long={NUMA-Node},
first={NUMA-Node (Node)},
description={... desc ...}
description={\textsc{\glsentrylong{numa:node}:} A node in a NUMA system. See the entry for NUMA for an explanation of both terms.}
}
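
To illustrate the two entries above: on a NUMA system, software can pin an allocation to a particular node so that accesses from cores on that node stay local. The following sketch uses the libnuma C API, which is not part of this thesis' code; the node index and buffer size are arbitrary example values.

#include <numa.h>      // libnuma; link with -lnuma
#include <cstdio>
#include <cstddef>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    const std::size_t size = 1 << 20;
    void* buffer = numa_alloc_onnode(size, 1);  // place the allocation on node 1
    if (buffer == nullptr) return 1;
    // ... use the node-local buffer ...
    numa_free(buffer, size);
    return 0;
}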
\newglossaryentry{hbm}{
@ -115,7 +115,7 @@
name={HBM},
long={High Bandwidth Memory},
first={High Bandwidth Memory (HBM)},
description={... desc ...}
description={\textsc{\glsentrylong{hbm}:} Main memory technology consisting of stacked DRAM dies. Section \ref{sec:state:hbm} offers more detail.}
}
\newglossaryentry{dram}{
@ -123,7 +123,7 @@
name={DDR-SDRAM},
long={Double Data Rate Synchronous Dynamic Random Access Memory},
first={Double Data Rate Synchronous Dynamic Random Access Memory (DDR-SDRAM)},
description={... desc ...}
description={\textsc{\glsentrylong{dram}:} Main memory technology found in common computer systems.}
}
\newglossaryentry{qdp}{
@ -131,13 +131,14 @@
name={QdP},
long={Query-driven Prefetching},
first={Query-driven Prefetching (QdP)},
description={... desc ...}
description={\textsc{\glsentrylong{qdp}:} Methodology to determine which database columns are worth prefetching into cache in order to accelerate a query. Described in Section \ref{sec:state:qdp}.}
}
\newglossaryentry{mempress}{
short={memory pressure},
name={Memory Pressure},
description={... desc ...}
long={Memory Pressure},
description={\textsc{\glsentrylong{mempress}:} A situation in which high memory utilization is encountered.}
}
\newglossaryentry{api}{
@ -145,13 +146,14 @@
name={API},
long={Application Programming Interface},
first={Application Programming Interface (API)},
description={... desc ...}
description={\textsc{\glsentrylong{api}:} Public functions exposed by a library, through which programs utilizing this library can interact with it.}
}
\newglossaryentry{remotemem}{
short={Remote Memory},
name={Remote Memory},
description={... desc ...}
long={Remote Memory},
description={\textsc{\glsentrylong{remotemem}:} ... desc ... \todo{write remotemem description}}
}
\newglossaryentry{nvram}{
@ -159,5 +161,5 @@
name={NVRAM},
long={Non-Volatile RAM},
first={Non-Volatile RAM (NVRAM)},
description={... desc ...}
description={\textsc{\glsentrylong{nvram}:} Main memory technology which, unlike \glsentryshort{dram}, retains its data even when power is lost.}
}