\caption{\glsentrylong{dsa} Internal Architecture. Shows the components that the chip is made up of, how they are connected and which outside components the \glsentryshort{dsa} communicates with. \cite[Fig. 1 (a)]{intel:analysis}}
@ -96,7 +96,7 @@ Ordering of operations is only guaranteed for a configuration with one \gls{dsa:
\caption{\glsentrylong{dsa} Software View. Illustrating the software stack and internal interactions from user applications, through the driver to the portal for work submission. \cite[Fig. 1 (b)]{intel:analysis}}
@ -114,7 +114,7 @@ With some limitations, like lacking support for \gls{dsa:dwq} submission, this l
As mentioned in Section \ref{subsec:state:dsa-software-view}, \gls{intel:dml} offers a high level interface for interacting with the hardware accelerator, specifically Intel \gls{dsa}. Opting for the C++ interface, we will now demonstrate its usage with a simple memcpy implementation for the \gls{dsa} as an example. \par
\caption{\glsentrylong{dml} Memcpy Implementation Pseudocode. Copies a block of memory from source to destination. The \glsentryshort{dsa} executing this copy can be selected with the parameter \texttt{node}, and the template parameter \texttt{path} selects whether to run on hardware (Intel \glsentryshort{dsa}) or software (CPU).}
\caption{Xeon Max \glsentrylong{node} Layout \cite[Fig. 14]{intel:maxtuning} for a 2-Socket System when configured with HBM-Flat. Showing separate \glsentryshort{numa:node} IDs for manual \glsentryshort{hbm} access and for Cores and \glsentryshort{dram}.}
@ -23,7 +23,7 @@ The benchmark performs node setup as described in Section \ref{sec:state:dml} an
Timing in the outer loop may report a lower throughput than was actually achieved. This happens when one of the \gls{dsa}s participating in a given task finishes earlier than the others. We decided to measure the maximum time and therefore minimum throughput for these cases, as we want the benchmarks to represent the peak achievable for distributing one task over multiple engines and not executing multiple tasks of a disjoint set. As a task can only be considered complete when all subtasks are completed, the minimum throughput represents this scenario. This may give an advantage to the peak CPU throughput benchmark we will reference later on, as it does not have this restriction placed upon it. \par
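The choice of the maximum time can be sketched in C++ as follows. This is a minimal sketch under stated assumptions: the function name \texttt{task\_throughput} and the vector of per-\gls{dsa} timings are hypothetical and not taken from the benchmark code. \par

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the benchmark code.
// elapsed_seconds holds one measured duration per participating DSA.
// The task only counts as complete once the slowest engine has finished,
// so the throughput is computed from the maximum duration.
double task_throughput(std::size_t total_bytes,
                       const std::vector<double>& elapsed_seconds) {
    if (elapsed_seconds.empty()) {
        return 0.0;
    }
    const double slowest =
        *std::max_element(elapsed_seconds.begin(), elapsed_seconds.end());
    if (slowest <= 0.0) {
        return 0.0;
    }
    // Result is in bytes per second.
    return static_cast<double>(total_bytes) / slowest;
}
```

Dividing by the slowest duration means that an engine finishing early does not raise the reported value, which matches the minimum-throughput view described above. \par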
\caption{Outer Benchmark Procedure Pseudocode. Timing marked with yellow background. Showing preparation of memory locations, clearing of cache entries, timing points and synchronized benchmark launch.}
@ -32,7 +32,7 @@ Timing in the outer loop may display lower throughput than actual. This is the c
To get accurate results, the benchmark is repeated 10 times. Each iteration is timed from beginning to end, marked by yellow in Figure \ref{fig:benchmark-function:outer}. For small task sizes, the iterations complete in a very short amount of time, which can have adverse effects on the results. Therefore, we repeat the code of the inner loop a configurable number of times, virtually extending the duration of a single iteration for these cases. \par
\caption{Inner Benchmark Procedure Pseudocode. Showing work submission for single and batch submission.}
@ -53,7 +53,7 @@ With each submission, descriptors must be prepared and sent to the underlying ha
We anticipate that single submissions will consistently yield poorer performance, particularly with a pronounced effect on smaller transfer sizes. This expectation arises from the fact that the overhead of a single submission with the \gls{dsa:swq} is incurred for every iteration, whereas the batch experiences this overhead only once for multiple copies. \par
\caption{Throughput for different Submission Methods and Sizes. Performing a copy with source and destination being node 0, executed by the \glsentryshort{dsa} on node 0. The submission cost is observable; it weighs more heavily on small transfer sizes, whose completion time is lower.}
@ -69,7 +69,7 @@ Another limitation may be observed in this result, namely the inherent throughpu
As we might encounter access to one \gls{dsa} from multiple threads through the associated \glsentrylong{dsa:swq}, understanding the impact of this type of access is crucial. We benchmark multithreaded submission for one, two, and twelve threads, with the latter representing the core count of one processing sub-node on the test system. We spawn multiple threads, all submitting to one \gls{dsa}. Furthermore, we perform this benchmark with sizes of 1 MiB and 1 GiB to examine whether the behaviour changes with submission size. For smaller sizes, the completion time may be shorter than the submission time, leading to potentially different effects of threading, as multiple threads work to fill the queue and thereby prevent task starvation. We may also experience lower-than-peak throughput with rising thread count, caused by the synchronization inherent to the \gls{dsa:swq}. \par
\caption{Throughput for different Thread Counts and Sizes. Multiple threads submit to the same Shared Work Queue. Performing a copy with source and destination being node 0, executed by the DSA on node 0.}
@ -101,7 +101,7 @@ To determine the optimal amount of \gls{dsa}s, we will measure throughput for on
For this benchmark, we transfer 1 GiB of data from node 0 to the destination node. We present data for nodes 8, 11, 12, and 15. To understand the selection, refer to Figure \ref{fig:perf-xeonmaxnuma}, which illustrates the node IDs of the configured systems and the corresponding storage technology. Node 8 accesses the \gls{hbm} on node 0, making it the physically closest possible destination. Node 11 is located diagonally on the chip, representing the farthest intra-socket operation benchmarked. Nodes 12 and 15 lie diagonally on the second socket's CPU, making them representative of inter-socket transfer operations. \par
\begin{figure}[h!tb]
\begin{figure}[!t]
\centering
\begin{subfigure}[t]{0.225\textwidth}
\centering
@ -140,16 +140,16 @@ Unexpected throughput differences are evident for all configurations, except the
For the results of the Brute-Force approach illustrated in Figure \ref{fig:perf-dsa:8}, we observe peak speeds when copying across sockets from \gls{numa:node} 0 to \gls{numa:node} 15. This contradicts our assumption that peak bandwidth would be limited by the interconnect. However, for intra-node copies, there is an observable penalty for using the off-socket \gls{dsa}s. We will analyse this behaviour by comparing the different benchmarked configurations and summarize our findings on scalability. \par
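To make the scaling factor used in the following comparison concrete, consider a purely hypothetical example; the numbers are illustrative assumptions and not measurements from our test system. Suppose one \gls{dsa} reaches 30 GiB/s and four \gls{dsa}s together reach 90 GiB/s. With the scaling factor defined as the throughput over the baseline throughput for one \gls{dsa}, multiplied by the inverse of the utilization factor, this yields
\[
\frac{90\ \mathrm{GiB/s}}{30\ \mathrm{GiB/s}} \cdot \frac{1}{4} = 0.75 .
\]
A value of 1 would correspond to perfectly linear scaling, while values below 1 indicate diminishing returns for each added \gls{dsa}. \par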
\caption{Scaling Factor for different amounts of participating \gls{dsa}. Determined by formula \\\(\frac{Throughput}{Baseline\ Throughput}\cdot\frac{1}{Utilization\ Factor}\)\\ with the baseline being Throughput for 1 \gls{dsa} and the utilization factor representing the factor of the amount of \gls{dsa}s being used over the baseline.}
@ -170,7 +170,7 @@ The choice of a load balancing method is not trivial. If peak throughput of one
For evaluating CPU copy performance we use the benchmark code from the subsequent Section \ref{subsec:perf:datacopy}, selecting the software instead of hardware execution path (see Section \ref{subsec:state:dsa-software-view}). Colleagues performed extensive benchmarking of the peak throughput on CPU for the test system \cite{xeonmax-peakthroughput}, from which we will present results as well. We compare expectations and results from the previous Section with the measurements.\par
@ -26,7 +26,7 @@ The task of prefetching is somewhat aligned with that of a cache. As a cache is
The interface of \texttt{Cache} must provide three basic functions: (1) requesting a memory block to be cached, (2) accessing a cached memory block and (3) synchronizing cache with the source memory. The latter operation comes into play when the data that is cached may also be modified, necessitating an update either from the source or vice versa. Due to the various setups and use cases for this cache, the user should also be responsible for choosing cache placement and the copy method. As re-caching is resource intensive, data should remain in the cache for as long as possible. We only flush entries when a lack of free cache memory requires it. \par
\caption{Public Interface of \texttt{CacheData} and \texttt{Cache} Classes. Colour coding for thread safety. Grey denotes impossibility for threaded access. Green indicates full safety guarantees only relying on atomics to achieve this. Yellow may use locking but is still safe for use. Red must be called from a single threaded context.}
@ -46,7 +46,7 @@ Using \texttt{std::shared\_ptr<T>} also introduces uncertainty, relying on the i
Therefore, the decision was made to implement atomic reference counting for \texttt{CacheData}. This involves providing a custom constructor and destructor wherein a shared atomic integer is incremented or decremented using atomic fetch-add and fetch-sub operations to modify the reference count. If the count decreases to zero, the destructor was called for the last reference and performs the actual destruction. \par
\caption{Sequence for Blocking Scenario. Observable in first draft implementation. Scenario where \(T_1\) performed first access to a datum followed by \(T_2\) and \(T_3\). Then \(T_1\) holds the handlers exclusively, leading to the other threads having to wait for \(T_1\) to perform the work submission and waiting before they can access the datum through the cache.}
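The reference counting scheme can be sketched as follows. This is a minimal C++ sketch under stated assumptions and not the actual \texttt{CacheData} implementation: the class name \texttt{RefCounted}, the \texttt{int} payload and the accessor \texttt{use\_count} are hypothetical stand-ins chosen for illustration. \par

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for CacheData: all copies share one atomic counter
// and one payload. Names and payload type are assumptions for illustration.
class RefCounted {
public:
    // The first instance allocates the shared state with a count of one.
    RefCounted()
        : count_(new std::atomic<std::int32_t>(1)), payload_(new int(0)) {}

    // Copying shares the state and increments the count atomically.
    RefCounted(const RefCounted& other)
        : count_(other.count_), payload_(other.payload_) {
        count_->fetch_add(1, std::memory_order_relaxed);
    }

    RefCounted& operator=(const RefCounted&) = delete;

    ~RefCounted() {
        // fetch_sub returns the value before the decrement, so a result of
        // one means this destructor ran for the last remaining reference.
        if (count_->fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete payload_;
            delete count_;
        }
    }

    std::int32_t use_count() const {
        return count_->load(std::memory_order_acquire);
    }

private:
    std::atomic<std::int32_t>* count_;
    int* payload_;
};
```

Because the decrement and the check for the last reference happen in a single atomic operation, exactly one destructor call observes the transition to zero and frees the shared state. \par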
@ -57,7 +57,7 @@ Due to the possibility of access by multiple threads, the implementation of \tex
To illustrate this, an exemplary scenario is used, as seen in the sequence diagram Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}. Assume that three threads \(T_1\), \(T_2\) and \(T_3\) wish to access the same resource. \(T_1\) is the first to call \texttt{CacheData::Access} and therefore adds it to the cache state and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted and \(T_2\) and \(T_3\) obtain access to the incomplete \texttt{CacheData} on which they wait. They see a \texttt{nullptr} for the handlers but an invalid cache pointer, which leads to an atomic wait on the cache pointer (marked blue lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). \(T_1\) submits the work and sets the handlers (marked red lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) and \(T_3\) continue to wait. Consequently, only \(T_1\) can trigger the waiting and is thus capable of keeping \(T_2\) and \(T_3\) from progressing. This is undesirable, as it can lead to deadlocking if for some reason \(T_1\) does not wait, and at the very least may lead to unnecessary delay for \(T_2\) and \(T_3\) if \(T_1\) does not wait immediately. \par
@ -33,7 +33,7 @@ In this section, we will present our findings from integrating the \texttt{Cache
We benchmarked two methods to establish a baseline and an upper limit as reference points. In the former, all columns are located in \glsentryshort{dram}. The latter method simulates perfect prefetching without delay and overhead by placing column \texttt{b} in \gls{hbm} during benchmark initialization. \par
\begin{table}[h!tb]
\begin{table}[!t]
\centering
\input{tables/table-qdp-baseline.tex}
\caption{Table showing raw timing for \gls{qdp} on \glsentryshort{dram} and \gls{hbm}. Result for \glsentryshort{dram} serves as baseline while \glsentryshort{hbm} presents the upper boundary achievable with perfect prefetching. Raw Time is averaged over 5 iterations with previous warm up.}
@ -44,16 +44,16 @@ From Table \ref{table:qdp-baseline}, it is evident that accessing column \texttt
The following plots are normalized so that the longest execution from Figures \ref{fig:timing-comparison} and \ref{fig:timing-results} fills the half-circle. As waiting times at the barriers, which can vary by workload, are not displayed here, the graphs do not fully represent the total execution time. Additionally, the total runtime also encompasses some overhead that the per-pipeline timings do not cover. Therefore, a discrepancy between the raw runtime values from the Tables and Figures may be observed. \par
\caption{Column \texttt{a} located in \glsentryshort{dram} and \texttt{b} in \glsentryshort{hbm}.}
@ -69,7 +69,7 @@ Due to the higher bandwidth provided by \gls{hbm} for \(AGGREGATE\), the CPU wai
To address the challenges posed by sharing memory bandwidth between both \(SCAN\)-operations, we will conduct the prefetching benchmarking in two configurations. Firstly, both columns \texttt{a} and \texttt{b} will be situated on the same \gls{numa:node}. We anticipate demonstrating the memory bottleneck in this scenario, through increased execution time of \(SCAN_a\). Secondly, we will distribute the columns across two \gls{numa:node}s, both still utilizing \glsentryshort{dram}. In this configuration, the memory bottleneck is alleviated, leading us to anticipate better performance compared to the former setup. \par
\begin{table}[h!tb]
\begin{table}[!t]
\centering
\input{tables/table-qdp-speedup.tex}
\caption{Table showing Speedup for different \glsentryshort{qdp} Configurations over \glsentryshort{dram}. Result for \glsentryshort{dram} serves as baseline while \glsentryshort{hbm} presents the upper boundary achievable with perfect prefetching. Prefetching was performed with the same parameters and data locations as \gls{dram}, caching on Node 8 (\glsentryshort{hbm} accessor for the executing Node 0). Prefetching with Distributed Columns had columns \texttt{a} and \texttt{b} located on different Nodes. Raw Time is averaged over 5 iterations with previous warm up.}
@ -78,16 +78,16 @@ To address the challenges posed by sharing memory bandwidth between both \(SCAN\
The slowdown below our baseline when utilizing the \texttt{Cache} may be surprising at first glance. However, this result becomes reasonable when we consider that in this scenario, the \gls{dsa}s executing the caching tasks compete for bandwidth with the \(SCAN_a\) pipeline threads, and there is additional overhead from the \texttt{Cache}. Distributing the columns across different \gls{numa:node}s then results in a noticeable performance increase compared to our baseline, although it does not reach the upper boundary set by simulating perfect prefetching. This confirms our assumption that the \(SCAN_a\) pipeline itself is bandwidth-bound, as without this contention, we observe an increase in cache hit rate and decrease in processing time. We will now examine the performance in more detail with per-pipeline timings. \par