diff --git a/thesis/bachelor.pdf b/thesis/bachelor.pdf
index ffeedef..016b7b3 100644
Binary files a/thesis/bachelor.pdf and b/thesis/bachelor.pdf differ
diff --git a/thesis/content/30_performance.tex b/thesis/content/30_performance.tex
index 71fd841..cfe658a 100644
--- a/thesis/content/30_performance.tex
+++ b/thesis/content/30_performance.tex
@@ -17,9 +17,7 @@ The benchmarks were conducted on a dual-socket server equipped with two Intel Xe
 As \gls{intel:dml} does not have support for \glsentryshort{dsa:dwq}s (see Section \ref{sec:state:dml}), we run benchmarks exclusively with access through \glsentryshort{dsa:swq}s. The application written for the benchmarks can be inspected in source form under the directory \texttt{benchmarks} in the thesis repository \cite{thesis-repo}. \par
 
-The benchmark performs \gls{numa:node}-setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the \gls{numa:node}s passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts. We therefore also do not use \enquote{.block\_on\_fault()}, as we did for the memcpy-example in Section \ref{sec:state:dml}. \par
-
-Timing in the outer loop may display lower throughput than actual. This is the case, should one of the \gls{dsa}s participating in a given task finish earlier than the others. We decided to measure the maximum time and therefore minimum throughput for these cases, as we want the benchmarks to represent the peak achievable for distributing one task over multiple engines and not executing multiple tasks of a disjoint set. As a task can only be considered complete when all subtasks are completed, the minimum throughput represents this scenario. This may give an advantage to the peak CPU throughput benchmark we will reference later on, as it does not have this restriction placed upon it. \par
+\pagebreak
 
 \begin{figure}[t]
 	\centering
 	\label{fig:benchmark-function:outer}
 \end{figure}
 
+The benchmark performs \gls{numa:node}-setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the \gls{numa:node}s passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts, as demonstrated by the calls to \texttt{memset} in Figure \ref{fig:benchmark-function:outer}. We therefore also do not use \enquote{.block\_on\_fault()}, as we did for the memcpy-example in Section \ref{sec:state:dml}. \par
+
+Timing in the outer loop may report lower throughput than was actually achieved. This is the case when one of the \gls{dsa}s participating in a given task finishes earlier than the others. We decided to measure the maximum time, and therefore the minimum throughput, for these cases, as we want the benchmarks to represent the peak achievable when distributing one task over multiple engines, not when executing multiple tasks of a disjoint set. As a task can only be considered complete when all of its subtasks are completed, the minimum throughput represents this scenario. This may give an advantage to the peak CPU throughput benchmark we will reference later on, as it does not have this restriction placed upon it. \par
+
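To make the timing methodology of the added paragraphs concrete, the following minimal C++ sketch mirrors the outer loop: both memory regions are written to before the timed section, each sub-task is timed separately, and the slowest engine determines the reported (minimum) throughput. SubmitAndWait is a hypothetical stand-in for the Intel DML submit-and-wait sequence, modeled with memcpy so the sketch is self-contained; the actual benchmark code resides under the directory benchmarks in the thesis repository.

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-DSA submit-and-wait sequence done
// through Intel DML; modeled with memcpy here so the sketch is runnable.
std::chrono::nanoseconds SubmitAndWait(char* dst, const char* src, std::size_t n) {
    const auto begin = std::chrono::steady_clock::now();
    std::memcpy(dst, src, n);
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::steady_clock::now() - begin);
}

double MinThroughputGiBs(char* dst, char* src, std::size_t size, std::size_t split) {
    // Write to both regions before the timed section so page faults do
    // not affect the results; this is the role of memset described above.
    std::memset(src, 0xAB, size);
    std::memset(dst, 0x00, size);

    std::vector<std::chrono::nanoseconds> times(split);
    std::vector<std::thread> workers;
    const std::size_t chunk = size / split;

    for (std::size_t i = 0; i < split; ++i) {
        workers.emplace_back([&times, dst, src, chunk, i] {
            times[i] = SubmitAndWait(dst + i * chunk, src + i * chunk, chunk);
        });
    }
    for (auto& w : workers) w.join();

    // A task is only complete once all subtasks are: report the maximum
    // time, i.e. the minimum throughput, as described above.
    const auto slowest = *std::max_element(times.begin(), times.end());
    return (size / (1024.0 * 1024.0 * 1024.0)) /
           std::chrono::duration<double>(slowest).count();
}

Reporting the maximum per-engine time rather than the average reflects the completion semantics of a distributed task and explains why the CPU peak benchmark, which has no such restriction, may hold a slight advantage.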
 \begin{figure}[t]
 	\centering
 	\includegraphics[width=1.0\textwidth]{images/nsd-benchmark-inner.pdf}
@@ -151,10 +153,14 @@ For the results of the Brute-Force approach illustrated in Figure \ref{fig:perf-
 When comparing the Brute-Force approach with Push-Pull in Figure \ref{fig:perf-dsa-analysis:scaling}, average performance decreases while utilizing four times the resources over a longer duration. As shown in Figure \ref{fig:perf-dsa:2}, using Brute-Force still leads to a slight increase in throughput for inter-socket operations, although far from scaling linearly. Therefore, we conclude that, although data movement across the interconnect incurs additional cost, no hard bandwidth limit is observable, with inter-socket operation reaching the same peak speed also observed for intra-socket operation with four \gls{dsa}s. This might point to an architectural advantage, as we will encounter the expected speed reduction for copies crossing the socket boundary when executed on the CPU in Section \ref{subsec:perf:cpu-datacopy}. \par
 
+\pagebreak
+
 From the average throughput and scaling factors in Figure \ref{fig:perf-dsa-analysis}, it becomes evident that splitting tasks over more than two \gls{dsa}s yields only marginal gains. This could be due to increased congestion of the processors' interconnect; however, as no hard limit is encountered, this is not a definitive answer. Utilizing off-socket \gls{dsa}s proves disadvantageous in the average case, reinforcing the claim of communication overhead for crossing the socket boundary. \par
 
 The choice of a load balancing method is not trivial. Consulting Figure \ref{fig:perf-dsa-analysis:average}, the highest throughput is achieved by using four \gls{dsa}s. At the same time, this causes high system utilization, making it unsuitable for situations where resources are to be distributed among multiple control flows. For this case, Push-Pull achieves performance close to the real-world peak while also not wasting resources due to poor scaling (see Figure \ref{fig:perf-dsa-analysis:scaling}). \par
 
+\pagebreak
+
 \begin{figure}[t]
 	\centering
 	\begin{subfigure}[t]{0.35\textwidth}
@@ -181,6 +187,8 @@ For evaluating CPU copy performance we use the benchmark code from the previous
 As evident from Figure \ref{fig:perf-cpu:swpath}, the observed throughput of the software path is less than half of the theoretical bandwidth. Therefore, the software path is to be treated as a compatibility measure and not as a means of providing high-performance data copy operations. In Figure \ref{fig:perf-cpu:andrepeak}, peak throughput is achieved for intra-node operation, validating the assumption that there is a cost for communicating across sockets, which was not as directly observable with the \gls{dsa}. The same disadvantage for \gls{numa:node} 12, as observed in Section \ref{subsec:perf:datacopy}, can be seen in Figure \ref{fig:perf-cpu}. This points to an architectural anomaly which we could not explain with our knowledge or benchmarks. Further benchmarks conducted on this did not yield conclusive results, as the source and copy-thread locations did not seem to affect the observed speed delta. \par
 
+\pagebreak
+
 \section{Analysis}
 \label{sec:perf:analysis}
 
@@ -193,8 +201,6 @@ In this section we summarize the conclusions from the performed benchmarks, outl
 \item Combining the results from Sections \ref{subsec:perf:datacopy} and \ref{subsec:perf:submitmethod}, we posit that for situations with smaller transfer sizes and a high number of tasks, splitting a copy might prove disadvantageous. Because the split incurs additional submission delay while overall throughput remains high without it due to queue filling (see Section \ref{subsec:perf:mtsubmit}), splitting might reduce overall effectiveness. To utilize the available resources effectively, distributing whole tasks across the available \gls{dsa}s is still desirable. This finding led us to implement round-robin balancing in Section \ref{sec:impl:application}; a minimal sketch follows this list.
 \end{itemize}
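The round-robin balancing referenced in the last item could, in its simplest form, look like the following sketch. Class and member names are illustrative assumptions, not the interface of the implementation referenced above; the idea is only that an atomic ticket counter spreads whole tasks evenly over the available DSAs instead of splitting a single copy.

#include <atomic>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative sketch of round-robin distribution of complete copy
// tasks over the available DSAs; names and structure are assumptions.
class RoundRobinBalancer {
public:
    explicit RoundRobinBalancer(std::vector<int> dsa_nodes)
        : nodes_(std::move(dsa_nodes)) {}

    // Returns the NUMA node of the DSA that should receive the next task.
    int Next() {
        const uint64_t ticket = counter_.fetch_add(1, std::memory_order_relaxed);
        return nodes_[ticket % nodes_.size()];
    }

private:
    std::vector<int> nodes_;
    std::atomic<uint64_t> counter_{0};
};

Each submitting control flow would call Next() and direct its complete task to the returned node, matching the finding that distributing whole tasks is preferable to splitting them at small transfer sizes.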
-\pagebreak
-
 Once again, we refer to Figures \ref{fig:perf-dsa} and \ref{fig:perf-cpu}, both representing the maximum throughput achieved with the utilization of the \gls{dsa} for the former and the CPU for the latter. Noticeably, the \gls{dsa} does not seem to suffer from inter-socket overhead like the CPU. The \gls{dsa} performs similarly to the CPU for intra-node data movement, while outperforming it in inter-node scenarios. The latter, as mentioned in Section \ref{subsec:perf:datacopy}, might point to an architectural advantage of the \gls{dsa}. The performance observed in the above benchmarks demonstrates potential for rapid data movement while simultaneously relieving the CPU of this task and thereby freeing capacity that is then available for computational tasks. \par
 
 We encountered an anomaly on \gls{numa:node} 12 for which we were unable to find an explanation. Since this behaviour is also observed on the CPU, pointing to a problem in the system architecture, identifying the root cause falls beyond the scope of this work. Despite being unable to account for all measurements, this chapter still offers valuable insights into the performance of the \gls{dsa}, highlighting both its strengths and weaknesses. It provides data-driven guidance for a complex architecture, aiding in the determination of an optimal utilization strategy for the \gls{dsa}. \par
diff --git a/thesis/content/40_design.tex b/thesis/content/40_design.tex
index 70de22d..8c9fd66 100644
--- a/thesis/content/40_design.tex
+++ b/thesis/content/40_design.tex
@@ -73,8 +73,8 @@ To configure runtime behaviour for page fault handling and access type, we intro
 
 \begin{figure}[t]
 	\centering
-	\includegraphics[width=0.7\textwidth]{images/overlapping-pointer-access.pdf}
-	\caption{Public Interface of \texttt{CacheData} and \texttt{Cache} Classes. Colour coding for thread safety. Grey denotes impossibility for threaded access. Green indicates full safety guarantees only relying on atomics to achieve this. Yellow may use locking but is still safe for use. Red must be called from a single threaded context.}
+	\includegraphics[width=0.4\textwidth]{images/overlapping-pointer-access.pdf}
+	\caption{Visualization of a memory region and pointers \(A\) and \(B\). Due to the size of the data pointed to, the two memory regions overlap. The overlapping region is highlighted by the dashes.}
 	\label{fig:overlapping-access}
 \end{figure}
diff --git a/thesis/content/50_implementation.tex b/thesis/content/50_implementation.tex
index ec55f06..f807837 100644
--- a/thesis/content/50_implementation.tex
+++ b/thesis/content/50_implementation.tex
@@ -53,11 +53,11 @@ Upon call to \texttt{Cache::Access}, coordination must take place to only add on
 
 \begin{figure}[t]
 	\centering
 	\includegraphics[width=0.9\textwidth]{images/seq-blocking-wait.pdf}
-	\caption{Sequence for Blocking Scenario. Observable in first draft implementation. Scenario where \(T_1\) performed first access to a datum followed \(T_2\). Then \(T_1\) holds the handlers exclusively, leading to \(T_2\) having to wait for \(T_1\) to perform the work submission and waiting.}
+	\caption{Sequence for Blocking Scenario. Observable in the first draft implementation. Scenario where \(T_1\) performed the first access to a datum, followed by \(T_2\). \(T_2\) calls \texttt{CacheData::WaitOnCompletion} before \(T_1\) submits the work and adds the handlers to the shared instance. Thereby, \(T_2\) gets stuck waiting for the cache pointer to be updated, making \(T_1\) responsible for permitting \(T_2\) to progress.}
 	\label{fig:impl-cachedata-threadseq-waitoncompletion}
 \end{figure}
 
-In the first implementation, a thread would check if the handlers are available and wait on a value change from \texttt{nullptr}, if they are not. As the handlers are only available after submission, a situation could arise where only one copy of \texttt{CacheData} is capable of actually waiting on them. To illustrate this, an exemplary scenario is used, as seen in the sequence diagram Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}. Assume that two threads \(T_1\) and \(T_2\) wish to access the same resource. \(T_1\) is the first to call \texttt{CacheData::Access} and therefore adds it to the cache state and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted and \(T_2\) obtains access to the incomplete \texttt{CacheData} on which it calls \texttt{CacheData::WaitOnCompletion}, causing it to see \texttt{nullptr} for the cache pointer and handlers. Therefore, \(T_2\) waits on the cache pointer becoming valid (marked blue lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). Then \(T_1\) submits the work and sets the handlers (marked red lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) continues to wait on the cache pointer. Therefore, only \(T_1\) can trigger the waiting on the handlers and is therefore capable of keeping \(T_2\) from progressing. This is undesirable as it can lead to deadlocking if \(T_1\) does not wait and at the very least may lead to unnecessary delay for \(T_2\). \par
+In the first implementation, a thread would check whether the handlers are available and, if they are not, wait for the value to change from \texttt{nullptr}. As the handlers only become available after submission, a situation could arise in which only one copy of \texttt{CacheData} is capable of actually waiting on them. To illustrate this, consider the exemplary scenario shown in the sequence diagram in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}. Assume that two threads \(T_1\) and \(T_2\) wish to access the same resource. \(T_1\) is the first to call \texttt{CacheData::Access}; it therefore adds the entry to the cache state and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted, and \(T_2\) obtains access to the incomplete \texttt{CacheData}, on which it calls \texttt{CacheData::WaitOnCompletion}, causing it to see \texttt{nullptr} for both the cache pointer and the handlers. \(T_2\) then waits for the cache pointer to become valid, followed by \(T_1\) submitting the work and setting the handlers. Therefore, only \(T_1\) can trigger the waiting on the handlers and is thus capable of keeping \(T_2\) from progressing. This is undesirable, as it can lead to a deadlock if \(T_1\) does not wait, and at the very least causes unnecessary delay for \(T_2\). \par
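The following reduced C++ sketch illustrates the first-draft logic just described; member names and types are hypothetical simplifications of CacheData, not its actual interface. Because the handlers are only set by the submitting thread, any thread arriving before submission sees nullptr and falls back to waiting on the cache pointer, which in turn is only published after the submitting thread has itself waited on the handlers.

#include <atomic>
#include <thread>

// Reduced sketch of the first-draft wait logic with hypothetical member
// names. handlers_ is only set by the thread that submits the work (T1);
// a thread arriving before submission (T2) sees nullptr and must instead
// wait on cache_, which T1 only publishes after waiting on the handlers.
struct CacheDataSketch {
    std::atomic<void*> cache_{nullptr};
    std::atomic<void*> handlers_{nullptr};

    void WaitOnCompletion() {
        if (handlers_.load(std::memory_order_acquire) == nullptr) {
            // T2's path: stuck here until T1 has submitted, waited and
            // published the cache pointer, so T1 gates T2's progress.
            while (cache_.load(std::memory_order_acquire) == nullptr)
                std::this_thread::yield();
            return;
        }
        // T1's path: wait on the handlers, then set cache_ so that any
        // threads blocked above may continue.
    }
};

This makes the dependency visible: T2 never reaches the handlers at all, so correctness hinges entirely on T1 eventually calling WaitOnCompletion itself.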
 
 \pagebreak
diff --git a/thesis/images/image-source/sequenzdiagramm-waitoncompletion.xml b/thesis/images/image-source/sequenzdiagramm-waitoncompletion.xml
index 7e769c9..f56544a 100644
--- a/thesis/images/image-source/sequenzdiagramm-waitoncompletion.xml
+++ b/thesis/images/image-source/sequenzdiagramm-waitoncompletion.xml
diff --git a/thesis/images/seq-blocking-wait.pdf b/thesis/images/seq-blocking-wait.pdf
index d1f14e4..9d8af1b 100644
Binary files a/thesis/images/seq-blocking-wait.pdf and b/thesis/images/seq-blocking-wait.pdf differ