
fix sequence diagram thread numbering and update figure appearance, necessitating some small text changes and pagebreaks

master
Constantin Fürst 3 months ago
parent
commit
264db5d9d7
  1. BIN  thesis/bachelor.pdf
  2. 16   thesis/content/30_performance.tex
  3. 4    thesis/content/40_design.tex
  4. 4    thesis/content/50_implementation.tex
  5. 12   thesis/images/image-source/sequenzdiagramm-waitoncompletion.xml
  6. BIN  thesis/images/seq-blocking-wait.pdf

BIN
thesis/bachelor.pdf

16
thesis/content/30_performance.tex

@@ -17,9 +17,7 @@ The benchmarks were conducted on a dual-socket server equipped with two Intel Xe
As \gls{intel:dml} does not have support for \glsentryshort{dsa:dwq}s (see Section \ref{sec:state:dml}), we run benchmarks exclusively with access through \glsentryshort{dsa:swq}s. The application written for the benchmarks can be inspected in source form under the directory \texttt{benchmarks} in the thesis repository \cite{thesis-repo}. \par
The benchmark performs \gls{numa:node}-setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the \gls{numa:node}s passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts. We therefore also do not use \enquote{.block\_on\_fault()}, as we did for the memcpy-example in Section \ref{sec:state:dml}. \par
Timing in the outer loop may display lower throughput than actually achieved. This is the case should one of the \gls{dsa}s participating in a given task finish earlier than the others. We decided to measure the maximum time and therefore minimum throughput for these cases, as we want the benchmarks to represent the peak achievable for distributing one task over multiple engines and not executing multiple tasks of a disjoint set. As a task can only be considered complete when all subtasks are completed, the minimum throughput represents this scenario. This may give an advantage to the peak CPU throughput benchmark we will reference later on, as it does not have this restriction placed upon it. \par
\pagebreak
\begin{figure}[t]
\centering
@@ -28,6 +26,10 @@ Timing in the outer loop may display lower throughput than actual. This is the c
\label{fig:benchmark-function:outer}
\end{figure}
The benchmark performs \gls{numa:node}-setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the \gls{numa:node}s passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts, as demonstrated by the calls to \texttt{memset} in Figure \ref{fig:benchmark-function:outer}. We therefore also do not use \enquote{.block\_on\_fault()}, as we did for the memcpy-example in Section \ref{sec:state:dml}. \par
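To illustrate the described pre-faulting, the following minimal sketch allocates a region on a given \gls{numa:node} and touches every byte before any timed work; the use of \texttt{numa\_alloc\_onnode} from \texttt{libnuma} and the helper name are assumptions for illustration and do not reproduce the benchmark code verbatim. \par
\begin{verbatim}
#include <cstddef>
#include <cstring>
#include <numa.h>

// Illustrative helper: allocate memory bound to the given NUMA node and
// write to every byte so that no page faults occur in the timed region.
void* alloc_prefaulted(std::size_t size, int node) {
    void* region = numa_alloc_onnode(size, node);  // bind allocation to node
    if (region == nullptr) return nullptr;
    std::memset(region, 0xAB, size);               // pre-fault all pages
    return region;
}
\end{verbatim}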
Timing in the outer loop may display lower throughput than actually achieved. This is the case should one of the \gls{dsa}s participating in a given task finish earlier than the others. We decided to measure the maximum time and therefore minimum throughput for these cases, as we want the benchmarks to represent the peak achievable for distributing one task over multiple engines and not executing multiple tasks of a disjoint set. As a task can only be considered complete when all subtasks are completed, the minimum throughput represents this scenario. This may give an advantage to the peak CPU throughput benchmark we will reference later on, as it does not have this restriction placed upon it. \par
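The aggregation described above can be made explicit with the following hedged sketch, which derives throughput from the slowest participating engine, i.e. the maximum of the per-engine durations; variable and function names are illustrative and not taken from the benchmark sources. \par
\begin{verbatim}
#include <algorithm>
#include <cstdint>
#include <vector>

// A split task is only complete once its slowest engine finishes, so the
// reported throughput divides the total size by the maximum duration.
double task_throughput_gibps(std::uint64_t total_bytes,
                             const std::vector<double>& seconds_per_engine) {
    const double slowest = *std::max_element(seconds_per_engine.begin(),
                                             seconds_per_engine.end());
    return (static_cast<double>(total_bytes) / slowest)
           / (1024.0 * 1024.0 * 1024.0);
}
\end{verbatim}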
\begin{figure}[t]
\centering
\includegraphics[width=1.0\textwidth]{images/nsd-benchmark-inner.pdf}
@@ -151,10 +153,14 @@ For the results of the Brute-Force approach illustrated in Figure \ref{fig:perf-
When comparing the Brute-Force approach with Push-Pull in Figure \ref{fig:perf-dsa-analysis:scaling}, average performance decreases while four times the resources are utilized over a longer duration. As shown in Figure \ref{fig:perf-dsa:2}, using Brute-Force still leads to a slight increase in throughput for inter-socket operations, although far from scaling linearly. Therefore, we conclude that, although data movement across the interconnect incurs additional cost, no hard bandwidth limit is observable, reaching the same peak speed also observed for intra-socket with four \gls{dsa}s. This might point to an architectural advantage, as we will encounter the expected speed reduction for copies crossing the socket boundary when executed on the CPU in Section \ref{subsec:perf:cpu-datacopy}. \par
\pagebreak
From the average throughput and scaling factors in Figure \ref{fig:perf-dsa-analysis}, it becomes evident that splitting tasks over more than two \gls{dsa}s yields only marginal gains. This could be due to increased congestion of the processors' interconnect; however, as no hard limit is encountered, this is not a definitive answer. Utilizing off-socket \gls{dsa}s proves disadvantageous in the average case, reinforcing the claim of communication overhead for crossing the socket boundary. \par
The choice of a load balancing method is not trivial. Consulting Figure \ref{fig:perf-dsa-analysis:average}, the highest throughput is achieved by using four \gls{dsa}s. At the same time, this causes high system utilization, making it unsuitable for situations where resources are to be distributed among multiple control flows. For this case, Push-Pull achieves performance close to the real-world peak while also not wasting resources due to poor scaling (see Figure \ref{fig:perf-dsa-analysis:scaling}). \par
\pagebreak
\begin{figure}[t]
\centering
\begin{subfigure}[t]{0.35\textwidth}
@@ -181,6 +187,8 @@ For evaluating CPU copy performance we use the benchmark code from the previous
As evident from Figure \ref{fig:perf-cpu:swpath}, the observed throughput of the software path is less than half of the theoretical bandwidth. Therefore, the software path is to be treated as a compatibility measure and not as a means of providing high-performance data copy operations. In Figure \ref{fig:perf-cpu:andrepeak}, peak throughput is achieved for intra-node operation, validating the assumption that there is a cost for communicating across sockets, which was not as directly observable with the \gls{dsa}. The same disadvantage for \gls{numa:node} 12, as observed in Section \ref{subsec:perf:datacopy}, can be seen in Figure \ref{fig:perf-cpu}. This points to an architectural anomaly which we could not explain with our knowledge or benchmarks. Further benchmarks were conducted for this, not yielding conclusive results, as the source and copy-thread location did not seem to affect the observed speed delta. \par
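For reference, the following hedged sketch shows how the execution path is selected through the high-level C++ interface of \gls{intel:dml}; it is a simplified, assumption-laden example and not the benchmark implementation itself. \par
\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <dml/dml.hpp>

// Copy a buffer via the software path of Intel DML; substituting
// dml::hardware for dml::software dispatches the same operation to the
// DSA instead. Timing and NUMA setup of the benchmark are omitted.
dml::status_code copy_via_software_path(std::uint8_t* src, std::uint8_t* dst,
                                        std::size_t size) {
    auto result = dml::execute<dml::software>(
        dml::mem_move,
        dml::make_view(src, size),
        dml::make_view(dst, size));
    return result.status;
}
\end{verbatim}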
\pagebreak
\section{Analysis}
\label{sec:perf:analysis}
@@ -193,8 +201,6 @@ In this section we summarize the conclusions from the performed benchmarks, outl
\item Combining the results from Sections \ref{subsec:perf:datacopy} and \ref{subsec:perf:submitmethod}, we posit that for situations with smaller transfer sizes and a high number of tasks, splitting a copy might prove disadvantageous. As the split incurs additional submission delay while overall throughput remains high without it due to queue filling (see Section \ref{subsec:perf:mtsubmit}), it might reduce overall effectiveness. To utilize the available resources effectively, distributing tasks across the available \gls{dsa}s is still desirable. This finding led us to implement round-robin balancing in Section \ref{sec:impl:application}, for which a minimal sketch follows this list.
\end{itemize}
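The round-robin balancing referred to in the last item can be pictured with the following minimal sketch; the class and member names are chosen for illustration and do not mirror the implementation in Section \ref{sec:impl:application}. \par
\begin{verbatim}
#include <atomic>
#include <cstddef>

// Illustrative round-robin selector: every submission picks the next DSA
// in turn, spreading whole tasks across engines without splitting copies.
class RoundRobinBalancer {
public:
    explicit RoundRobinBalancer(std::size_t dsa_count)
        : dsa_count_(dsa_count) {}

    std::size_t NextDsa() {
        return next_.fetch_add(1, std::memory_order_relaxed) % dsa_count_;
    }

private:
    const std::size_t dsa_count_;
    std::atomic<std::size_t> next_{0};
};
\end{verbatim}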
\pagebreak
Once again, we refer to Figures \ref{fig:perf-dsa} and \ref{fig:perf-cpu}, both representing the maximum throughput achieved with the utilization of either \gls{dsa} for the former and CPU for the latter. Noticeably, the \gls{dsa} does not seem to suffer from inter-socket overhead like the CPU. The \gls{dsa} performs similarly to the CPU for intra-node data movement, while outperforming it in inter-node scenarios. The latter, as mentioned in Section \ref{subsec:perf:datacopy}, might point to an architectural advantage of the \gls{dsa}. The performance observed in the above benchmarks demonstrates potential for rapid data movement while simultaneously relieving the CPU of this task and thereby freeing capacity that is then available for computational tasks. \par
We encountered an anomaly on \gls{numa:node} 12 for which we were unable to find an explanation. Since this behaviour is also observed on the CPU, pointing to a problem in the system architecture, identifying the root cause falls beyond the scope of this work. Despite being unable to account for all measurements, this chapter still offers valuable insights into the performance of the \gls{dsa}, highlighting both its strengths and weaknesses. It provides data-driven guidance for a complex architecture, aiding in the determination of an optimal utilization strategy for the \gls{dsa}. \par

4
thesis/content/40_design.tex

@@ -73,8 +73,8 @@ To configure runtime behaviour for page fault handling and access type, we intro
\begin{figure}[t]
\centering
\includegraphics[width=0.7\textwidth]{images/overlapping-pointer-access.pdf}
\caption{Public Interface of \texttt{CacheData} and \texttt{Cache} Classes. Colour coding for thread safety. Grey denotes impossibility for threaded access. Green indicates full safety guarantees only relying on atomics to achieve this. Yellow may use locking but is still safe for use. Red must be called from a single threaded context.}
\includegraphics[width=0.4\textwidth]{images/overlapping-pointer-access.pdf}
\caption{Visualization of a memory region and pointers \(A\) and \(B\). Due to the size of the data pointed to, the two memory regions overlap. The overlapping region is highlighted by the dashes.}
\label{fig:overlapping-access}
\end{figure}
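To make the overlap shown in Figure \ref{fig:overlapping-access} concrete, the following sketch checks whether the regions referenced by two pointers intersect; the function and parameter names are illustrative only. \par
\begin{verbatim}
#include <cstddef>
#include <cstdint>

// Regions [a, a + size_a) and [b, b + size_b) overlap exactly when each
// region starts before the other one ends (illustrative helper).
bool regions_overlap(const void* a, std::size_t size_a,
                     const void* b, std::size_t size_b) {
    const auto a_begin = reinterpret_cast<std::uintptr_t>(a);
    const auto b_begin = reinterpret_cast<std::uintptr_t>(b);
    return a_begin < b_begin + size_b && b_begin < a_begin + size_a;
}
\end{verbatim}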

4
thesis/content/50_implementation.tex

@@ -53,11 +53,11 @@ Upon call to \texttt{Cache::Access}, coordination must take place to only add on
\begin{figure}[t]
\centering
\includegraphics[width=0.9\textwidth]{images/seq-blocking-wait.pdf}
\caption{Sequence for Blocking Scenario. Observable in first draft implementation. Scenario where \(T_1\) performed first access to a datum followed by \(T_2\). Then \(T_1\) holds the handlers exclusively, leading to \(T_2\) having to wait for \(T_1\) to perform the work submission and waiting.}
\caption{Sequence for Blocking Scenario. Observable in first draft implementation. Scenario where \(T_1\) performed first access to a datum followed by \(T_2\). \(T_2\) calls \texttt{CacheData::WaitOnCompletion} before \(T_1\) submits the work and adds the handlers to the shared instance. Thereby, \(T_2\) gets stuck waiting on the cache pointer to be updated, making \(T_1\) responsible for permitting \(T_2\) to progress.}
\label{fig:impl-cachedata-threadseq-waitoncompletion}
\end{figure}
In the first implementation, a thread would check if the handlers are available and wait on a value change from \texttt{nullptr}, if they are not. As the handlers are only available after submission, a situation could arise where only one copy of \texttt{CacheData} is capable of actually waiting on them. To illustrate this, an exemplary scenario is used, as seen in the sequence diagram Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}. Assume that two threads \(T_1\) and \(T_2\) wish to access the same resource. \(T_1\) is the first to call \texttt{CacheData::Access} and therefore adds it to the cache state and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted and \(T_2\) obtains access to the incomplete \texttt{CacheData} on which it calls \texttt{CacheData::WaitOnCompletion}, causing it to see \texttt{nullptr} for the cache pointer and handlers. Therefore, \(T_2\) waits on the cache pointer becoming valid (marked blue lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}). Then \(T_1\) submits the work and sets the handlers (marked red lines in Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}), while \(T_2\) continues to wait on the cache pointer. Therefore, only \(T_1\) can trigger the waiting on the handlers and is therefore capable of keeping \(T_2\) from progressing. This is undesirable as it can lead to deadlocking if \(T_1\) does not wait and at the very least may lead to unnecessary delay for \(T_2\). \par
In the first implementation, a thread would check if the handlers are available and wait on a value change from \texttt{nullptr}, if they are not. As the handlers are only available after submission, a situation could arise where only one copy of \texttt{CacheData} is capable of actually waiting on them. To illustrate this, an exemplary scenario is used, as seen in the sequence diagram Figure \ref{fig:impl-cachedata-threadseq-waitoncompletion}. Assume that two threads \(T_1\) and \(T_2\) wish to access the same resource. \(T_1\) is the first to call \texttt{CacheData::Access} and therefore adds it to the cache state and will perform the work submission. Before \(T_1\) may submit the work, it is interrupted and \(T_2\) obtains access to the incomplete \texttt{CacheData} on which it calls \texttt{CacheData::WaitOnCompletion}, causing it to see \texttt{nullptr} for the cache pointer and handlers. \(T_2\) then waits on the cache pointer becoming valid, followed by \(T_1\) submitting the work and setting the handlers. Therefore, only \(T_1\) can trigger the waiting on the handlers and is capable of keeping \(T_2\) from progressing. This is undesirable as it can lead to deadlocking if \(T_1\) does not wait and at the very least may lead to unnecessary delay for \(T_2\). \par
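The problematic wait described above can be reduced to the following sketch; the atomic members and the busy-wait are simplified stand-ins and do not mirror the actual \texttt{CacheData} implementation. \par
\begin{verbatim}
#include <atomic>
#include <thread>

// Simplified stand-in for the first-draft behaviour: a waiter that only
// observes the cache pointer never waits on the handlers itself and thus
// depends on the submitting thread to drive completion.
struct CacheDataSketch {
    std::atomic<void*> cache_{nullptr};     // set once the copy is complete
    std::atomic<void*> handlers_{nullptr};  // set by T1 after submission

    void WaitOnCompletionDraft() {
        // T2 enters here before T1 has submitted: handlers_ is still
        // nullptr, so this copy can only spin on cache_ and relies on T1
        // to wait on the handlers and publish the cache pointer.
        while (cache_.load(std::memory_order_acquire) == nullptr) {
            std::this_thread::yield();
        }
    }
};
\end{verbatim}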
\pagebreak

12
thesis/images/image-source/sequenzdiagramm-waitoncompletion.xml

@@ -1,6 +1,6 @@
<mxfile host="app.diagrams.net" modified="2024-02-19T15:56:33.643Z" agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36" etag="rQ8FcWYpjkzjEaIhE64A" version="23.1.5" type="device">
<mxfile host="app.diagrams.net" modified="2024-02-19T16:27:36.947Z" agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36" etag="wIL1mBWOSZCqiDPh49Ts" version="23.1.5" type="device">
<diagram name="Page-1" id="2YBvvXClWsGukQMizWep">
<mxGraphModel dx="1195" dy="669" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="700" pageHeight="350" math="0" shadow="0">
<mxGraphModel dx="1434" dy="803" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="700" pageHeight="350" math="0" shadow="0">
<root>
<mxCell id="0" />
<mxCell id="1" parent="0" />
@@ -10,7 +10,7 @@
<mxCell id="aM9ryv3xv72pqoxQDRHE-2" value="" style="html=1;points=[];perimeter=orthogonalPerimeter;outlineConnect=0;targetShapes=umlLifeline;portConstraint=eastwest;newEdgeStyle={&quot;edgeStyle&quot;:&quot;elbowEdgeStyle&quot;,&quot;elbow&quot;:&quot;vertical&quot;,&quot;curved&quot;:0,&quot;rounded&quot;:0};" parent="aM9ryv3xv72pqoxQDRHE-1" vertex="1">
<mxGeometry x="45" y="110" width="10" height="170" as="geometry" />
</mxCell>
<mxCell id="RjI6kM-8N4aADmqfwwHL-1" value="Thread 1" style="shape=umlLifeline;perimeter=lifelinePerimeter;whiteSpace=wrap;html=1;container=0;dropTarget=0;collapsible=0;recursiveResize=0;outlineConnect=0;portConstraint=eastwest;newEdgeStyle={&quot;edgeStyle&quot;:&quot;elbowEdgeStyle&quot;,&quot;elbow&quot;:&quot;vertical&quot;,&quot;curved&quot;:0,&quot;rounded&quot;:0};" parent="1" vertex="1">
<mxCell id="RjI6kM-8N4aADmqfwwHL-1" value="Thread 2" style="shape=umlLifeline;perimeter=lifelinePerimeter;whiteSpace=wrap;html=1;container=0;dropTarget=0;collapsible=0;recursiveResize=0;outlineConnect=0;portConstraint=eastwest;newEdgeStyle={&quot;edgeStyle&quot;:&quot;elbowEdgeStyle&quot;,&quot;elbow&quot;:&quot;vertical&quot;,&quot;curved&quot;:0,&quot;rounded&quot;:0};" parent="1" vertex="1">
<mxGeometry x="290" y="30" width="100" height="300" as="geometry" />
</mxCell>
<mxCell id="RjI6kM-8N4aADmqfwwHL-2" value="" style="html=1;points=[];perimeter=orthogonalPerimeter;outlineConnect=0;targetShapes=umlLifeline;portConstraint=eastwest;newEdgeStyle={&quot;edgeStyle&quot;:&quot;elbowEdgeStyle&quot;,&quot;elbow&quot;:&quot;vertical&quot;,&quot;curved&quot;:0,&quot;rounded&quot;:0};" parent="RjI6kM-8N4aADmqfwwHL-1" vertex="1">
@@ -19,7 +19,7 @@
<mxCell id="RjI6kM-8N4aADmqfwwHL-29" value="" style="html=1;points=[];perimeter=orthogonalPerimeter;outlineConnect=0;targetShapes=umlLifeline;portConstraint=eastwest;newEdgeStyle={&quot;edgeStyle&quot;:&quot;elbowEdgeStyle&quot;,&quot;elbow&quot;:&quot;vertical&quot;,&quot;curved&quot;:0,&quot;rounded&quot;:0};" parent="RjI6kM-8N4aADmqfwwHL-1" vertex="1">
<mxGeometry x="45" y="260" width="10" height="20" as="geometry" />
</mxCell>
<mxCell id="RjI6kM-8N4aADmqfwwHL-3" value="Thread 2" style="shape=umlLifeline;perimeter=lifelinePerimeter;whiteSpace=wrap;html=1;container=0;dropTarget=0;collapsible=0;recursiveResize=0;outlineConnect=0;portConstraint=eastwest;newEdgeStyle={&quot;edgeStyle&quot;:&quot;elbowEdgeStyle&quot;,&quot;elbow&quot;:&quot;vertical&quot;,&quot;curved&quot;:0,&quot;rounded&quot;:0};" parent="1" vertex="1">
<mxCell id="RjI6kM-8N4aADmqfwwHL-3" value="Thread 1" style="shape=umlLifeline;perimeter=lifelinePerimeter;whiteSpace=wrap;html=1;container=0;dropTarget=0;collapsible=0;recursiveResize=0;outlineConnect=0;portConstraint=eastwest;newEdgeStyle={&quot;edgeStyle&quot;:&quot;elbowEdgeStyle&quot;,&quot;elbow&quot;:&quot;vertical&quot;,&quot;curved&quot;:0,&quot;rounded&quot;:0};" parent="1" vertex="1">
<mxGeometry x="410" y="30" width="100" height="300" as="geometry" />
</mxCell>
<mxCell id="RjI6kM-8N4aADmqfwwHL-4" value="" style="html=1;points=[];perimeter=orthogonalPerimeter;outlineConnect=0;targetShapes=umlLifeline;portConstraint=eastwest;newEdgeStyle={&quot;edgeStyle&quot;:&quot;elbowEdgeStyle&quot;,&quot;elbow&quot;:&quot;vertical&quot;,&quot;curved&quot;:0,&quot;rounded&quot;:0};" parent="RjI6kM-8N4aADmqfwwHL-3" vertex="1">
@@ -74,7 +74,7 @@
<mxPoint as="offset" />
</mxGeometry>
</mxCell>
<mxCell id="7ueDTxc7tV6g5FLCigbC-2" value="T2: wait" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" vertex="1" connectable="0" parent="RjI6kM-8N4aADmqfwwHL-20">
<mxCell id="7ueDTxc7tV6g5FLCigbC-2" value="T2: wait" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];" parent="RjI6kM-8N4aADmqfwwHL-20" vertex="1" connectable="0">
<mxGeometry x="-0.0889" y="-16" relative="1" as="geometry">
<mxPoint x="-1" y="6" as="offset" />
</mxGeometry>
@@ -104,7 +104,7 @@
<mxPoint x="365" y="300.1" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="7ueDTxc7tV6g5FLCigbC-1" value="return" style="html=1;verticalAlign=bottom;endArrow=classicThin;edgeStyle=elbowEdgeStyle;elbow=vertical;curved=0;rounded=0;endFill=1;dashed=1;" edge="1" parent="1" source="aM9ryv3xv72pqoxQDRHE-2" target="RjI6kM-8N4aADmqfwwHL-4">
<mxCell id="7ueDTxc7tV6g5FLCigbC-1" value="return" style="html=1;verticalAlign=bottom;endArrow=classicThin;edgeStyle=elbowEdgeStyle;elbow=vertical;curved=0;rounded=0;endFill=1;dashed=1;" parent="1" source="aM9ryv3xv72pqoxQDRHE-2" target="RjI6kM-8N4aADmqfwwHL-4" edge="1">
<mxGeometry relative="1" as="geometry">
<mxPoint x="190" y="269.17" as="sourcePoint" />
<Array as="points">

BIN
thesis/images/seq-blocking-wait.pdf
