
modify throughput scaling to display factor-of-baseline and not scaling-factor

master
Constantin Fürst, 3 months ago
commit 908c245dca
1. BIN benchmarks/benchmark-plots/plot-dsa-throughput-scaling.pdf
2. BIN benchmarks/benchmark-plotters/__pycache__/common.cpython-39.pyc
3. benchmarks/benchmark-plotters/plot-perf-peakthroughput-bar.py (5 lines changed)
4. thesis/content/30_performance.tex (35 lines changed)
5. BIN thesis/images/plot-dsa-throughput-scaling.pdf

BIN benchmarks/benchmark-plots/plot-dsa-throughput-scaling.pdf

BIN benchmarks/benchmark-plotters/__pycache__/common.cpython-39.pyc

benchmarks/benchmark-plotters/plot-perf-peakthroughput-bar.py (5 lines changed)

@@ -113,7 +113,7 @@ def main(node_config):
def get_scaling_factor(baseline,topline,utilfactor):
-return (topline / baseline) * (1 / utilfactor)
+return (topline / baseline)
if __name__ == "__main__":
@@ -151,8 +151,7 @@ if __name__ == "__main__":
plt.figure(figsize=(2, 3))
fig = sns.lineplot(x=x_dsacount, y=y_scaling, data=scaling_df, marker='o', linestyle='-', color='b', markersize=8)
plt.xticks([1,2,4,8])
-plt.yticks([0.25,0.5,0.75,1.0])
plt.xlim(0,9)
-plt.ylim(0.2,1.05)
+plt.ylim(0.9,2.1)
plt.savefig(os.path.join(output_path, f"plot-dsa-throughput-scaling.pdf"), bbox_inches='tight')
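
The effect of this change can be checked in isolation with a small standalone sketch. The function names below mirror the plotter script, but the throughput values are hypothetical and only serve to contrast the old normalized scaling factor with the new factor-of-baseline, which also explains the adjusted y-axis range of 0.9 to 2.1.

# Standalone sketch contrasting the two calculations; throughput numbers are hypothetical.

def get_scaling_factor_old(baseline, topline, utilfactor):
    # previous behaviour: gain over baseline, normalized by the amount of DSAs used
    return (topline / baseline) * (1 / utilfactor)

def get_scaling_factor_new(baseline, topline):
    # new behaviour: plain factor over the single-DSA baseline
    return topline / baseline

if __name__ == "__main__":
    baseline = 30.0  # roughly the single-DSA throughput in GiB/s reported in the thesis
    # hypothetical throughput values for 1, 2, 4 and 8 participating DSAs
    for dsa_count, throughput in [(1, 30.0), (2, 55.0), (4, 60.0), (8, 62.0)]:
        old = get_scaling_factor_old(baseline, throughput, dsa_count)
        new = get_scaling_factor_new(baseline, throughput)
        print(f"{dsa_count} DSA(s): old={old:.2f}, new={new:.2f}")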

thesis/content/30_performance.tex (35 lines changed)

@@ -42,9 +42,7 @@ For all \gls{dsa}s used in the benchmark, a submission thread executing the inner
\section{Benchmarks}
\label{sec:perf:bench}
-In this Section, we will present three benchmarks, each accompanied by setup information and a preview. We will then provide plots displaying the results, followed by a detailed analysis. We will formulate expectations and compare them with the observations from our measurements. \par
-\todo{reformulate}
+In this section, we will introduce three benchmarks, providing setup information and a brief overview of each. Subsequently, we will present plots illustrating the results, followed by a comprehensive analysis. Our approach involves formulating expectations and juxtaposing them with the observations derived from our measurements. \par
\subsection{Submission Method}
\label{subsec:perf:submitmethod}
@@ -87,7 +85,7 @@ Moving data from \glsentryshort{dram} to \gls{hbm} is most relevant to the rest
\[8800\ MT/s \times 8B/T = 70400 \times 10^6 B/s = 65.56\ GiB/s\]
-From the observed bandwidth limitation of a single \gls{dsa} situated at about \(30\ GiB/s\) (see Section \ref{subsec:perf:submitmethod}) and the available memory bandwidth of \(65.56 GiB/s\), we conclude that a copy task has to be split across multiple \gls{dsa}s to achieve peak throughput. Different methods of splitting will be evaluated. Given that our system consists of multiple sockets, communication crossing between sockets could introduce latency and bandwidth disadvantages \cite{bench:heterogeneous-communication}, which we will also evaluate. Beyond two \gls{dsa}, marginal gains are to be expected, due to the throughput limitation of the available memory. \par
+From the observed bandwidth limitation of a single \gls{dsa} situated at about \(30\ GiB/s\) (see Section \ref{subsec:perf:submitmethod}) and the available memory bandwidth of \(65.56\ GiB/s\), we conclude that a copy task has to be split across multiple \gls{dsa}s to achieve peak throughput. Different methods of splitting will be evaluated. Given that our system consists of multiple sockets, communication crossing between sockets could introduce latency and bandwidth disadvantages \cite{bench:heterogeneous-communication}, which we will also evaluate. Beyond two \gls{dsa}, marginal gains are to be expected, due to the throughput limitation of the available memory. \par
To determine the optimal amount of \gls{dsa}s, we will measure throughput for one, two, four, and eight participating in the copy operations. We name the utilization of two \gls{dsa}s \enquote{Push-Pull}, as with two accelerators, we utilize the ones found on data source and destination \gls{numa:node}. As eight \gls{dsa}s is the maximum available on our system, this configuration will be referred to as \enquote{brute-force}. \par
@@ -144,22 +142,18 @@ For the results of the Brute-Force approach illustrated in Figure \ref{fig:perf-
\begin{subfigure}[t]{0.35\textwidth}
\centering
\includegraphics[width=\textwidth]{images/plot-dsa-throughput-scaling.pdf}
-\caption{Scaling Factor for different amounts of participating \gls{dsa}. Determined by formula \(Throughput\ /\ Baseline \times 1\ /\ Utilization\) with the Baseline being Throughput for one \gls{dsa} and the Utilization representing the factor of the amount of \gls{dsa}s being used over the baseline.}
+\caption{Scaling Factor for different amounts of participating \gls{dsa}. Calculated by dividing the throughput for a configuration by the throughput achieved with one \gls{dsa}. Linear scaling is desirable to not waste resources, therefore only the configurations with one and two \gls{dsa}s are determined to be effective.}
\label{fig:perf-dsa-analysis:scaling}
\end{subfigure}
\caption{Scalability Analysis for different amounts of participating \gls{dsa}s. Displays the average throughput and the derived scaling factor. Shows that, although the throughput does increase with adding more accelerators, beyond two, the gained speed drops significantly. Calculated over the results from Figure \ref{fig:perf-dsa} and therefore applies to copies from \glsentryshort{dram} to \glsentryshort{hbm}.}
\label{fig:perf-dsa-analysis}
\end{figure}
-\todo{tp scaling shows assumption: limited benefit of adding more than 2 dsa}
-When comparing the Brute-Force approach with Push-Pull in Figure \ref{fig:perf-dsa:2}, performance decreases by utilizing four times more resources over a longer duration. As shown in Figure \ref{fig:perf-dsa-analysis:scaling}, using Brute-Force still leads to a slight increase in overall throughput, although far from scaling linearly. Therefore, we conclude that, although data movement across the interconnect incurs additional cost, no hard bandwidth limit is observable. \todo{we state that it decreases but then that it increases} \par
+When comparing the Brute-Force approach with Push-Pull in Figure \ref{fig:perf-dsa-analysis:scaling}, average performance decreases by utilizing four times more resources over a longer duration. As shown in Figure \ref{fig:perf-dsa:2}, using Brute-Force still leads to a slight increase in throughput for inter-socket operations, although far from scaling linearly. Therefore, we conclude that, although data movement across the interconnect incurs additional cost, no hard bandwidth limit is observable, reaching the same peak speed also observed for intra-socket with four \gls{dsa}s. This might point to an architectural advantage, as we will encounter the expected speed reduction for copies crossing the socket boundary when executed on the CPU in Section \ref{subsec:perf:cpu-datacopy}. \par
From the average throughput and scaling factors in Figure \ref{fig:perf-dsa-analysis}, it becomes evident that splitting tasks over more than two \gls{dsa}s yields only marginal gains. This could be due to increased congestion of the overall interconnect, however, as no hard limit is encountered, this is not a definitive answer. \par
-The choice of a load balancing method is not trivial. If peak throughput of one task is of relevance, Brute-Force for inter-socket and four \gls{dsa}s for intra-socket operation result in the fastest transfers. At the same time, these cause high system utilization, making them unsuitable for situations where multiple tasks may be submitted. For these cases, Push-Pull achieves performance close to the real-world peak while also not wasting resources due to poor scaling — see Figure \ref{fig:perf-dsa-analysis:scaling}. \par
-\todo{dont choose brute here, 4dsa is best}
+The choice of a load balancing method is not trivial. Consulting Figure \ref{fig:perf-dsa-analysis:average}, the highest throughput is achieved by using four \gls{dsa}s. At the same time, this causes high system utilization, making it unsuitable for situations where resources are to be distributed among multiple control flows. For this case, Push-Pull achieves performance close to the real-world peak while also not wasting resources due to poor scaling (see Figure \ref{fig:perf-dsa-analysis:scaling}). \par
\subsection{Data Movement using CPU}
\label{subsec:perf:cpu-datacopy}
@@ -185,30 +179,23 @@ For evaluating CPU copy performance we use the benchmark code from the previous
\label{fig:perf-cpu}
\end{figure}
-As evident from Figure \ref{fig:perf-cpu:swpath}, the observed throughput of software path is less than half of the theoretical bandwidth. Therefore, software path is to be treated as a compatibility measure, and not for providing high performance data copy operations. As the sole benchmark, the software path however performs as expected for transfers to \gls{numa:node}s 12 and 15, with the latter performing worse \todo{false statement}. Taking the layout from Figure \ref{fig:perf-xeonmaxnuma}, back in the previous Section, we assumed that \gls{numa:node} 12 would outperform \gls{numa:node} 15 due to lower physical distance. This assumption was invalidated, making the result for CPU in this case unexpected. \par
-In Figure \ref{fig:perf-cpu:andrepeak}, peak throughput is achieved for intra-node operation. This validates the assumption that there is a cost for communicating across sockets, which was not as directly observable with the \gls{dsa}. The same disadvantage for \gls{numa:node} 12, as seen in Section \ref{subsec:perf:datacopy} can be observed in Figure \ref{fig:perf-cpu:andrepeak}. As the results from software path do not exhibit this, the anomaly seems to only occur in bandwidth-saturating scenarios \todo{false they exhibit it, but not as pronounced}. \par
-\todo{ensure correctness}
+As evident from Figure \ref{fig:perf-cpu:swpath}, the observed throughput of software path is less than half of the theoretical bandwidth. Therefore, software path is to be treated as a compatibility measure, and not for providing high performance data copy operations. In Figure \ref{fig:perf-cpu:andrepeak}, peak throughput is achieved for intra-node operation, validating the assumption that there is a cost for communicating across sockets, which was not as directly observable with the \gls{dsa}. The same disadvantage for \gls{numa:node} 12, as observed in Section \ref{subsec:perf:datacopy}, can be seen in Figure \ref{fig:perf-cpu}. This points to an architectural anomaly which we could not explain with our knowledge or benchmarks. Further benchmarks were conducted for this, not yielding conclusive results, as the source and copy-thread location did not seem to affect the observed speed delta. \par
\section{Analysis}
\label{sec:perf:analysis}
-In this section we summarize the conclusions drawn from the three benchmarks performed in the sections above and outline a utilization guideline \todo{we dont do this, either write it or leave this out}. We also compare CPU and \gls{dsa} for the task of copying data from \gls{dram} to \gls{hbm} \todo{weird wording}. \par
+In this section we summarize the conclusions from the performed benchmarks, outlining a utilization guideline. \par
\begin{itemize}
-\item From \ref{subsec:perf:submitmethod} we conclude that small copies under \(1\ MiB\) in size require batching and still do not reach peak performance. Task size should therefore be at or above \(1\ MiB\) and otherwise use the CPU \todo{why otherwise cpu, no direct link given}.
+\item From \ref{subsec:perf:submitmethod} we conclude that small copies under \(1\ MiB\) in size require batching and still do not reach peak performance. Task size should therefore be at or above \(1\ MiB\). Otherwise, offloading might prove more expensive than performing the copy on CPU.
\item Section \ref{subsec:perf:mtsubmit} assures that access from multiple threads does not negatively affect the performance when using \glsentrylong{dsa:swq} for work submission. Due to the lack of \glsentrylong{dsa:dwq} support, we have no data to determine the cost of submission to the \gls{dsa:swq}.
\item In \ref{subsec:perf:datacopy}, we found that using more than two \gls{dsa}s results in only marginal gains. The choice of a load balancer therefore is the Push-Pull configuration, as it achieves fair throughput with low utilization.
\item Combining the result from Sections \ref{subsec:perf:datacopy} and \ref{subsec:perf:submitmethod}, we posit that for situations with smaller transfer sizes and a high amount of tasks, splitting a copy might prove disadvantageous. Due to incurring more delay from submission and overall throughput still remaining high without the split due to queue filling (see Section \ref{subsec:perf:mtsubmit}), the split might reduce overall effectiveness. To still utilize the available resources effectively, distributing tasks across the available \gls{dsa}s is still desirable. This finding lead us to implement round-robin balancing in Section \ref{sec:impl:application}.
\end{itemize}
-Once again, we refer to Figures \ref{fig:perf-dsa} and \ref{fig:perf-cpu}, both representing the maximum throughput achieved with the utilization of either \gls{dsa} for the former and CPU for the latter. Noticeably, the \gls{dsa} does not seem to suffer from inter-socket overhead like the CPU. In any case, \gls{dsa} performs similar to the CPU, demonstrating potential for faster data movement while simultaneously freeing up cycles for other tasks. \todo{ensure correctness and reformulate} \par
-We discovered an anomaly for \gls{numa:node} 12 for which we did not find an explanation. As the behaviour is also exhibited by the CPU, discovering the root issue falls outside the scope of this work. \par
-\todo{mention measures undertaken to find an explanation here}
+Once again, we refer to Figures \ref{fig:perf-dsa} and \ref{fig:perf-cpu}, both representing the maximum throughput achieved with the utilization of either \gls{dsa} for the former and CPU for the latter. Noticeably, the \gls{dsa} does not seem to suffer from inter-socket overhead like the CPU. The \gls{dsa} performs similar to the CPU for intra-node data movement, while outperforming it in inter-node scenarios. The latter, as mentioned in Section \ref{subsec:perf:datacopy}, might point to an architectural advantage of the \gls{dsa}. The performance observed in the above benchmarks demonstrates potential for rapid data movement while simultaneously relieving the CPU of this task and thereby freeing capacity then available for computation. \par
-Even though we could not find an explanation for all measurements, this chapter still gives insight into the performance of the \gls{dsa}, its strengths and its weaknesses. It provides data-driven guidance on a complex architecture, helping to find the optimum for applying the \gls{dsa} to our expected and possibly different workloads. \par
+We encountered an anomaly on \gls{numa:node} 12 for which we were unable to find an explanation. Since this behaviour is also observed on the CPU, identifying the root cause falls beyond the scope of this work. Despite being unable to account for all measurements, this chapter still offers valuable insights into the performance of the \gls{dsa}, highlighting both its strengths and weaknesses. It provides data-driven guidance for a complex architecture, aiding in determining the optimal approach for optimal utilization of the \gls{dsa}. \par
%%% Local Variables:
%%% TeX-master: "diplom"
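
The theoretical peak bandwidth quoted in the thesis diff above follows from the formula \(8800\ MT/s \times 8B/T\). As a quick cross-check, the arithmetic can be reproduced with a few lines of Python; the variable names are only illustrative and not part of the benchmark code.

# Cross-check of the theoretical bandwidth figure from the thesis text.
transfers_per_second = 8800 * 10**6      # 8800 MT/s
bytes_per_transfer = 8                   # 8 B/T
bytes_per_second = transfers_per_second * bytes_per_transfer   # 70400 * 10^6 B/s
gib_per_second = bytes_per_second / (1024 ** 3)
print(f"{gib_per_second:.2f} GiB/s")     # prints 65.57; quoted as 65.56 GiB/s in the text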

BIN thesis/images/plot-dsa-throughput-scaling.pdf
