
continue writing chapter 3, microbenchmarks

master
Constantin Fürst 11 months ago
parent
commit
ac3288169a
  1. BIN
      thesis/bachelor.pdf
  2. 1
      thesis/content/20_state.tex
  3. 20
      thesis/content/30_performance.tex
  4. 7
      thesis/own.gls

BIN
thesis/bachelor.pdf

1
thesis/content/20_state.tex

@ -86,6 +86,7 @@ Given the file permissions, it would now be possible for a process to submit wor
With some limitations, like lacking support for \gls{dsa:dwq} submission, this library presents a high-level interface that takes care of creation and submission of descriptors, some error handling and reporting. Thanks to the high-level-view the code may choose a different execution path at runtime which allows the memory operations to either be executed in hardware or software. The former on an accelerator or the latter using equivalent instructions provided by the library. This makes code based upon it automatically compatible with systems that do not provide hardware support. \cite[Sec. Introduction]{intel:dmldoc} \par
\section{Programming Interface}
\label{sec:state:dml}
As mentioned in Subsection \ref{subsec:state:dsa-software-view}, \gls{intel:dml} provides a high level interface to interact with the hardware accelerator, namely Intel \gls{dsa}. We choose to use the C++ interface and will now demonstrate its usage by example of a simple memcopy-implementation for the \gls{dsa}. \par

20
thesis/content/30_performance.tex

@ -1,26 +1,24 @@
\chapter{Performance Microbenchmarks}
\label{chap:perf}
\todo{write introductory paragraph}
mention article by reese cooper here
The performance of \gls{dsa} has been evaluated in great detail by Reese Kuper et al. in \cite{intel:analysis}. Therefore, we perform only a limited number of benchmarks, with the purpose of verifying the figures from \cite{intel:analysis} and analysing best practices and restrictions for applying \gls{dsa} to \gls{qdp}. \par
\section{Benchmarking Methodology}
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth]{images/structo-benchmark.png}
\caption{Benchmark Procedure Pseudo-Code}
\includegraphics[width=0.9\textwidth]{images/structo-benchmark-compact.png}
\caption{Benchmark Procedure Pseudocode}
\label{fig:benchmark-function}
\end{figure}
\todo{split graphic into multiple parts for the three submission types}
Benchmarks were conducted on an Intel Xeon Max CPU with exclusive access to the system, which was configured as described in Section \ref{sec:state:setup-and-config}. As Intel's \gls{intel:dml} does not support \gls{dsa:dwq}, we ran benchmarks exclusively through \gls{dsa:swq}. The source code of the benchmark application is available under \texttt{benchmarks} in the thesis repository \cite{thesis-repo}. As the full source code is available, we only briefly describe the pseudocode shown in Figure \ref{fig:benchmark-function} in the following paragraph. \par
\todo{write this section}
The benchmark performs node setup as described in Section \ref{sec:state:dml} and allocates source and destination memory on the nodes passed in as parameters. To avoid page faults affecting the results, the entire memory regions are written to before the timed part of the benchmark starts. The timed regions are marked with a yellow background in Figure \ref{fig:benchmark-function}, which shows that we measure three durations: the total time, the time for submission and the time for completion. To get accurate results, the benchmark is repeated multiple times. For small durations we only evaluate the total time, as the individual measurements contained too many irregularities. At the beginning of each repetition, all running threads synchronize at a barrier. The behaviour then differs depending on the selected submission method, which is either a single submission or a batch of a given size. This can be seen in Figure \ref{fig:benchmark-function} at the switch statement for \enquote{mode}. Single submission follows the example given in Section \ref{sec:state:dml}, and we therefore do not explain it in detail here. Batch submission works differently: a sequence of the specified size is created, to which tasks are then added. This sequence is submitted to the engine in the same way as a single descriptor. Further steps then again follow the example from Section \ref{sec:state:dml}. \par
\section{Submission Method}
With each submission, descriptors must be prepared and sent off to the underlying hardware. This is expected to come with a cost, affecting throughput sizes and submission methods differently. By submitting different sizes and comparing batching, single submission and utilizing the \gls{dsa}s queue with multi submission we will evaluate at which data size which submission method makes sense. \par
With each submission, descriptors must be prepared and sent off to the underlying hardware. This is expected to come with a cost that affects throughput differently depending on transfer size and submission method. By submitting different sizes and comparing batching with single submission, we evaluate which submission method makes sense at which data size. We expect single submission to perform consistently worse, with a more pronounced effect on smaller transfer sizes. We assume this because the overhead of a single submission to the \gls{dsa:swq} is incurred for every iteration, while the batch incurs this overhead only once for multiple copies. \par
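This expectation can be illustrated with a simple, idealized cost model, which is a sketch under assumptions and not derived from our measurements. Assume that every submission incurs a fixed overhead $t_{\mathrm{sub}}$, that the engine copies at a peak rate $B$, and that submission and copy do not overlap. These symbols, as well as the batch size $n$ and the transfer size $s$, are introduced here for illustration only. The effective throughput of a single submission and of a batch of $n$ copies is then
\[
T_{\mathrm{single}}(s) = \frac{s}{t_{\mathrm{sub}} + s/B}, \qquad
T_{\mathrm{batch}}(s, n) = \frac{n \cdot s}{t_{\mathrm{sub}} + n \cdot s/B} = \frac{s}{t_{\mathrm{sub}}/n + s/B} .
\]
For $s/B \gg t_{\mathrm{sub}}$ both expressions approach $B$, so the submission method becomes irrelevant for large transfers. For small $s$ the batch divides the overhead by $n$ and therefore reaches a higher throughput in this model. \par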
\begin{figure}[h]
\centering
@ -29,11 +27,9 @@ With each submission, descriptors must be prepared and sent off to the underlyin
\label{fig:perf-submitmethod}
\end{figure}
\todo{maybe remeasure with 8 KiB, 16 KiB as sizes}
In Figure \ref{fig:perf-submitmethod} we conclude that with transfers of 1 MiB and upwards, the submission method makes no noticeable difference. For smaller transfers the performance varies greatly, with batch operations leading in throughput. We assume that the high submission cost of the \gls{dsa:swq} causes all but the batch, which only performs one submission for its many descriptors, to suffer. This is aligned with the finding that \enquote{SWQ observes lower throughput between 1-8 KB [transfer size]} \cite[p. 6 and 7]{intel:analysis}. \par
From Figure \ref{fig:perf-submitmethod} we conclude that with transfers of 1 MiB and upwards, the submission method makes no noticeable difference. For smaller transfers the performance varies greatly, with batch operations leading in throughput. This is in line with the finding that \enquote{SWQ observes lower throughput between 1-8 KB [transfer size]} \cite[p. 6 and 7]{intel:analysis} for the normal submission method. \par
Another limitation may be observed in this result, namely the inherent throughput limit per \gls{dsa} chip of close to 30 GiB/s. This is apparently caused by I/O fabric limitations \cite[p. 5]{intel:analysis}. \par
Another limitation may be observed in this result, namely the inherent throughput limit of close to 30 GiB/s per \gls{dsa} chip. This is apparently caused by I/O fabric limitations \cite[p. 5]{intel:analysis}. We therefore conclude that multiple \gls{dsa} are required to fully utilize the available bandwidth of \gls{hbm}, which theoretically lies at 256 GB/s \cite[Table I]{hbm-arch-paper}. \par
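As a rough estimate, the per-\gls{dsa} limit and the theoretical \gls{hbm} bandwidth can be related directly. This is a back-of-the-envelope calculation which assumes that the limit of 30 GiB/s scales linearly with the number of engines and that nothing else restricts throughput. Neither assumption is verified here. Converting to decimal units gives
\[
30\,\mathrm{GiB/s} = 30 \cdot 2^{30}\,\mathrm{B/s} \approx 32.2\,\mathrm{GB/s}, \qquad
\frac{256\,\mathrm{GB/s}}{32.2\,\mathrm{GB/s}} \approx 7.95 ,
\]
so under these assumptions roughly eight \gls{dsa} would have to work in parallel to approach the theoretical \gls{hbm} bandwidth. \par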
\section{Multithreaded Submission}

7
thesis/own.gls

@ -116,3 +116,10 @@
first={High Bandwidth Memory (HBM)},
description={... desc ...}
}
\newglossaryentry{qdp}{
name={QdP},
long={Query-driven Prefetching},
first={Query-driven Prefetching (QdP)},
description={... desc ...}
}