
add todos to unfinished parts of the thesis

master
Constantin Fürst 11 months ago
commit 09a95f1e50
5 changed files:

  thesis/content/02_abstract.tex      (2 changes)
  thesis/content/10_introduction.tex  (24 changes)
  thesis/content/30_performance.tex   (26 changes)
  thesis/content/40_design.tex        (14 changes)
  thesis/content/60_evaluation.tex    (4 changes)

thesis/content/02_abstract.tex (2 changes)

@@ -10,6 +10,8 @@
 \ldots abstract \ldots
+\todo{write the abstract}
 %%% Local Variables:
 %%% TeX-master: "diplom"
 %%% End:

thesis/content/10_introduction.tex (24 changes)

@@ -1,5 +1,5 @@
 \chapter{Introduction}
-\label{sec:intro}
+\label{chap:intro}
 % The introduction is written last, once the thesis is by and large
 % already finished. (If you start with the introduction - a
@@ -12,28 +12,8 @@
 % the rest of the thesis. Usually you need at least 4 pages for it;
 % nobody reads more than 10 pages.
-\section{Introduction to Query-driven Prefetching}
-\begin{itemize}
-    \item database context where we have an execution plan for a query to be executed
-    \item use knowledge about the query's sub-tasks to determine the part of the table which is worth caching (used multiple times)
-    \item refer to the whitepaper for more information
-\end{itemize}
-\section{Introduction to Intel Data Streaming Accelerator}
-\begin{itemize}
-    \item
-\end{itemize}
-\section{Goal Definition}
-\begin{itemize}
-    \item use DSA to offload asynchronous prefetching tasks
-    \item prefetch into HBM which is smaller (cannot hold all data) but faster (used as a large cache)
-    \item effect is lower CPU utilization for copying
-    \item this allows focusing on actual pipeline execution
-\end{itemize}
+\todo{write this chapter}
 \cleardoublepage

thesis/content/30_performance.tex (26 changes)

@@ -1,5 +1,5 @@
 \chapter{Performance Microbenchmarks}
-\label{sec:perf}
+\label{chap:perf}
 % This is the central chapter of the thesis. Here the goal as well as
 % one's own ideas, judgements, and design decisions are presented. It can
@@ -18,18 +18,17 @@
 % probably have at least 8 pages; more than 20 can be a hint
 % that the level of abstraction was missed.
-\begin{itemize}
-    \item
-\end{itemize}
+\todo{write introductory paragraph}
 \section{Benchmarking Methodology}
-\begin{itemize}
-    \item
-\end{itemize}
+\todo{write this section}
 \section{Submission Method}
+\todo{write this section}
 \begin{itemize}
     \item submit cost analysis: best method and for a subset the point at which submit cost < time savings
     \item display the full opt-submitmethod graph
@@ -42,6 +41,8 @@
 \section{Multithreaded Submission}
+\todo{write this section}
 \begin{itemize}
     \item effect of mt-submit, low because \gls{dsa:swq} implicitly synchronized, bandwidth is shared
     \item show results for all available core counts
@@ -50,8 +51,12 @@
     \item conclude that due to the implicit synchronization the sync-cost also affects 1t and therefore it makes no difference, bandwidth is shared, no guarantees on fairness
 \end{itemize}
+\todo{write this section}
 \section{Multiple Engines in a Group}
+\todo{write this section}
 \begin{itemize}
     \item assumed from arch spec that multiple engines lead to greater performance
     \item reason is that page faults and access latency will be overlapped with preparing the next operation
@@ -61,7 +66,9 @@
     \item conclusion?
 \end{itemize}
-\section{Data Movement from DDR to HBM}
+\section{Data Movement from DDR to HBM} \label{sec:perf:datacopy}
+\todo{write this section}
 \begin{itemize}
     \item present two copy methods: smart and brute force
@@ -73,6 +80,8 @@
 \section{Analysis}
+\todo{write this section}
 \begin{itemize}
     \item summarize the conclusions and define the point at which dsa makes sense
     \item minimum transfer size for batch/nonbatch operation
@@ -82,7 +91,6 @@
     \item lower utilization of dsa is good when it will be shared between threads/processes
 \end{itemize}
 \cleardoublepage
 %%% Local Variables:

thesis/content/40_design.tex (14 changes)

@@ -1,5 +1,5 @@
 \chapter{Design}
-\label{sec:design}
+\label{chap:design}
 % This is the central chapter of the thesis. Here the goal as well as
 % one's own ideas, judgements, and design decisions are presented. It can
@@ -18,8 +18,12 @@
 % probably have at least 8 pages; more than 20 can be a hint
 % that the level of abstraction was missed.
+\todo{write introductory paragraph}
 \section{Detailed Task Description}
+\todo{write this section}
 \begin{itemize}
     \item give a slightly more detailed task description
     \item perspective of "what problems have to be solved"
@@ -36,7 +40,7 @@ To allow rapid integration and ease developer workload, a simple interface was c
 As caching is performed asynchronously, the user may wish to wait on the operation. This would be beneficial if there are other threads making progress in parallel while the current thread waits on its data becoming available in the faster cache, speeding up local computation. To achieve this, \texttt{Cache::Access} will return an instance of an object which hereinafter will be referred to as \texttt{CacheData}. Through \texttt{CacheData::GetDataLocation} a pointer to the cached data can be retrieved, while \texttt{CacheData::WaitOnCompletion} must only return once the caching operation has completed; while waiting, the current thread is put to sleep, allowing other threads to progress. \par
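A minimal sketch of the interface this paragraph describes, in C++. Only the class and method names (\texttt{Cache::Access}, \texttt{CacheData::GetDataLocation}, \texttt{CacheData::WaitOnCompletion}) are taken from the text; all parameter and return types are assumptions made for illustration:

    #include <cstddef>
    #include <cstdint>
    #include <memory>

    // Hypothetical shape of the caching interface; signatures assumed.
    class CacheData {
    public:
        // Returns a pointer to the data, pointing at the cached copy
        // once the asynchronous copy operation has completed.
        uint8_t* GetDataLocation() const;

        // Puts the calling thread to sleep until the caching operation
        // has completed, letting other threads progress in the meantime.
        void WaitOnCompletion();
    };

    class Cache {
    public:
        // Asynchronously caches [data, data + size) and returns a
        // handle used to locate the copy and wait for its completion.
        std::shared_ptr<CacheData> Access(uint8_t* data, size_t size);
    };

A consumer would call \texttt{Access}, continue with independent work, and invoke \texttt{WaitOnCompletion} only once the data is actually needed.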
-\subsection{Cache Entry Reuse}
+\subsection{Cache Entry Reuse} \label{subsec:design:cache-entry-reuse}
 When multiple consumers wish to access the same memory block through the \texttt{Cache}, we could either provide each with their own entry or share one entry for all consumers. The first option may cause high load on the accelerator due to multiple copy operations being submitted and also increases the memory footprint of the system. The latter option requires synchronization and a more complex design. As the cache size is restrictive, the latter was chosen. The already existing \texttt{CacheData} will be extended in scope to handle this by allowing copies of it to be created which must synchronize with each other for \texttt{CacheData::WaitOnCompletion} and \texttt{CacheData::GetDataLocation}. \par
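From the consumer's perspective, this sharing could behave as follows, continuing the sketch above; that \texttt{Access} returns a shared handle and that both handles resolve to the same location are assumptions, only the synchronization requirement itself is stated in the text:

    #include <cassert>

    // Two consumers request the same memory block: both handles refer
    // to one shared cache entry, so only a single copy operation is
    // submitted to the accelerator.
    auto first  = cache.Access(data, size);
    auto second = cache.Access(data, size);  // reuses the entry above

    // Copies synchronize on the same underlying operation: the waiting
    // thread wakes once the single shared copy has completed.
    second->WaitOnCompletion();
    assert(first->GetDataLocation() == second->GetDataLocation());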
@@ -56,11 +60,11 @@ Due to its reliance on libnuma for numa awareness, \texttt{Cache} will only work
 \subsection{Thread Safety Guarantees}
-After initialization, all available operations for \texttt{Cache} and \texttt{CacheData} are fully threadsafe but may use locks internally to achieve this. In \ref{sec:implementation} we will go into more detail on how these guarantees are provided and how to optimize the cache for specific use cases that may warrant less restrictive locking. \par
+After initialization, all available operations for \texttt{Cache} and \texttt{CacheData} are fully threadsafe but may use locks internally to achieve this. In \ref{chap:implementation} we will go into more detail on how these guarantees are provided and how to optimize the cache for specific use cases that may warrant less restrictive locking. \par
-\subsection{Accelerator Usage}
+\subsection{Accelerator Usage} \label{subsec:implementation:accel-usage}
-Compared with the challenges of ensuring correct entry lifetime and thread safety, the application of \gls{dsa} for the task of duplicating data is simple, thanks partly to \gls{intel:dml} \cite{intel:dmldoc}. Upon a call to \texttt{Cache::Access} that determines the given memory pointer is not present in the cache, work will be submitted to the accelerator. Before that, however, the desired location must be determined, which the user-defined cache placement policy function handles. With the desired placement obtained, the copy policy function then determines which nodes should take part in the copy operation, which is equivalent to selecting the accelerators following \ref{subsection:dsa-hwarch}. The work is thus split across the available accelerators, to which the work descriptors are submitted at this time. The handlers that \gls{intel:dml} \cite{intel:dmldoc} provides will then be moved to the \texttt{CacheData} instance to permit the caller to wait upon caching completion. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed in \ref{sec:implementation}.
+Compared with the challenges of ensuring correct entry lifetime and thread safety, the application of \gls{dsa} for the task of duplicating data is simple, thanks partly to \gls{intel:dml} \cite{intel:dmldoc}. Upon a call to \texttt{Cache::Access} that determines the given memory pointer is not present in the cache, work will be submitted to the accelerator. Before that, however, the desired location must be determined, which the user-defined cache placement policy function handles. With the desired placement obtained, the copy policy function then determines which nodes should take part in the copy operation, which is equivalent to selecting the accelerators following \ref{subsection:dsa-hwarch}. The work is thus split across the available accelerators, to which the work descriptors are submitted at this time. The handlers that \gls{intel:dml} \cite{intel:dmldoc} provides will then be moved to the \texttt{CacheData} instance to permit the caller to wait upon caching completion. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed in \ref{chap:implementation}.
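The submission path in this paragraph could look roughly like the following, using the Intel DML high-level C++ API (\texttt{dml::submit}, \texttt{dml::mem\_move}, \texttt{dml::make\_view}). The policy-function signatures, the \texttt{AllocateOnNode} and \texttt{AddHandler} helpers, and the even chunk split are assumptions; steering each descriptor to a specific node's accelerator needs additional setup not shown here:

    // Inside Cache::Access on a cache miss (simplified); assumes
    // #include <dml/dml.hpp> and the declarations sketched earlier.
    // placement_policy_ and copy_policy_ stand in for the user-defined
    // cache placement and copy policy functions named in the text.
    const int dst_node = placement_policy_(task_node, data_node, size);
    const std::vector<int> nodes = copy_policy_(dst_node, data_node);

    uint8_t* const dst = AllocateOnNode(dst_node, size);  // hypothetical
    const size_t chunk = size / nodes.size();             // even split

    // One work descriptor per participating accelerator; the returned
    // handlers move into the CacheData instance so WaitOnCompletion
    // can block until every chunk has arrived.
    for (size_t i = 0; i < nodes.size(); ++i) {
        cache_data->AddHandler(dml::submit<dml::hardware>(
            dml::mem_move,
            dml::make_view(data + i * chunk, chunk),
            dml::make_view(dst + i * chunk, chunk)));
    }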
 \cleardoublepage

thesis/content/60_evaluation.tex (4 changes)

@@ -1,5 +1,5 @@
 \chapter{Evaluation}
-\label{sec:evaluation}
+\label{chap:evaluation}
 % Every thesis in our field includes a performance evaluation. This
 % chapter should make clear which methods were applied,
@@ -12,6 +12,8 @@
 \ldots evaluation \ldots
+\todo{write this chapter}
 \cleardoublepage
 %%% Local Variables:
