write a chunk of implementation chapter

11 months ago · 3301b17497
3 changed files with 49 additions and 2 deletions
--- a/thesis/bachelor.pdf
+++ b/thesis/bachelor.pdf
--- a/thesis/content/50_implementation.tex
+++ b/thesis/content/50_implementation.tex
@ -1,5 +1,5 @@
 \chapter{Implementation}
 \label{sec:implementation}
 \label{chap:implementation}
 % Here one picks out a few interesting aspects of the
 % implementation. The chapter must not be confused with documentation or
@ -20,7 +20,33 @@
 % have no place here, not even in the appendix, but belong on computers
 % on which they can be viewed.
 \ldots implementation \ldots
 \todo{write introductory paragraph}
 \section{Locking and Usage of Atomics}
 As the usage of locking and atomics may have a significant impact on performance, their application will be discussed in detail within this section. \todo{extend introductory paragraph} \par
 \subsection{Cache State Lock}
 To keep track of the current cache state, a map is used internally which associates a memory address with a \texttt{CacheData} instance. In \ref{subsec:design:cache-entry-reuse} we decided to reuse one cache entry for multiple consumers, which requires thread safety when accessing and extending the cache state in \texttt{Cache::Access}, \texttt{Cache::Flush} and \texttt{Cache::Clear}. The latter two both require a unique lock, preventing other calls to \texttt{Cache} from making progress while the operation is being processed. For \texttt{Cache::Access}, the use of locking depends on the cache's state. At first, only a shared lock is acquired to check whether the given address already resides in the cache, allowing other \texttt{Cache::Access} operations to perform this check concurrently. If no entry for the region is present, a unique lock is additionally required to add the newly created entry to the cache; this, however, is a rather short operation. \par
 In scenarios where the \texttt{Cache} is frequently tasked with flushing and re-caching by multiple threads accessing large amounts of data, leading to high memory pressure, contention on this lock will negatively impact performance by delaying cache access. Because threads wait passively for the lock, this impact might be less noticeable when other threads on the system are able to make progress during the wait. \par
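 The locking scheme described above can be illustrated with a minimal sketch. It is not the actual implementation of \texttt{Cache}: the names \texttt{CacheSketch}, \texttt{state\_mutex\_}, \texttt{state\_} and \texttt{Size} are hypothetical, \texttt{CacheData} is reduced to a placeholder, and the use of \texttt{std::shared\_mutex} with shared pointers is an assumption made for illustration. \par

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>
#include <utility>

// Placeholder for the thesis' CacheData; the real type is not shown here.
struct CacheData {
    std::uintptr_t source = 0;
};

// Hypothetical sketch of the cache state lock, not the actual Cache class.
class CacheSketch {
public:
    // Returns the entry for addr, creating it if it is not yet present.
    std::shared_ptr<CacheData> Access(std::uintptr_t addr) {
        {
            // Shared lock: concurrent Access calls may perform this lookup.
            std::shared_lock<std::shared_mutex> read_lock(state_mutex_);
            auto it = state_.find(addr);
            if (it != state_.end()) {
                return it->second;
            }
        }

        // Prepare the new entry before taking the unique lock so that the
        // exclusive section stays short.
        auto fresh = std::make_shared<CacheData>();
        fresh->source = addr;

        // Unique lock: only needed for extending the cache state.
        std::unique_lock<std::shared_mutex> write_lock(state_mutex_);
        // try_emplace keeps an entry that another thread may have inserted
        // between releasing the shared lock and acquiring the unique lock.
        auto result = state_.try_emplace(addr, std::move(fresh));
        return result.first->second;
    }

    // Clearing excludes all other operations on the cache state.
    void Clear() {
        std::unique_lock<std::shared_mutex> write_lock(state_mutex_);
        state_.clear();
    }

    // Number of entries, read under a shared lock.
    std::size_t Size() const {
        std::shared_lock<std::shared_mutex> read_lock(state_mutex_);
        return state_.size();
    }

private:
    mutable std::shared_mutex state_mutex_;
    std::unordered_map<std::uintptr_t, std::shared_ptr<CacheData>> state_;
};
```

 In this sketch, two calls to \texttt{Access} with the same address yield the same entry, and the entry is constructed outside the exclusive section so that the unique lock is held only for the map insertion. \par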
 \subsection{CacheData Reference Counting}
 \todo{write this section}
 \subsection{CacheData WaitOnCompletion}
 \todo{write this section}
 \subsection{Performance Guideline}
 The performance impact of lock contention \todo{find a reference for this and use above too} and atomic synchronization \todo{find a reference for this and use above too} is not to be taken lightly, as \texttt{Cache} may be used in performance-critical systems. Reducing their impact is therefore desirable, which can be achieved in multiple ways. The easiest is to have one instance of \texttt{Cache} per \gls{numa:node}, which reduces both lock contention, by serving fewer threads, and the cost of atomic synchronization, as the atomics are shared between physically close CPU cores \todo{find a reference that shows that physical distance affects sync cost}. This requires no code modification but does not inherently reduce the amount of synchronization taking place. To achieve such a reduction, restrictions must be placed upon the thread safety or access guarantees, which is not sensible for this generic implementation.
 \section{Accelerator Usage}
 Following \ref{subsec:implementation:accel-usage}, the provided implementation of \texttt{Cache} leaves it up to the user to choose a caching policy and a copy method policy, which is accomplished by submitting function pointers at initialization of the \texttt{Cache}. In \ref{sec:state:setup-and-config} we configured our system to have separate \gls{numa:node}s for accessing \gls{hbm}, which are assigned a \gls{numa:node} ID by adding eight to the ID of the \gls{numa:node} that physically contains the \gls{hbm}. Therefore, if \gls{numa:node} 3 accesses some datum, the most efficient placement for the copy would be on \gls{numa:node} \(3 + 8 = 11\). As the \texttt{Cache} is intended for multithreaded usage, conserving accelerator resources is important so that concurrent cache requests complete quickly. To get high per-copy performance while maintaining low accelerator usage, the smart-copy method described in \ref{sec:perf:datacopy} is selected for larger copies, while small copies under 64 MiB will be handled exclusively by the current node. This threshold is quite high, but due to the overhead of assigning the current thread to the selected nodes, using only the current node is more efficient below it. This assignment is required because \gls{intel:dml} is not \gls{numa} aware and therefore assigns submissions only to the \gls{dsa} engine present on the node that the calling thread is assigned to \cite{intel:dmldoc}. \todo{Actually test which size makes sense, due to numa-setaffinity and scheduling overhead this will probably be much higher}
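 The two policies described above can be sketched as plain functions that match the function pointer approach. All names in the sketch (\texttt{HbmNodeFor}, \texttt{SelectCopyMethod}, \texttt{CopyMethod} and the two pointer aliases) are hypothetical, and the actual policy signatures expected by \texttt{Cache} may differ. The offset of eight and the 64 MiB threshold are taken from the text, where the threshold is still marked as untested. \par

```cpp
#include <cstddef>

// Offset between a node and the node exposing its HBM, as configured in
// the system setup described in the text.
constexpr int kHbmNodeOffset = 8;

// Threshold below which a copy stays on the current node. 64 MiB is the
// value named in the text; it has not been validated by measurement.
constexpr std::size_t kSmartCopyThreshold = 64ull * 1024ull * 1024ull;

// Hypothetical tag for the two copy strategies named in the text.
enum class CopyMethod {
    CurrentNodeOnly,  // copy handled exclusively by the calling thread's node
    SmartCopy         // smart-copy method from the data copy evaluation
};

// Caching policy: place the copy in the HBM of the accessing node.
constexpr int HbmNodeFor(int accessing_node) {
    return accessing_node + kHbmNodeOffset;
}

// Copy method policy: small copies avoid the cost of re-assigning the
// thread to other nodes, larger copies use the smart-copy method.
constexpr CopyMethod SelectCopyMethod(std::size_t size_bytes) {
    return size_bytes < kSmartCopyThreshold ? CopyMethod::CurrentNodeOnly
                                            : CopyMethod::SmartCopy;
}

// Hypothetical function pointer types through which such policies could
// be handed to the cache at initialization.
using CachePlacementPolicy = int (*)(int accessing_node);
using CopyMethodPolicy = CopyMethod (*)(std::size_t size_bytes);
```

 With these definitions, \texttt{HbmNodeFor(3)} evaluates to 11, matching the example above, and a copy of exactly 64 MiB already falls under the smart-copy method. \par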
 \cleardoublepage
--- a/thesis/own.gls
+++ b/thesis/own.gls
@ -94,4 +94,25 @@
    long={Intel Data Mover Library},
    first={Intel Data Mover Library (Intel DML)},
    description={... desc ...}
 }
 \newglossaryentry{numa}{
    name={NUMA},
    long={Non-Uniform Memory Access},
    first={Non-Uniform Memory Access (NUMA)},
    description={... desc ...}
 }
 \newglossaryentry{numa:node}{
    name={Node},
    long={NUMA-Node},
    first={NUMA-Node (Node)},
    description={... desc ...}
 }
 \newglossaryentry{hbm}{
    name={HBM},
    long={High Bandwidth Memory},
    first={High Bandwidth Memory (HBM)},
    description={... desc ...}
 }