
continue writing the implementation chapter

commit 627ab7eec6 (master)
Constantin Fürst, 11 months ago
 thesis/content/40_design.tex         |  4
 thesis/content/50_implementation.tex |  9
 thesis/own.bib                       | 13

--- a/thesis/content/40_design.tex
+++ b/thesis/content/40_design.tex

@@ -56,7 +56,7 @@ Firstly, overlapping areas in the cache will cause undefined behaviour during in
Secondly, invalidation is to be performed manually, requiring the programmer to remember which pieces of data are cached at any given point in time and to invalidate them upon modification. No ordering guarantees are given for this situation, possibly leading to threads still holding a pointer to a now-outdated entry and continuing their progress with it. \par
-Due to its reliance on libnuma for numa awareness, \texttt{Cache} will only work on systems where this library is present, excluding, most notably, Windows from the compatibility list. \par
+Due to its reliance on libnuma for memory allocation and thread pinning, \texttt{Cache} will only work on systems where this library is present, excluding, most notably, Windows from the compatibility list. \par
\subsection{Thread Safety Guarantees}
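The manual-invalidation contract described in this hunk can be illustrated with a small, self-contained sketch. `ToyCache` and its method names are hypothetical stand-ins for the thesis' `Cache` interface, not its actual API:

```cpp
#include <unordered_map>

// Hypothetical stand-in for the Cache interface described above. It models
// only the documented contract: a cache miss copies the source value in,
// later accesses return the (possibly stale) copy, and the programmer must
// invalidate manually after modifying the source.
struct ToyCache {
    std::unordered_map<const int*, int> entries;  // source address -> copy

    // returns a pointer into the cache, copying the value on a miss
    const int* Access(const int* src) {
        auto it = entries.find(src);
        if (it == entries.end())
            it = entries.emplace(src, *src).first;
        return &it->second;
    }

    // manual invalidation: drop the cached copy for this source address
    void Invalidate(const int* src) { entries.erase(src); }
};
```

Without the `Invalidate` call, a reader keeps observing the stale copy after the source has changed, which is exactly the hazard the paragraph warns about.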
@@ -64,7 +64,7 @@ After initialization, all available operations for \texttt{Cache} and \texttt{Ca
\subsection{Accelerator Usage} \label{subsec:implementation:accel-usage}
-Compared with the challenges of ensuring correct entry lifetime and thread safety, the application of \gls{dsa} for the task of duplicating data is simple, thanks partly to \gls{intel:dml} \cite{intel:dmldoc}. Upon a call to \texttt{Cache::Access} and determining that the given memory pointer is not present in cache, work will be submitted to the Accelerator. Beforehand, however, the desired location must be determined, which the user-defined cache placement policy function handles. With the desired placement obtained, the copy policy function then determines which nodes should take part in the copy operation, which is equivalent to selecting the Accelerators following \ref{subsection:dsa-hwarch}. This causes the work to be split upon the available accelerators, to which the work descriptors are then submitted. The handlers that \gls{intel:dml} \cite{intel:dmldoc} provides will then be moved to the \texttt{CacheData} instance to permit the caller to wait upon caching completion. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed in \ref{chap:implementation}.
+Compared with the challenges of ensuring correct entry lifetime and thread safety, the application of \gls{dsa} for the task of duplicating data is simple, thanks partly to \gls{intel:dml} \cite{intel:dmldoc}. Upon a call to \texttt{Cache::Access} and determining that the given memory pointer is not present in cache, work will be submitted to the Accelerator. Beforehand, however, the desired location must be determined, which the user-defined cache placement policy function handles. With the desired placement obtained, the copy policy function then determines which nodes should take part in the copy operation, which is equivalent to selecting the Accelerators following \ref{subsection:dsa-hwarch}. This causes the work to be split upon the available accelerators, to which the work descriptors are then submitted. The handlers that \gls{intel:dml} \cite{intel:dmldoc} provides will then be moved to the \texttt{CacheData} instance to permit the caller to wait upon caching completion. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed in \ref{chap:implementation}. \par
\cleardoublepage
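The access flow described in the hunk above — split the copy across participating engines, submit the work, and move the resulting handlers into the `CacheData` entry so the caller can wait on completion — can be sketched with `std::async` standing in for Intel DML work submission. The names and the even work-splitting scheme are illustrative assumptions, not the thesis' actual code:

```cpp
#include <cstddef>
#include <cstring>
#include <future>
#include <vector>

// Toy model of a cache entry: a destination buffer plus the handlers of the
// in-flight copy operations that filled it.
struct CacheData {
    std::vector<char> buffer;
    std::vector<std::future<void>> handlers;  // one per participating engine

    void WaitOnCompletion() {
        for (auto& h : handlers) h.wait();
    }
};

// On a cache miss, split the copy into one chunk per engine and submit each
// chunk asynchronously (std::async stands in for a DSA work descriptor).
CacheData SubmitCopy(const char* src, std::size_t size, int num_engines) {
    CacheData entry;
    entry.buffer.resize(size);
    char* base = entry.buffer.data();  // heap pointer, stable across moves
    const std::size_t chunk = size / num_engines;
    for (int i = 0; i < num_engines; ++i) {
        const std::size_t off = static_cast<std::size_t>(i) * chunk;
        const std::size_t len = (i == num_engines - 1) ? size - off : chunk;
        entry.handlers.push_back(std::async(std::launch::async,
            [base, src, off, len] { std::memcpy(base + off, src + off, len); }));
    }
    return entry;  // handlers travel with the entry, as the text describes
}
```

The key point mirrored here is that the entry owns its handlers, so any thread holding the entry can wait for the copy to finish before reading the buffer.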

--- a/thesis/content/50_implementation.tex
+++ b/thesis/content/50_implementation.tex

@@ -28,6 +28,8 @@ As the usage of locking and atomics may have a significant impact on performance
\subsection{Cache State Lock}
-To keep track of the current cache state,
+To keep track of the current cache state, a map is used internally which associates a memory address with a \texttt{CacheData} instance. In \ref{subsec:design:cache-entry-reuse} we decided to reuse one cache entry for multiple consumers, requiring thread safety when accessing and extending the cache state in \texttt{Cache::Access}, \texttt{Cache::Flush} and \texttt{Cache::Clear}. The latter two both require a unique lock, preventing other calls to \texttt{Cache} from making progress while the operation is being processed. For \texttt{Cache::Access}, the use of locking depends upon the cache's state. At first, only a shared lock is acquired to check whether the given address already resides in cache, allowing other \texttt{Cache::Access} operations to perform this check concurrently. If no entry for the region is present, a unique lock is additionally required when adding the newly created entry to the cache, which, however, is a rather short operation. \par
In scenarios where the \texttt{Cache} is frequently tasked with flushing and re-caching by multiple threads accessing large amounts of data, leading to high memory pressure, lock contention around this lock will negatively impact performance by delaying cache access. Due to passive waiting, this impact might be less noticeable when other threads on the system are able to make progress during the wait. \par
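The two-phase locking scheme this hunk describes — a shared lock for the lookup, a brief unique lock only when inserting a missing entry — can be sketched with `std::shared_mutex`. The value type and method names are simplified assumptions, not the thesis' actual code; the re-check under the unique lock is likewise an assumed detail needed to keep the pattern race-free:

```cpp
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Sketch of the described cache-state locking: concurrent readers share the
// lock for lookups; exclusive access is held only for the short insert and
// for Flush/Clear-style operations.
class StateMap {
    std::shared_mutex mutex_;
    std::map<const void*, std::string> state_;  // address -> entry (simplified)

public:
    std::string Access(const void* addr) {
        {
            std::shared_lock lock(mutex_);  // cheap path: concurrent lookups
            auto it = state_.find(addr);
            if (it != state_.end()) return it->second;
        }
        std::unique_lock lock(mutex_);      // short exclusive section
        // re-check under the unique lock: another thread may have inserted
        // the entry between releasing the shared lock and acquiring this one
        return state_.try_emplace(addr, "cached").first->second;
    }

    void Clear() {
        std::unique_lock lock(mutex_);      // Flush/Clear block all access
        state_.clear();
    }
};
```

Holding the unique lock only for the map insertion, not for the copy itself, keeps the exclusive section as short as the text requires.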
@@ -42,11 +44,16 @@ In scenarios where the \texttt{Cache} is frequently tasked with flushing and re-
\subsection{Performance Guideline}
Atomic operations come with an added performance penalty. No recent studies were found on this, although we assume that the findings of Hermann Schweizer et al. in "Evaluating the Cost of Atomic Operations on Modern Architectures" \cite{atomics-cost-analysis} still hold true for today's processor architectures. Due to the inherent cache synchronization mechanisms in place \cite[Subsection IV.A.3 Off-Die Access]{atomics-cost-analysis}, they observed a significant increase in access latency depending on whether the atomic variable was located on the local core, on a different core on the same chip, or on another socket \cite[Fig. 4]{atomics-cost-analysis}.
The performance impact of lock contention \todo{find a reference for this and use above too} and atomic synchronization \todo{find a reference for this and use above too} is not to be taken lightly, as \texttt{Cache} may be used in performance-critical systems. Reducing their impact is therefore desirable, which can be achieved in multiple ways. The easiest is to have one instance of \texttt{Cache} per \gls{numa:node}, which reduces both lock contention, by simply serving fewer threads, and atomic synchronization, as the atomics are then shared only between physically close CPU cores \todo{find a reference that shows that physical distance affects sync cost}. This requires no code modification but does not inherently reduce the amount of synchronization taking place. To achieve this reduction, restrictions must be placed upon the thread safety or access guarantees, which is not sensible for this generic implementation.
\section{Accelerator Usage}
-Following \ref{subsec:implementation:accel-usage}, the provided implementation of \texttt{Cache} leaves it up to the user to choose a cache placement and copy policy, which is accomplished by submitting function pointers at initialization of the \texttt{Cache}. In \ref{sec:state:setup-and-config} we configured our system to expose separate \gls{numa:node}s for accessing \gls{hbm}, whose \gls{numa:node}-IDs are obtained by adding eight to the ID of the \gls{numa:node} that physically contains the \gls{hbm}. Therefore, given that \gls{numa:node} 3 accesses some datum, the most efficient placement for the copy would be on \gls{numa:node} \(3 + 8 = 11\). As the \texttt{Cache} is intended for multithreaded usage, conserving accelerator resources is important, so that concurrent cache requests complete quickly. To get high per-copy performance while maintaining low usage, the smart-copy method described in \ref{sec:perf:datacopy} is selected for larger copies, while small copies under 64 MiB are handled exclusively by the current node. This size is quite high, but due to the overhead of assigning the current thread to the selected nodes, using only the current one is more efficient for small copies. This assignment is required because \gls{intel:dml} is not \gls{numa}-aware and therefore assigns submissions only to the \gls{dsa} engine present on the node that the calling thread is assigned to \cite{intel:dmldoc}. \todo{Actually test which size makes sense, due to numa-setaffinity and scheduling overhead this will probably be much higher}
+Following \ref{subsec:implementation:accel-usage}, the provided implementation of \texttt{Cache} leaves it up to the user to choose a cache placement and copy policy, which is accomplished by submitting function pointers at initialization of the \texttt{Cache}. In \ref{sec:state:setup-and-config} we configured our system to expose separate \gls{numa:node}s for accessing \gls{hbm}, whose \gls{numa:node}-IDs are obtained by adding eight to the ID of the \gls{numa:node} that physically contains the \gls{hbm}. Therefore, given that \gls{numa:node} 3 accesses some datum, the most efficient placement for the copy would be on \gls{numa:node} \(3 + 8 = 11\). As the \texttt{Cache} is intended for multithreaded usage, conserving accelerator resources is important, so that concurrent cache requests complete quickly. To get high per-copy performance while maintaining low usage, the smart-copy method described in \ref{sec:perf:datacopy} is selected for larger copies, while small copies under 64 MiB are handled exclusively by the current node. This size is quite high, but due to the overhead of assigning the current thread to the selected nodes, using only the current one is more efficient for small copies. This assignment is required because \gls{intel:dml} is not \gls{numa}-aware and therefore assigns submissions only to the \gls{dsa} engine present on the node that the calling thread is assigned to \cite[Section ``NUMA support'']{intel:dmldoc}. \todo{Actually test which size makes sense, due to numa-setaffinity and scheduling overhead this will probably be much higher}
\cleardoublepage
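The two user-supplied policies discussed above — HBM placement at "node + 8" and node selection switching at the 64 MiB threshold — can be written down as small pure functions. The signatures and the exact node set returned for the smart-copy case are assumptions for illustration, not the thesis' actual API:

```cpp
#include <cstddef>
#include <vector>

// Threshold from the text: copies below 64 MiB stay on the calling node.
constexpr std::size_t kSmartCopyThreshold = 64ull * 1024 * 1024;

// Cache placement policy (assumed signature): place the copy on the HBM
// node paired with the accessing CPU node, numbered "node + 8" per the
// system configuration described in the text.
int CachePlacementPolicy(int accessing_node) {
    return accessing_node + 8;
}

// Copy policy (assumed signature): which NUMA nodes participate in the
// copy. Small copies use only the current node; larger ones are split
// across several nodes' DSA engines ("smart copy").
std::vector<int> CopyPolicy(int current_node, int dst_node, std::size_t size) {
    if (size < kSmartCopyThreshold)
        return { current_node };        // small copy: current node only
    return { current_node, dst_node };  // smart copy: split across both
}
```

Keeping the policies as plain functions matches the design of passing function pointers at `Cache` initialization, so a user can swap in a different placement scheme without touching the cache itself.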

--- a/thesis/own.bib
+++ b/thesis/own.bib

@@ -44,7 +44,7 @@
author = {Intel},
title = {{Intel Data Mover Library Documentation}},
publisher = {GitHub},
-howpublished = {\url{https://intel.github.io/DML/index.html}},
+howpublished = {\url{https://intel.github.io/DML/documentation/api_docs/high_level_api.html}},
urldate = {2024-01-07}
}
@@ -55,3 +55,14 @@
url = {https://arxiv.org/pdf/2305.02480.pdf},
urldate = {2024-01-07}
}
+@INPROCEEDINGS{atomics-cost-analysis,
+  author = {Schweizer, Hermann and Besta, Maciej and Hoefler, Torsten},
+  booktitle = {2015 International Conference on Parallel Architecture and Compilation (PACT)},
+  title = {Evaluating the Cost of Atomic Operations on Modern Architectures},
+  year = {2015},
+  pages = {445-456},
+  doi = {10.1109/PACT.2015.24}
+}