2 commits, 3 files changed:
thesis/bachelor.pdf (binary)
thesis/content/40_design.tex (2 changes)
thesis/content/60_evaluation.tex (13 changes)

thesis/bachelor.pdf (binary file, no diff shown)

thesis/content/40_design.tex (2 changes)

@@ -69,6 +69,8 @@ Due to its reliance on libnuma for memory allocation, \texttt{Cache} is exclusiv
Compared with the challenges of ensuring correct entry lifetime and thread safety, applying \gls{dsa} to the task of duplicating data is relatively straightforward, thanks in part to \gls{intel:dml} \cite{intel:dmldoc}. Upon a call to \texttt{Cache::Access}, once it is determined that the given memory pointer is not present in the cache, work is submitted to the accelerator. Before proceeding, however, the desired location for the cache entry must be determined, which is handled by the user-defined cache placement policy function. Once the desired placement is obtained, the copy policy determines which nodes should participate in the copy operation. Following Section \ref{subsection:dsa-hwarch}, this is equivalent to selecting the accelerators. The copy tasks are then distributed across the participating nodes. As the choice of cache placement and copy policy is user-defined, one possibility will be discussed in Chapter \ref{chap:implementation}. \par
\todo{modify naming, possible outsource policy functions, mention memory policy}
%%% Local Variables:
%%% TeX-master: "diplom"
%%% End:
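
The paragraph added above describes the miss path of \texttt{Cache::Access}: the cache placement policy chooses the target node, the copy policy chooses the participating nodes, and the copy is split across them and submitted through \gls{intel:dml}. The sketch below illustrates one way this flow could look in code. It is a minimal, hypothetical example: the policy signatures, the helper submit_copy and the chunking scheme are assumptions made for illustration, not the interface of the actual \texttt{Cache}; the Intel DML calls follow the pattern of the library's high-level C++ interface (dml::submit, dml::mem_copy, dml::make_view), but exact signatures should be checked against \cite{intel:dmldoc}.

    #include <dml/dml.hpp>   // Intel DML high-level C++ interface
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical policy signatures: the thesis only states that the cache
    // placement policy and the copy policy are user-defined functions; the
    // exact parameter lists here are illustrative assumptions.
    using CachePolicy = int (*)(int numa_dst_node, int numa_src_node, std::size_t size);
    using CopyPolicy  = std::vector<int> (*)(int numa_dst_node, int numa_src_node, std::size_t size);

    // Illustrative miss path of an Access-like call: the copy policy picks the
    // participating nodes, the copy is split into one chunk per node, and each
    // chunk is submitted as a dml::mem_copy descriptor. Selecting the DSA
    // engine that belongs to a given node is omitted here.
    inline void submit_copy(uint8_t* src, uint8_t* dst, std::size_t size,
                            int dst_node, int src_node, CopyPolicy copy_policy)
    {
        const std::vector<int> nodes = copy_policy(dst_node, src_node, size);
        const std::size_t chunk = size / nodes.size();   // assumes at least one node

        // Deduce the handler type returned by dml::submit instead of naming it.
        using handler_t = decltype(dml::submit<dml::hardware>(
            dml::mem_copy, dml::make_view(src, size), dml::make_view(dst, size)));
        std::vector<handler_t> handlers;

        for (std::size_t i = 0; i < nodes.size(); ++i) {
            const std::size_t offset = i * chunk;
            const std::size_t length = (i + 1 == nodes.size()) ? size - offset : chunk;

            handlers.emplace_back(dml::submit<dml::hardware>(
                dml::mem_copy,
                dml::make_view(src + offset, length),
                dml::make_view(dst + offset, length)));
        }

        for (auto& handler : handlers) {
            auto result = handler.get();                 // wait for completion
            if (result.status != dml::status_code::ok) {
                // Error handling (e.g. falling back to a plain memcpy) omitted.
            }
        }
    }

In this sketch the caller blocks on completion at the end; an asynchronous variant would hand the handlers to the cache entry so that waiting only happens when the data is actually accessed.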

thesis/content/60_evaluation.tex (13 changes)

@@ -22,7 +22,7 @@ The pipelines \(SCAN_a\) and \(SCAN_b\) execute concurrently, completing their t
\section{Expectations}
\label{sec:eval:expectations}
The simple query presents a challenging scenario for the \texttt{Cache}. The execution time for the filter operation applied to column \texttt{a} is expected to be brief. Consequently, the \texttt{Cache} has limited time for prefetching, which may exacerbate delays caused by processing overhead in the \texttt{Cache} or during accelerator offload. Furthermore, it can be assumed that \(SCAN_a\) is memory-bound by itself. Since the prefetching of \texttt{b} in \(SCAN_b\) and the loading and subsequent filtering of \texttt{a} occur concurrently, caching directly diminishes the memory bandwidth available to \(SCAN_a\) when both columns are located on the same \gls{numa:node}. \par
\section{Observations}
\label{sec:eval:observations}
@@ -103,13 +103,12 @@ Regarding the benchmark depicted in Figure \ref{fig:timing-results:prefetch}, wh
\section{Discussion}
\begin{itemize}
\item Is the test a good choice? (yes, it shows the worst case with a memory-bound application and amplifies the effects of overhead)
\item Are there other use cases for the cache? (yes, applications that are CPU-bound -> DSA frees up CPU cycles)
\item Can we assume that it is possible to distribute columns over multiple nodes?
\end{itemize}
In Section \ref{sec:eval:expectations}, we anticipated that the simple query would pose a challenging case for prefetching. This expectation proved accurate, highlighting that improper data distribution can adversely affect performance when utilizing the \texttt{Cache}. Thus, we consider the chosen scenario to be well-suited, as it showcases both performance gains and losses, underscoring the importance of optimizing parameters and scenarios to achieve positive outcomes. \par
The necessity to distribute data across \gls{numa:node}s is considered practical, given that developers commonly apply this optimization to leverage the available memory bandwidth of \glsentrylong{numa} systems. Consequently, the \texttt{Cache} has demonstrated its effectiveness by achieving a respectable speed-up positioned directly between the baseline and the theoretical upper limit (refer to Table \ref{table:qdp-speedup}). \par
As stated in Section \ref{sec:design:cache}, the decision to design and implement a cache instead of focusing solely on prefetching was made to enhance the usefulness of this work's contribution. While our tests were conducted on a system with \gls{hbm}, other advancements in main memory technologies, such as Non-Volatile or Remote Memory, were not considered, as mentioned in Chapter \ref{chap:intro}. Despite the public functions of the \texttt{Cache} being named with cache usage in mind, its utility extends beyond this scope, providing flexibility through the policy functions described in Section \ref{sec:design:accel-usage}. Potential applications include background copying of data from remote locations into faster local memory for computation or replication to non-volatile memory for data loss prevention. Therefore, we consider the increase in design complexity to be a worthwhile trade-off, providing a significant contribution to the field of heterogeneous memory systems. \par
\todo{write this section}
%%% Local Variables:
%%% TeX-master: "diplom"
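
The discussion above argues that the policy functions make the \texttt{Cache} useful beyond prefetching into \gls{hbm}, for instance for replicating data to non-volatile memory or pulling remote data into faster local memory. A small hypothetical sketch of that flexibility is given below; the node ids, the function names and the assumed DRAM-to-HBM node mapping are illustrative assumptions, not part of the thesis' implementation.

    #include <cstddef>
    #include <vector>

    // Same hypothetical policy signatures as in the earlier sketch.
    using CachePolicy = int (*)(int numa_dst_node, int numa_src_node, std::size_t size);
    using CopyPolicy  = std::vector<int> (*)(int numa_dst_node, int numa_src_node, std::size_t size);

    // Prefetching use: place the copy on the HBM node attached to the
    // requesting socket. The "+ 8" offset is a made-up DRAM-to-HBM node
    // mapping and depends entirely on the platform topology.
    static int place_in_hbm(int numa_dst_node, int /*numa_src_node*/, std::size_t /*size*/) {
        return numa_dst_node + 8;
    }

    // Replication use: always place the copy on a node assumed to be backed by
    // non-volatile memory (node 16 is a placeholder id), turning the same
    // access call into a background replication primitive instead of a prefetch.
    static int place_in_nvram(int /*numa_dst_node*/, int /*numa_src_node*/, std::size_t /*size*/) {
        return 16;
    }

    // Let only the source and the destination node participate in the copy.
    static std::vector<int> copy_on_src_and_dst(int numa_dst_node, int numa_src_node, std::size_t /*size*/) {
        return { numa_src_node, numa_dst_node };
    }

Which pair of functions is handed to the \texttt{Cache} would then decide whether an access acts as a prefetch into \gls{hbm} or as replication towards persistent memory, while the cache code itself stays unchanged.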
