4 Commits

  1. BIN
      thesis/bachelor.pdf
  2. 2
      thesis/bachelor.tex
  3. 10
      thesis/content/20_state.tex
  4. 4
      thesis/content/60_evaluation.tex

BIN
thesis/bachelor.pdf

2
thesis/bachelor.tex

@ -28,7 +28,7 @@ plainpages=false,pdfpagelabels=true]{hyperref}
\hypersetup{
pdftitle={Data Movement in Heterogeneous Memories with Intel Data Streaming Accelerator},
pdfauthor={Anatol Constantin Fürst},
pdfkeywords={intel,dsa,numa,memory,bachelor,thesis},
pdfkeywords={intel,dsa,numa,memory,qdp,prefetching,cache,hbm,data,streaming,accelerator},
}
\input{preamble/packages.tex}

10
thesis/content/20_state.tex

@ -48,19 +48,15 @@ This chapter introduces the relevant technologies and concepts for this thesis.
\label{fig:qdp-simple-query}
\end{figure}
\gls{qdp} introduces a targeted strategy for optimizing database performance by intelligently prefetching relevant data. To achieve this, \gls{qdp} analyses queries, splitting them into distinct sub-tasks, resulting in the so-called query execution plan. An example of a query and a corresponding plan is depicted in Figure \ref{fig:qdp-simple-query}. From this plan, \gls{qdp} determines columns in the database used in subsequent tasks. Once identified, the system proactively copies these columns into faster memory during the execution of the pipeline. For the example (Figure \ref{fig:qdp-simple-query}), column \texttt{b} is accessed in \(SCAN_b\) and \(G_{sum(b)}\) and column \texttt{a} is only accessed for \(SCAN_a\). Therefore, only column \texttt{b} will be chosen for prefetching in this scenario. \cite{dimes-prefetching} \par
\gls{qdp} introduces a targeted strategy for optimizing database performance by intelligently prefetching relevant data. To achieve this, \gls{qdp} analyses queries, splitting them into distinct sub-tasks, resulting in the so-called query execution plan. An example of a query and a corresponding plan is depicted in Figure \ref{fig:qdp-simple-query}. From this plan, \gls{qdp} determines columns in the database used in subsequent tasks. Once identified, the system proactively copies these columns into faster memory. For the example (Figure \ref{fig:qdp-simple-query}), column \texttt{b} is accessed in \(SCAN_b\) and \(G_{sum(b)}\) and column \texttt{a} is only accessed for \(SCAN_a\). Therefore, only column \texttt{b} will be chosen for prefetching in this scenario. \cite{dimes-prefetching} \par
Applying pipelining, \gls{qdp} processes tasks in parallel and in chunks. Therefore, a high degree of concurrency may be observed, resulting in demand for CPU cycles and memory bandwidth. As prefetching takes place in parallel with query processing, it creates additional CPU load, potentially diminishing gains from the acceleration of subsequent steps through the cached data. For this reason, we intend to offload copy operations to the \gls{dsa} in this work, reducing the CPU impact and thereby increasing the performance gains offered by prefetching. \par
\section{\glsentrylong{dsa}}
\label{sec:state:dsa}
\blockquote{Intel \gls{dsa} is a high-performance data copy and transformation accelerator that will be integrated in future Intel® processors, targeted for optimizing streaming data movement and transformation operations common with applications for high-performance storage, networking, persistent memory, and various data processing applications. \cite[Ch. 1]{intel:dsaspec}}
Introduced with the \(4^{th}\) generation of Intel Xeon Scalable Processors, the \gls{dsa} aims to relieve the CPU from \enquote{common storage functions and operations such as data integrity checks and deduplication} \cite[p. 4]{intel:xeonbrief}. To fully utilize the hardware, a thorough understanding of its workings is essential. Therefore, we present an overview of the architecture, software, and the interaction of these two components, delving into the architectural details of the \gls{dsa} itself. All statements are based on Chapter 3 of the Architecture Specification by Intel. \par
\todo{comment by andre: Wirkt mir etwas knapp, vllt etwas mehr erklären. Z.B.: auch die Problematik mit dem Aufwand um das Kopieren schnell genug zu machen (viele Threads) -> das gibt dann auch den Ansatz punkt um zu sagen, was wir uns von der DSA erhoffen (Kontext, warum wir QdP überhaupt einführen). Und z.B.: welchen Trick wir angewandt haben um den Kopieraufwand gering zu halten.}
\todo{zeigen wo ich mit der dsa ansetze}
\subsection{Hardware Architecture} \label{subsection:dsa-hwarch}
\begin{figure}[!t]

4
thesis/content/60_evaluation.tex

@ -103,9 +103,9 @@ Regarding the benchmark depicted in Figure \ref{fig:timing-results:prefetch}, wh
\section{Discussion}
In Section \ref{sec:eval:expectations}, we anticipated that the simple query would pose a challenging case for prefetching. This expectation proved to be accurate, highlighting that improper data distribution can lead to adverse effects on performance when utilizing the \texttt{Cache}. Thus, we consider the chosen scenario to be well-suited, as it showcases both performance gains and losses, underscoring the importance of optimizing parameters and scenarios to achieve positive outcomes.
In Section \ref{sec:eval:expectations}, we anticipated that the simple query would pose a challenging case for prefetching. This expectation proved to be accurate, highlighting that improper data distribution can lead to adverse effects on performance when utilizing the \texttt{Cache}. Thus, we consider the chosen scenario to be well-suited, as it showcases both performance gains and losses, underscoring the importance of optimizing parameters and scenarios to achieve positive outcomes. \par
The necessity to distribute data across \gls{numa:node}s is seen as practical, given that developers commonly apply this optimization to leverage the available memory bandwidth of \glsentrylong{numa}s. Consequently, the \texttt{Cache} has demonstrated its effectiveness by achieving a respectable speed-up positioned directly between the baseline and the theoretical upper limit (refer to Table \ref{table:qdp-speedup}).
The necessity to distribute data across \gls{numa:node}s is seen as practical, given that developers commonly apply this optimization to leverage the available memory bandwidth of \glsentrylong{numa}s. Consequently, the \texttt{Cache} has demonstrated its effectiveness by achieving a respectable speed-up positioned directly between the baseline and the theoretical upper limit (refer to Table \ref{table:qdp-speedup}). \par
As stated in Section \ref{sec:design:cache}, the decision to design and implement a cache instead of focusing solely on prefetching was made to enhance the usefulness of this work's contribution. While our tests were conducted on a system with \gls{hbm}, other advancements in main memory technologies, such as Non-Volatile or Remote Memory, were not considered, as mentioned in Chapter \ref{chap:intro}. Despite the public functions of the \texttt{Cache} being named with cache usage in mind, its utility extends beyond this scope, providing flexibility through the policy functions, described in Section \ref{sec:design:accel-usage}. Potential applications include background copying of data from remote locations into faster local memory for computation or replication to non-volatile memory for data loss prevention. Therefore, we consider the increase in design complexity to be a worthwhile trade-off, providing a significant contribution to the field of heterogeneous memory systems. \par

Loading…
Cancel
Save