
start processing todos for abstract, intro and state | introduce glossary entries for nvram and remote mem, add image for hbm design layout, other smaller changes like rewrites

master
Constantin Fürst, 3 months ago
commit ce15b7e203
  1. thesis/content/02_abstract.tex (4)
  2. thesis/content/10_introduction.tex (8)
  3. thesis/content/20_state.tex (24)
  4. thesis/images/hbm-design-layout.png (BIN)
  5. thesis/own.bib (27)
  6. thesis/own.gls (17)

thesis/content/02_abstract.tex (4)

@@ -8,9 +8,7 @@
% geben (für irgendetwas müssen die Betreuer ja auch noch da
% sein).
This bachelor's thesis explores data locality in heterogeneous memory systems, characterized by advancements in main memory technologies such as Non-Volatile RAM (NVRAM), High Bandwidth Memory (HBM), and Remote Memory. Systems equipped with more than one type of main memory necessitate strategic decisions regarding data placement to take advantage of the properties of the different storage tiers. In response to this challenge, Intel has introduced the Data Streaming Accelerator (DSA), which offloads data operations, offering a potential avenue for enhancing efficiency in data-intensive applications. The primary objective of this thesis is to provide a comprehensive analysis and characterization of the architecture and performance of the DSA, along with its application to a domain-specific prefetching methodology aimed at accelerating database queries within heterogeneous memory systems. We contribute a versatile implementation of a cache, offloading copy operations to the DSA.
\todo{double "offloading" maybe reformulate the last sentence}
This bachelor's thesis explores data locality in heterogeneous memory systems, characterized by advancements in main memory technologies such as Non-Volatile RAM (NVRAM), High Bandwidth Memory (HBM), and Remote Memory. Systems equipped with more than one type of main memory necessitate strategic decisions regarding data placement to take advantage of the properties of the different storage tiers. In response to this challenge, Intel has introduced the Data Streaming Accelerator (DSA), which offloads data operations, offering a potential avenue for enhancing efficiency in data-intensive applications. The primary objective of this thesis is to provide a comprehensive analysis and characterization of the architecture and performance of the DSA, along with its application to a domain-specific prefetching methodology aimed at accelerating database queries within heterogeneous memory systems. We contribute a versatile library, capable of performing caching, data replication and prefetching asynchronously, accelerated by the \gls{dsa}.
%%% Local Variables:
%%% TeX-master: "diplom"

thesis/content/10_introduction.tex (8)

@@ -12,15 +12,13 @@
% den Rest der Arbeit. Meist braucht man mindestens 4 Seiten dafür, mehr
% als 10 Seiten liest keiner.
The proliferation of various technologies, such as Non-Volatile RAM (NVRAM), High Bandwidth Memory (HBM), and Remote Memory, has ushered in a diverse landscape of systems characterized by varying tiers of main memory. Within these systems, the movement of data across memory classes becomes imperative to leverage the distinct properties offered by the available technologies. The responsibility for maintaining optimal data placement falls upon the CPU, resulting in a reduction of available cycles for computational tasks. To mitigate this strain, certain current-generation Intel server processors feature the \glsentryfirst{dsa}, to which certain data operations may be offloaded \cite{intel:xeonbrief}. This thesis undertakes the challenge of optimizing data locality in heterogeneous memory architectures, utilizing the \gls{dsa}. \par
The proliferation of various technologies, such as \gls{nvram}, \gls{hbm}, and \gls{remotemem}, has ushered in a diverse landscape of systems characterized by varying tiers of main memory. Within these systems, the movement of data across memory classes becomes imperative to leverage the distinct properties offered by the available technologies. The responsibility for maintaining optimal data placement falls upon the CPU, resulting in a reduction of available cycles for computational tasks. To mitigate this strain, certain current-generation Intel server processors feature the \glsentryfirst{dsa}, to which certain data operations may be offloaded \cite{intel:xeonbrief}. This thesis undertakes the challenge of optimizing data locality in heterogeneous memory architectures, utilizing the \gls{dsa}. \par
The primary objectives of this thesis are twofold. Firstly, it involves a comprehensive analysis and characterization of the architecture of the Intel \gls{dsa}. Secondly, the focus extends to the application of \gls{dsa} in the domain-specific context of \glsentryfirst{qdp} to accelerate database queries \cite{dimes-prefetching}. \par
This work introduces significant contributions to the field. Notably, the design and implementation of an offloading cache represent a key highlight, providing an interface for leveraging the strengths of tiered storage with minimal integration efforts. Its design and implementation make up a large part of this work. This resulted in an architecture applicable to any use case requiring \glsentryshort{numa}-aware data movement with offloading support to the \gls{dsa}, while giving thread safety guarantees. Additionally, the thesis includes a detailed examination and analysis of the strengths and weaknesses of the \gls{dsa} through microbenchmarks. These benchmarks serve as practical guidelines, offering insights for the optimal application of \gls{dsa} in various scenarios. As of the time of writing, this thesis stands as the first scientific work to extensively evaluate the \gls{dsa} in a multi-socket system and provide benchmarks for programming through the \glsentryfirst{intel:dml}. Furthermore, performance for data movement from \glsentryshort{dram} to \glsentryfirst{hbm} using \gls{dsa} has, to our knowledge, not yet been evaluated by the scientific community. \par
This work introduces significant contributions to the field. Notably, the design and implementation of an offloading cache represent a key highlight, providing an interface for leveraging the strengths of tiered storage with minimal integration efforts. Its design and implementation make up a large part of this work. This resulted in an architecture applicable to any use case requiring \glsentryshort{numa}-aware data movement with offloading support to the \gls{dsa}. Additionally, the thesis includes a detailed examination and analysis of the strengths and weaknesses of the \gls{dsa} through microbenchmarks. These benchmarks serve as practical guidelines, offering insights for the optimal application of \gls{dsa} in various scenarios. To our knowledge, this thesis stands as the first scientific work to extensively evaluate the \gls{dsa} in a multi-socket system, provide benchmarks for programming through the \glsentryfirst{intel:dml}, and evaluate performance for data movement from \glsentryshort{dram} to \glsentryfirst{hbm}. \par
We begin the work by furnishing the reader with pertinent technical information necessary for understanding the subsequent sections of this work in Chapter \ref{chap:state}. Background is given for \gls{hbm} and \gls{qdp}, followed by a detailed account of the \gls{dsa}s architecture along with an available programming interface. Additionally, guidance on system setup and configuration is provided. Subsequently, Chapter \ref{chap:perf} analyses the strengths and weaknesses of the \gls{dsa} through microbenchmarks. Methodologies are presented, each benchmark is elaborated upon in detail, and usage guidance is drawn from the results. Chapters \ref{chap:design} and \ref{chap:implementation} elucidate the practical aspects of the work, including the development of the interface and implementation of the cache, shedding light on specific design considerations and implementation challenges. We comprehensively assess the implemented solution by providing concrete data on gains for an exemplary database query in Chapter \ref{chap:evaluation}. Finally, Chapter \ref{chap:conclusion} reflects insights gained, and presents a review of the contributions and results of the preceding chapters. \par
\todo{shorten intro by two lines so that it fits on one page}
We begin the work by furnishing the reader with pertinent technical information necessary for understanding the subsequent sections of this work in Chapter \ref{chap:state}. Background is given for \gls{hbm} and \gls{qdp}, followed by a detailed account of the \gls{dsa}s architecture along with an available programming interface. Additionally, guidance on system setup and configuration is provided. Subsequently, Chapter \ref{chap:perf} analyses the strengths and weaknesses of the \gls{dsa} through microbenchmarks. Each benchmark is elaborated upon in detail, and usage guidance is drawn from the results. Chapters \ref{chap:design} and \ref{chap:implementation} elucidate the practical aspects of the work, including the development of the interface and implementation of the cache, shedding light on specific design considerations and implementation challenges. We comprehensively assess the implemented solution by providing concrete data on gains for an exemplary database query in Chapter \ref{chap:evaluation}. Finally, Chapter \ref{chap:conclusion} reflects insights gained, and presents a review of the contributions and results of the preceding chapters. \par
%%% Local Variables:
%%% TeX-master: "diplom"

thesis/content/20_state.tex (24)

@@ -34,10 +34,14 @@ This chapter introduces the relevant technologies and concepts for this thesis.
\section{\glsentrylong{hbm}}
\label{sec:state:hbm}
\glsentrylong{hbm} is an emerging memory technology that promises an increase in peak bandwidth. It consists of stacked \glsentryshort{dram} dies \cite[p. 1]{hbm-arch-paper} and is gradually being integrated into server processors, with the Intel® Xeon® Max Series \cite{intel:xeonmaxbrief} being one recent example. \gls{hbm} on these systems can be configured in different memory modes, most notably, HBM Flat Mode and HBM Cache Mode \cite{intel:xeonmaxbrief}. The former gives applications direct control, requiring code changes, while the latter utilizes the \gls{hbm} as a cache for the system's \glsentryshort{dram}-based main memory \cite{intel:xeonmaxbrief}. \par
\begin{figure}[!t]
\centering
\includegraphics[width=0.9\textwidth]{images/hbm-design-layout.png}
\caption{\glsentrylong{hbm} Design Layout. Shows that an \glsentryshort{hbm} module consists of stacked \glsentryshort{dram} dies and a logic die. \cite{amd:hbmoverview}}
\label{fig:hbm-layout}
\end{figure}
\todo{find a nice graphic for hbm}
\todo{potential for futurework: analyze performance of using cache-mode}
\glsentrylong{hbm} is an emerging memory technology that promises an increase in peak bandwidth. As visible in Figure \ref{fig:hbm-layout}, it consists of stacked \glsentryshort{dram} dies \cite[p. 1]{hbm-arch-paper} and is gradually being integrated into server processors, with the Intel® Xeon® Max Series \cite{intel:xeonmaxbrief} being one recent example. \gls{hbm} on these systems can be configured in different memory modes, most notably, \enquote{HBM Flat Mode} and \enquote{HBM Cache Mode} \cite{intel:xeonmaxbrief}. The former gives applications direct control, requiring code changes, while the latter utilizes the \gls{hbm} as a cache for the system's \glsentryshort{dram}-based main memory \cite{intel:xeonmaxbrief}. \par
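To make the difference between the two modes more tangible, here is a minimal sketch, not taken from the thesis sources, of the kind of code change "HBM Flat Mode" requires: the HBM is exposed as separate memory-only NUMA nodes, so an application places data there explicitly, for example via libnuma. The node ID used below is an assumption and must be determined on the actual system (e.g. with numactl -H).

// Minimal sketch (assumptions: libnuma available, HBM exposed in Flat Mode as
// a memory-only NUMA node with ID 8 on this particular system; real IDs must
// be read from the machine). Link with -lnuma.
#include <numa.h>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <cstdlib>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "libnuma: NUMA not available on this system\n");
        return EXIT_FAILURE;
    }

    constexpr int hbm_node = 8;               // assumed HBM node, system dependent
    constexpr std::size_t size = 1UL << 30;   // 1 GiB working set

    // bind the allocation to the HBM node instead of the default DRAM node
    void* buffer = numa_alloc_onnode(size, hbm_node);
    if (buffer == nullptr) {
        return EXIT_FAILURE;
    }

    std::memset(buffer, 0, size);             // touch pages so placement takes effect

    numa_free(buffer, size);
    return EXIT_SUCCESS;
}

In "HBM Cache Mode", by contrast, no such change is needed, since the HBM transparently caches the DRAM-based main memory.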
\section{\glsentrylong{qdp}}
\label{sec:state:qdp}
@@ -81,7 +85,7 @@ The \gls{dsa} chip is directly integrated into the processor and attaches via th
\subsubsection{Virtual Address Resolution}
An important aspect of modern computer systems is the separation of address spaces through virtual memory \todo{maybe add cite and reformulate}. Therefore, the \gls{dsa} must handle address translation because a process submitting a task will not know the physical location in memory \todo{of what}, causing the descriptor to contain virtual addresses. To resolve these to physical addresses, the Engine communicates with the \gls{iommu} to perform this operation, as visible in the outward connections at the top of Figure \ref{fig:dsa-internal-block}. Knowledge about the submitting processes is required for this resolution. Therefore, each task descriptor has a field for the \gls{x86:pasid} which is filled by the \gls{x86:enqcmd} instruction for a \gls{dsa:swq} \cite[Sec. 3.3.1]{intel:dsaspec} or set statically after a process is attached to a \gls{dsa:dwq} \cite[Sec. 3.3.2]{intel:dsaspec}. \par
An important aspect of computer systems is the abstraction of physical memory addresses through virtual memory \cite{virtual-memory}. Therefore, the \gls{dsa} must handle address translation because a process submitting a task will not know the physical location in memory of its data, causing the descriptor to contain virtual addresses. To resolve these to physical addresses, the Engine communicates with the \gls{iommu} to perform this operation, as visible in the outward connections at the top of Figure \ref{fig:dsa-internal-block}. Knowledge about the submitting processes is required for this resolution. Therefore, each task descriptor has a field for the \gls{x86:pasid} which is filled by the \gls{x86:enqcmd} instruction for a \gls{dsa:swq} \cite[Sec. 3.3.1]{intel:dsaspec} or set statically after a process is attached to a \gls{dsa:dwq} \cite[Sec. 3.3.2]{intel:dsaspec}. \par
\subsubsection{Completion Signalling}
\label{subsubsec:state:completion-signal}
@@ -102,13 +106,9 @@ Ordering of operations is only guaranteed for a configuration with one \gls{dsa:
\label{fig:dsa-software-arch}
\end{figure}
Since Linux Kernel 5.10, there exists a driver for the \gls{dsa} which has no counterpart in the Windows OS-Family \cite[Sec. Installation]{intel:dmldoc} and other operating systems. Therefore, accessing the \gls{dsa} is only possible under Linux. To interact with the driver and perform configuration operations, Intel's accel-config \cite{intel:libaccel-config-repo} user-space toolset can be utilized. This application provides a command-line interface and can read configuration files to set up the device. The interaction is depicted in the upper block titled \enquote{User space} in Figure \ref{fig:dsa-software-arch}. It interacts with the kernel driver, visible in light green and labelled \enquote{IDXD} in Figure \ref{fig:dsa-software-arch}. After successful configuration, each \gls{dsa:wq} is exposed as a character device through \texttt{mmap} of the associated portal \cite[Sec. 3.3]{intel:analysis}. \par
With the appropriate file permissions, a process could submit work to the \gls{dsa} using either the \gls{x86:movdir64b} or \gls{x86:enqcmd} instructions, providing the descriptors by manual configuration. However, this process can be cumbersome, which is why \gls{intel:dml} exists. \par
With some limitations, like lacking support for \gls{dsa:dwq} submission, this library presents an interface that takes care of creation and submission of descriptors, and error handling and reporting. Thanks to the high-level-view the code may choose a different execution path at runtime which allows the memory operations to either be executed in hardware or software. The former on an accelerator or the latter using equivalent instructions provided by the library. This makes code using this library automatically compatible with systems that do not provide hardware support. \cite[Sec. Introduction]{intel:dmldoc} \par
Since Linux kernel version 5.10, a driver for the \gls{dsa} has been available, which currently lacks a counterpart on Windows operating systems \cite[Sec. Installation]{intel:dmldoc}. As a result, accessing the \gls{dsa} is only possible under Linux. To interact with the driver and perform configuration operations, Intel provides the accel-config user-space application \cite{intel:libaccel-config-repo}. This toolset offers a command-line interface and can read configuration files to configure the device, as mentioned in Section \ref{subsection:dsa-hwarch}. The interaction is illustrated in the upper block labelled \enquote{User space} in Figure \ref{fig:dsa-software-arch}, where it communicates with the kernel driver, depicted in light green and labelled \enquote{IDXD} in Figure \ref{fig:dsa-software-arch}. Once successfully configured, each \gls{dsa:wq} is exposed as a character device through \texttt{mmap} of the associated portal \cite[Sec. 3.3]{intel:analysis}. \par
\todo{reformulate section}
While a process could theoretically submit work to the \gls{dsa} using either the \gls{x86:movdir64b} or \gls{x86:enqcmd} instructions, providing descriptors through manual configuration, this approach can be cumbersome. Hence, \gls{intel:dml} exists to streamline this process. Despite some limitations, such as the lack of support for \gls{dsa:dwq} submission, this library offers an interface that manages the creation and submission of descriptors, as well as error handling and reporting. The high-level abstraction offered enables compatibility measures, allowing code developed for the \gls{dsa} to also execute on machines without the required hardware \cite[Sec. High-level C++ API, Advanced usage]{intel:dmldoc}. \par
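To illustrate why manual submission is described as cumbersome, the following is a rough sketch of the raw path that the library hides. It assumes a dedicated work queue that was already set up with accel-config and exposed as /dev/dsa/wq0.0 (the device path is an assumption and depends on the configuration), and it uses the descriptor definitions from the Linux uapi header linux/idxd.h; error handling is kept to a minimum.

// Rough sketch: one MEMMOVE descriptor submitted to a dedicated work queue
// (DWQ) via MOVDIR64B and polled for completion. Assumes the WQ is already
// configured through accel-config and exposed as /dev/dsa/wq0.0.
// Compile with: g++ -O2 -mmovdir64b manual_dsa.cpp
#include <linux/idxd.h>      // struct dsa_hw_desc, dsa_completion_record, flags
#include <immintrin.h>       // _movdir64b, _mm_pause
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// completion record must be 32-byte aligned; volatile because the device writes it
static volatile struct dsa_completion_record comp __attribute__((aligned(32)));

static bool dsa_memmove(void* wq_portal, void* dst, const void* src, uint32_t size) {
    struct dsa_hw_desc desc {};
    desc.opcode = DSA_OPCODE_MEMMOVE;
    // request a completion record and let the DSA block on page faults
    // (the latter only takes effect if the WQ is configured to allow it)
    desc.flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_BOF;
    desc.src_addr = reinterpret_cast<uintptr_t>(src);
    desc.dst_addr = reinterpret_cast<uintptr_t>(dst);
    desc.xfer_size = size;
    comp.status = 0;
    desc.completion_addr = reinterpret_cast<uintptr_t>(&comp);

    // MOVDIR64B writes the 64-byte descriptor to the portal of the DWQ;
    // a shared WQ would require ENQCMD and a retry check instead
    _movdir64b(wq_portal, &desc);

    while (comp.status == 0) { _mm_pause(); }    // spin until the device signals
    return comp.status == DSA_COMP_SUCCESS;
}

int main() {
    int fd = open("/dev/dsa/wq0.0", O_RDWR);     // device path is an assumption
    if (fd < 0) return 1;
    void* portal = mmap(nullptr, 0x1000, PROT_WRITE,
                        MAP_SHARED | MAP_POPULATE, fd, 0);
    if (portal == MAP_FAILED) { close(fd); return 1; }

    char src[4096] = "hello dsa", dst[4096] = {};
    bool ok = dsa_memmove(portal, dst, src, sizeof(src));

    munmap(portal, 0x1000);
    close(fd);
    return ok ? 0 : 1;
}

Everything in this block, from constructing the descriptor to polling the completion record, is what the library takes over in the section that follows.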
\section{Programming Interface for \glsentrylong{dsa}}
\label{sec:state:dml}
@@ -126,13 +126,13 @@ In the function header of Figure \ref{fig:dml-memcpy} two differences from stand
The \texttt{path} parameter allows the selection of the executing device, which can be either the CPU or \gls{dsa}. The options include \texttt{dml::software} (CPU), \texttt{dml::hardware} (\gls{dsa}), and \texttt{dml::automatic}, where the latter dynamically selects the device at runtime, favoring \gls{dsa} over CPU execution \cite[Sec. Quick Start]{intel:dmldoc}. \par
Choosing the engine which carries out the copy might be advantageous for performance, as we can see in Section \ref{subsec:perf:datacopy}. With the engine directly tied to the processing node, as observed in Section \ref{subsection:dsa-hwarch} \todo{not really observed there, cite maybe}, the node ID is equivalent to the ID of the \gls{dsa}. \par
Choosing the engine which carries out the copy might be advantageous for performance, as we can see in Section \ref{subsec:perf:datacopy}. This can either be achieved by pinning the current thread to the \gls{numa:node} that the device is located on, or by using optional parameters of \texttt{dml::submit} \cite[Sec. High-level C++ API, NUMA support]{intel:dmldoc}. As evident from Figure \ref{fig:dml-memcpy}, we chose the former option for this example, using \texttt{numa\_run\_on\_node} to restrict the current thread to run on the given node. As this is only an example, potential side effects of modifying the \glsentryshort{numa}-assignment by calling this pseudocode are not relevant. \par
\gls{intel:dml} operates on data views, which we create from the given pointers to source and destination and size. This is done using \texttt{dml::make\_view(uint8\_t* ptr, size\_t size)}, visible in Figure \ref{fig:dml-memcpy}, where these views are labelled \texttt{src\_view} and \texttt{dst\_view}. \cite[Sec. High-level C++ API, Make view]{intel:dmldoc} \par
In Figure \ref{fig:dml-memcpy}, we submit a single descriptor using the asynchronous operation from \gls{intel:dml}. This uses the function \texttt{dml::submit<path>}, which takes an operation type and parameters specific to the selected type and returns a handler to the submitted task. For the copy operation, we pass the two views created previously. The provided handler can later be queried for the completion of the operation. After submission, we poll for the task completion with \texttt{handler.get()} and check whether the operation completed successfully. \par
A noteworthy addition to the submission-call is the use of \texttt{.block\_on\_fault()}, enabling the \gls{dsa} to manage a page fault by coordinating with the operating system. It's essential to highlight that this functionality only operates if the device is configured to accept this flag. \cite[Sec. High-level C++ API, How to Use the Library]{intel:dmldoc} \cite[Sec. High-level C++ API, Page Fault handling]{intel:dmldoc}
A noteworthy addition to the submission call is the use of \texttt{.block\_on\_fault()}, enabling the \gls{dsa} to handle a page fault by coordinating with the operating system. It is essential to highlight that this functionality only works if the device is configured to accept this flag \cite[Sec. High-level C++ API, How to Use the Library]{intel:dmldoc} \cite[Sec. High-level C++ API, Page Fault handling]{intel:dmldoc}. \par
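Because Figure \ref{fig:dml-memcpy} itself is not part of this diff, the following sketch reconstructs the steps described in this section with the DML high-level C++ API. The wrapper name, the NUMA-node parameter and the choice of dml::mem_move for the copy operation are illustrative assumptions, not taken verbatim from the thesis sources.

// Sketch of the asynchronous memcpy described above, using the DML high-level
// C++ API. Wrapper name and use of dml::mem_move are illustrative assumptions.
// Link against libdml and libnuma.
#include <dml/dml.hpp>
#include <numa.h>
#include <cstdint>
#include <cstddef>

template <typename path>
bool dml_memcpy(uint8_t* dst, uint8_t* src, std::size_t size, int node) {
    // pin the submitting thread to the NUMA node the chosen DSA is attached to
    numa_run_on_node(node);

    // wrap the raw pointers into DML data views
    auto src_view = dml::make_view(src, size);
    auto dst_view = dml::make_view(dst, size);

    // submit asynchronously; block_on_fault() lets the DSA resolve page faults
    // (only honoured if the work queue is configured to accept the flag)
    auto handler = dml::submit<path>(dml::mem_move.block_on_fault(),
                                     src_view, dst_view);

    // poll for completion and check the status code
    auto result = handler.get();
    return result.status == dml::status_code::ok;
}

// usage, e.g.: dml_memcpy<dml::hardware>(dst, src, size, 0);
//          or: dml_memcpy<dml::automatic>(dst, src, size, 0);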
\section{System Setup and Configuration} \label{sec:state:setup-and-config}

thesis/images/hbm-design-layout.png (BIN)

Width: 1260  |  Height: 709  |  Size: 221 KiB

thesis/own.bib (27)

@@ -234,3 +234,30 @@
howpublished = {\url{https://timur.audio/dwcas-in-c}},
urldate = {2024-02-07}
}
@misc{amd:hbmoverview,
author = {AMD},
title = {{High-Bandwidth Memory (HBM)}},
urldate = {2024-02-14},
howpublished = {\url{https://www.amd.com/system/files/documents/high-bandwidth-memory-hbm.pdf}}
}
@article{virtual-memory,
author = {Peter J. Denning},
title = {{Virtual Memory}},
date = {1996-03},
publisher = {Association for Computing Machinery},
volume = {28},
number = {1},
doi = {10.1145/234313.234403},
journal = {ACM Computing Surveys},
pages = {213–216},
numpages = {4}
}
@misc{intel:xeonmax-ark,
author = {Intel},
title = {{Intel® Xeon® CPU Max 9468 Processor}},
urldate = {2024-02-14},
howpublished = {\url{https://ark.intel.com/content/www/us/en/ark/products/232596/intel-xeon-cpu-max-9468-processor-105m-cache-2-10-ghz.html}}
}

thesis/own.gls (17)

@@ -135,7 +135,8 @@
}
\newglossaryentry{mempress}{
name={memory pressure},
short={memory pressure},
name={Memory Pressure},
description={... desc ...}
}
@@ -145,4 +146,18 @@
long={Application Programming Interface},
first={Application Programming Interface (API)},
description={... desc ...}
}
\newglossaryentry{remotemem}{
short={Remote Memory},
name={Remote Memory},
description={... desc ...}
}
\newglossaryentry{nvram}{
short={NVRAM},
name={NVRAM},
long={Non-Volatile RAM},
first={Non-Volatile RAM (NVRAM)},
description={... desc ...}
}