% give (after all, the supervisors have to be there for
% something, too).
This bachelor's thesis examines heterogeneous memory systems, which have emerged from advancements in main memory technologies such as Non-Volatile RAM (NVRAM), \glsentryfirst{hbm}, and Remote Memory. These systems require deliberate data placement for optimal performance, which necessitates moving data between the different memory tiers. The burden of maintaining optimal placement falls on the CPU, reducing the cycles available for computation. In response, Intel introduced the \glsentryfirst{dsa}, which offloads data operations from the CPU and thereby offers a path to higher efficiency in data-intensive applications. The goal of this thesis is a comprehensive analysis and characterization of the architecture and performance of the \gls{dsa}, along with its application to a domain-specific prefetching methodology for accelerating database queries in heterogeneous systems. \par
% the rest of the thesis. Usually this needs at least 4 pages; nobody
% reads more than 10 pages.
\begin{itemize}
\item Provide an overview of the challenges posed by Non-Uniform Memory Architectures (NUMA) with diverse non-functional properties, such as varying throughput, in modern hardware design.
\item Introduce the Intel Data Streaming Accelerator (Intel DSA) and its role in optimizing streaming data movement operations for various applications.
\item Explain the concept of Query Driven Prefetching (\gls{qdp}) and its significance in optimizing database performance through intelligent prefetching.
\item Clearly state the main objectives of the thesis, emphasizing the analysis and characterization of the Intel DSA architecture and the application of \gls{dsa} to \gls{qdp}.
\item Outline the primary contributions of the work, including the implementation of a cache within \gls{qdp} and the execution and analysis of microbenchmarks.
\item Emphasize the significance of these contributions in addressing the challenges of prefetching in a heterogeneous memory system and providing practical insights for other researchers and developers.
\item Provide an overview of the structure, outlining the main sections and their respective content.
\end{itemize}
The development of various technologies like ... brings with it a diverse landscape of systems with differing tiers of main memory. In these systems, moving data between memory classes becomes necessary to exploit the properties of the different available technologies. Because the CPU is traditionally responsible for managing data locality, these heterogeneous memories place a burden on the processor, consuming cycles that would be better spent on computation. To relieve the CPU of common data operations, certain Intel server CPUs now come equipped with the \glsentryfirst{dsa}~\cite{intel:xeonbrief}. \par
Alongside the \gls{dsa}, Query Driven Prefetching (\gls{qdp}) serves as a methodology for optimizing database performance through intelligent prefetching. This section elaborates on the principles of \gls{qdp} and its significance for database query execution, particularly in heterogeneous memory systems. \par
In response to these challenges, this thesis investigates how data movement operations can be optimized. Central to this investigation is the \gls{dsa}, which accelerates streaming data movement operations across various applications. Understanding the architecture and functionality of the \gls{dsa} is essential for addressing the challenges presented by this new form of \gls{numa}. \par
The objectives of this thesis are twofold. First, it undertakes a comprehensive analysis and characterization of the \gls{dsa} architecture. Second, it applies the \gls{dsa} in the domain-specific context of \gls{qdp} to accelerate database queries~\cite{dimes-prefetching}. The thesis thereby explores how the \gls{dsa} can be used to address the challenges posed by heterogeneous memory systems, providing insights into the integration of data streaming acceleration with intelligent prefetching. \par
This work makes two main contributions. The first is the design and implementation of an offloading cache, which presents an interface for exploiting the strengths of tiered storage with low integration effort. The code is available in the accompanying repository~\cite{thesis-repo} under \texttt{offloading-cacher}. The second is a detailed examination of the strengths and weaknesses of the \gls{dsa} through microbenchmarks, which serve as practical guidance for applying the \gls{dsa} in diverse scenarios. At the time of writing, this thesis is the first scientific work to evaluate the \gls{dsa} in a multi-socket system and to provide benchmarks for programming it through the \glsentryfirst{intel:dml}. The performance of data movement from \glsentryshort{dram} to \gls{hbm} using the \gls{dsa} has also not previously been evaluated by the scientific community. \par
The Technical Background chapter provides the background information relevant to the rest of this work, including \gls{hbm}, \gls{qdp}, and the \gls{dsa} with its programming interface. We also provide guidance on system setup and configuration. Following this, the Performance Microbenchmarks chapter examines the strengths and weaknesses of the \gls{dsa}. We present our methodology, describe each benchmark in detail, and conclude by deriving usage guidance from the results. The subsequent chapters, Design and Implementation, cover the practical aspects of the work. We develop the interface and implementation of an offloading cache, discussing specific design considerations and implementation challenges. The Evaluation chapter assesses the implemented solution. In the Conclusion, we reflect on the insights gained and review the contributions and results of the previous chapters. \par