% give (the supervisors have to be there for something,
% after all).
This bachelor's thesis examines heterogeneous memory systems, which have emerged with advancements in main memory technologies such as Non-Volatile RAM (NVRAM), \glsentryfirst{hbm}, and Remote Memory. Performance in these systems depends on where data is placed, which requires moving data between the different storage tiers. The burden of maintaining optimal data placement falls on the CPU, reducing the cycles available for computation. In response, Intel has introduced the \glsentryfirst{dsa}, which offloads data operations and thereby offers a path to higher efficiency in data-intensive applications. The goal of this thesis is a comprehensive analysis and characterization of the architecture and performance of the \gls{dsa}, along with its application to a domain-specific prefetching methodology for accelerating database queries in heterogeneous memory systems. \par
% the rest of the thesis. This usually takes at least 4 pages; nobody
% reads more than 10 pages.
The development of various technologies, such as ..., has produced a diverse landscape of systems with differing tiers of main memory. In these systems, moving data between memory classes becomes necessary to exploit the properties of the different available technologies. Because the CPU is traditionally responsible for managing data locality, these heterogeneous memories place a burden on the processor, cutting into cycles that would be better spent on processing. To relieve the CPU of common data operations, certain Intel server CPUs now come equipped with the \glsentryfirst{dsa} \cite{intel:xeonbrief}. \par
In response to these challenges, this thesis addresses the optimization of data movement operations. Central to this work is the \gls{dsa}, which accelerates streaming data movement operations across a variety of applications. Understanding the architecture and functionality of the \gls{dsa} is essential for addressing the challenges presented by this new form of \gls{numa}. \par
The main objectives of this thesis are twofold. First, it undertakes a comprehensive analysis and characterization of the \gls{dsa} architecture. Second, it applies the \gls{dsa} in the domain-specific context of \glsentryfirst{qdp} to accelerate database queries \cite{dimes-prefetching}. The thesis explores how the \gls{dsa} can be used to address the challenges posed by heterogeneous memory systems, providing insights into the integration of data streaming acceleration with intelligent prefetching. \par
This work makes several contributions. The design and implementation of an offloading cache is a key highlight, presenting an interface for utilizing the strengths of tiered storage with low integration effort. The code is available in the accompanying repository \cite{thesis-repo} under \texttt{offloading-cacher}. Additionally, the thesis includes a detailed examination of the strengths and weaknesses of the \gls{dsa} through microbenchmarks. These benchmarks serve as a practical guideline for applying the \gls{dsa} in diverse scenarios. At the time of writing, this thesis is the first scientific work to evaluate the \gls{dsa} in a multi-socket system and to provide benchmarks for programming through the \glsentryfirst{dml}. Performance for data movement from \glsentryshort{dram} to \gls{hbm} using the \gls{dsa} has also not yet been evaluated by the scientific community. \par
The Technical Background chapter presents background information relevant for the rest of this work, including \gls{hbm}, \gls{qdp}, and the \gls{dsa} with its programming interface \cite{intel:dmldoc}. We also provide guidance on system setup and configuration. Following this, the Performance Microbenchmarks chapter examines the strengths and weaknesses of the \gls{dsa}. We present our methodology, describe each benchmark in detail, and conclude by drawing usage guidance from the results. The subsequent chapters, Design and Implementation, cover the practical aspects of the work. We develop the interface and implementation of an offloading cache and discuss specific design considerations and implementation challenges. The Evaluation chapter assesses the implemented solution. In the Conclusion, we reflect on insights gained and review the contributions and results of the previous chapters. \par