@ -55,7 +55,7 @@ The simple query presents a challenging scenario for the \texttt{Cache}. The exe
In this section, we will present our findings from integrating the \texttt{Cache} developed in Chapters \ref{chap:design} and \ref{chap:implementation} into \gls{qdp}. We commence by presenting results obtained without prefetching, which serve as a reference for evaluating the effectiveness of our \texttt{Cache}. For all results presented, the amount of threads per processing stage and the amount of groups influences performance \cite{dimes-prefetching}, which however is out of scope for this work. Therefore, results shown are for the best configurations measured. \par
The plots for detailed timing are normalized so that the longest running configuration fills the half-circle. As waiting times at the barriers, which can vary by workload, are not displayed here, the graphs do not fully represent the total execution time. Additionally, the total runtime also encompasses some overhead that the per-task timings do not cover. Therefore, a discrepancy between the raw runtime values from the Tables and the Figures may be observed. \par
The plots for detailed timing are normalized so that the longest running configuration fills the half-circle. As waiting times at the barriers, which can vary by workload, are not displayed here, the graphs do not fully represent the total execution time. Additionally, the total runtime also encompasses some overhead that the per-task timings do not cover. Therefore, a discrepancy between the raw values from the Tables, and the Figures may be observed. \par
\subsection{Benchmarks without Prefetching}
@ -67,7 +67,7 @@ Due to the higher bandwidth provided by accessing column \texttt{b} through \gls
\subsection{Benchmarks using Prefetching}
To address the challenges posed by sharing memory bandwidth between both \(SCAN\)-operations, we will conduct the prefetching benchmarking in two configurations. Firstly, both columns \texttt{a} and \texttt{b} will be situated on the same \gls{numa:node}. We anticipate demonstrating the memory bottleneck in this scenario, through increased execution time of \(SCAN_a\). Secondly, we will distribute the columns across two \gls{numa:node}s, both still utilizing \glsentryshort{dram}. In this configuration, the memory bottleneck is alleviated, leading us to anticipate better performance compared to the former setup. \par
To address the challenges posed by sharing memory bandwidth between both \(SCAN\)-operations, we will conduct the prefetching benchmarking in two configurations. Firstly, both columns \texttt{a} and \texttt{b} will be situated on the same \gls{numa:node}. We anticipate demonstrating the memory bottleneck in this scenario through increased execution time of \(SCAN_a\). Secondly, we will distribute the columns across two \gls{numa:node}s, both still utilizing \glsentryshort{dram}. In this configuration, the memory bottleneck is alleviated, leading us to anticipate better performance compared to the former setup. \par
\begin{table}[t]
\centering
@ -97,7 +97,7 @@ We now examine Table \ref{table:qdp-speedup}, where a slowdown is shown for pref
\label{fig:timing-results}
\end{figure}
In Figure \ref{fig:timing-results:prefetch}, the competition for bandwidth between \(SCAN_a\) and \(SCAN_b\) is evident, with \(SCAN_a\) showing significantly longer execution times. \(SCAN_b\) is nearly unaffected, as it offloads memory access to the \gls{dsa} through the \texttt{Cache}, thereby not showing extended runtime from the throughput bottleneck. This prolonged duration of execution in \(SCAN_a\) leads to extended overlaps between groups still processing the scan and those engaged in \(AGGREGATE\). Consequently, despite the relatively high cache hit rate (see Table \ref{table:qdp-speedup}), minimal speed-up is observed for \(AGGREGATE\) compared to the baseline depicted in Figure \ref{fig:timing-comparison:baseline}. The extended runtime can be attributed to the prolonged duration of \(SCAN_a\). \par
In Figure \ref{fig:timing-results:prefetch}, the competition for bandwidth between \(SCAN_a\) and \(SCAN_b\) is evident, with \(SCAN_a\) showing significantly longer execution times. \(SCAN_b\) is nearly unaffected, as it offloads memory access to the \gls{dsa} through the \texttt{Cache}, thereby not showing extended runtime from the throughput bottleneck. This prolonged duration of execution in \(SCAN_a\) leads to extended overlaps between groups still processing the \(SCAN\)-stage and those engaged in \(AGGREGATE\). Consequently, despite the relatively high cache hit rate (see Table \ref{table:qdp-speedup}), minimal speed-up is observed for \(AGGREGATE\) compared to the baseline depicted in Figure \ref{fig:timing-comparison:baseline}. The extended runtime can be attributed to the prolonged duration of \(SCAN_a\). \par
Regarding the benchmark depicted in Figure \ref{fig:timing-results:prefetch}, where we distributed columns \texttt{a} and \texttt{b} across two nodes, the parallel execution of prefetching tasks on \gls{dsa} does not directly impede the bandwidth available to \(SCAN_a\). However, there is a discernible overhead associated with cache utilization, as evident in the time spent in \(SCAN_b\). Consequently, both \(SCAN_a\) and \(AGGREGATE\) operations experience slightly longer execution times than the theoretical peak our upper-limit in Figure \ref{fig:timing-comparison:upplimit} exhibits. \par
@ -105,7 +105,7 @@ Regarding the benchmark depicted in Figure \ref{fig:timing-results:prefetch}, wh
In Section \ref{sec:eval:expectations}, we anticipated that the simple query would pose a challenging case for prefetching. This expectation proved to be accurate, highlighting that improper data distribution can lead to adverse effects on performance when utilizing the \texttt{Cache}. Thus, we consider the chosen scenario to be well-suited, as it showcases both performance gains and losses, underscoring the importance of optimizing parameters and scenarios to achieve positive outcomes. \par
The necessity to distribute data across \gls{numa:node}s is seen as practical, given that developers commonly apply this optimization to leverage the available memory bandwidth of \glsentrylong{numa}s. Consequently, the \texttt{Cache} has demonstrated its effectiveness by achieving a respectable speed-up positioned directly between the baseline and the theoretical upper limit (see Table \ref{table:qdp-speedup}). \par
The necessity to distribute data across \gls{numa:node}s is seen as practical, given that developers commonly apply this optimization to leverage the available memory bandwidth of \glsentrylong{numa}s. Consequently, the \texttt{Cache} has demonstrated its effectiveness by achieving a respectable speed-up, positioned directly between the baseline and the theoretical upper limit (see Table \ref{table:qdp-speedup}). \par
As stated in Chapter \ref{chap:design}, the decision to design and implement a cache instead of focusing solely on prefetching was made to enhance the usefulness of this work's contribution. While our tests were conducted on a system with \gls{hbm}, other advancements in main memory technologies, such as \gls{nvram}, were not considered. Despite the methods of the \texttt{Cache}-class being named with usage as a cache in mind, its utility extends beyond this scope, providing flexibility through the policy functions, described in Section \ref{subsec:design:policy-functions}. Potential applications include replication to \gls{nvram} for data loss prevention, or restoring from \gls{nvram} for faster processing. Therefore, we consider the increase in design complexity to be a worthwhile trade-off, providing a significant contribution to the field of heterogeneous memory systems. \par
In this work, our aim was to analyse the architecture and performance of the \glsentrylong{dsa} and integrate it into \glsentrylong{qdp}. We characterized the hardware and software architecture of the \gls{dsa} in Section \ref{sec:state:dsa} and provided an overview of the available programming interface, \glsentrylong{intel:dml}, in Section \ref{sec:state:dml}. Our benchmarks were tailored to the planned application and included evaluations such as copy performance from \glsentryshort{dram} to \gls{hbm} (Section \ref{subsec:perf:datacopy}), the cost of multithreaded work submission (Section \ref{subsec:perf:mtsubmit}), and an analysis of different submission methods and sizes (Section \ref{subsec:perf:submitmethod}). Notably, we observed an anomaly in inter-socket copy speeds and found that the scaling of throughput was distinctly below linear (see Figure \ref{fig:perf-dsa-analysis:scaling}). Although not all observations were explainable, the results provided important insights into the behaviour of the \gls{dsa} and its potential application in multi-socket systems and \gls{hbm}, complementing existing analyses \cite{intel:analysis}. \par
Upon applying the cache developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}, we encountered challenges related to available memory bandwidth and the lack of feature support in the \glsentryshort{api} used to interact with the \gls{dsa}. While the \texttt{Cache} represents a substantial contribution to the field, its applicability is constrained to data that is infrequently mutated. Although support exists for entry invalidation, it is rather rudimentary, requiring manual invalidation and the developer to keep track of cached blocks and ensure they are not overlapping (see Section \ref{sec:design:restrictions}). To address this, a custom container data type could be developed to automatically trigger invalidation through the cache upon modification and adding age tags to the data, which consumer threads can pass on. This tagging can then be used to verify that a threads work was performed on current data. \par
Upon applying the cache developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}, we encountered challenges related to available memory bandwidth and the lack of feature support in the \glsentryshort{api} used to interact with the \gls{dsa}. While the \texttt{Cache} represents a substantial contribution to the field, its applicability is constrained to data that is infrequently mutated. Although support exists for entry invalidation, it is rather rudimentary, requiring manual invalidation and the developer to keep track of cached blocks and ensure they are not overlapping (see Section \ref{sec:design:restrictions}). To address the difficulties posed by mutable data, a custom container could be developed, as mentioned in Section \ref{sec:design:restrictions}. \par
In Section \ref{sec:eval:observations}, we observed adverse effects when prefetching with the cache during the parallel execution of memory-bound operations. This necessitated data distribution across multiple \glsentrylong{numa:node}s to circumvent bandwidth competition caused by parallel caching operations. Despite this limitation, we do not consider it a major fault of the \texttt{Cache}, as existing applications designed for \gls{numa} systems are likely already optimized in this regard. \par
As highlighted in Sections \ref{sec:state:dml} and \ref{sec:impl:application}, the \gls{api} utilized to interact with the \gls{dsa} currently lacks support for interrupt-based completion waiting and the use of \glsentrylong{dsa:dwq}. Future development efforts may focus on direct \gls{dsa} access, bypassing the \glsentrylong{intel:dml}, to leverage features of the \gls{dsa} not implemented in the library. Particularly, interrupt-based waiting would significantly enhance the usability of the \texttt{Cache}, currently only supporting busy-waiting. This lead us to extend the design by implement weak-waiting in Section \ref{sec:impl:application}, favouring cache misses instead of wasting resources during the wait. Additionally, access through a \glsentrylong{dsa:dwq} has the potential to reduce submission cost and thereby increase the \texttt{Caches'} effectiveness. \par
\pagebreak
Although the preceding paragraphs and the results in Chapter \ref{chap:evaluation} might suggest that the \texttt{Cache} requires extensive refinement for production applications, we argue the opposite. Under favourable conditions we observed significant speed-up using the \texttt{Cache} for prefetching to \glsentrylong{hbm}, accelerating database queries. Given that these conditions align with those typically found in \gls{numa}-optimized applications, such a prerequisite is not unrealistic to expect. The utility of the \texttt{Cache} is not limited to prefetching alone; it offers a solution for replicating data to or from \gls{nvram} and might prove applicable to other use cases. Additional benchmarks on more complex queries for \gls{qdp} and a comparison between prefetching to \gls{hbm} and \enquote{HBM Cache Mode} (see Section \ref{sec:state:hbm}) could yield deeper insights into the \texttt{Caches'} performance. \par
In conclusion, the developed library together with our exploration of architecture and performance of the \gls{dsa} fulfil the stated goal of this work. We have achieved performance gains through offloading data movement for \gls{qdp}, thereby demonstrating the \gls{dsa}s potential to facilitate the exploitation of the properties offered by the various storage tiers in heterogeneous memory systems. \par