bachelor-thesis/thesis/content/70_conclusion.tex

\chapter{Conclusion And Outlook}
\label{chap:conclusion}

%  Schlußfolgerungen, Fragen, Ausblicke

% Dieses Kapitel ist sicherlich das am Schwierigsten zu schreibende. Es
% dient einer gerafften Zusammenfassung dessen, was man gelernt hat. Es
% ist möglicherweise gespickt von Rückwärtsverweisen in den Text, um dem
% faulen aber interessierten Leser (der Regelfall) doch noch einmal die
% Chance zu geben, sich etwas fundierter weiterzubilden. Manche guten
% Arbeiten werfen mehr Probleme auf als sie lösen. Dies darf man ruhig
% zugeben und diskutieren. Man kann gegebenenfalls auch schreiben, was
% man in dieser Sache noch zu tun gedenkt oder den Nachfolgern ein paar
% Tips geben. Aber man sollte nicht um jeden Preis Fragen, die gar nicht
% da sind, mit Gewalt aufbringen und dem Leser suggerieren, wie
% weitsichtig man doch ist. Dieses Kapitel muß kurz sein, damit es
% gelesen wird. Sollte auch "Future Work" beinhalten.

In this work, our aim was to analyse the architecture and performance of the \glsentrylong{dsa} and integrate it into \glsentrylong{qdp}. We characterized the hardware and software architecture of the \gls{dsa} in Section \ref{sec:state:dsa} and provided an overview of the available programming interface, \glsentrylong{intel:dml}, in Section \ref{sec:state:dml}. Our benchmarks were tailored to the planned application and included evaluations such as copy performance from \glsentryshort{dram} to \gls{hbm} (Section \ref{subsec:perf:datacopy}), the cost of multithreaded work submission (Section \ref{subsec:perf:mtsubmit}), and an analysis of different submission methods and sizes (Section \ref{subsec:perf:submitmethod}). Notably, we observed an anomaly in inter-socket copy speeds and found that the scaling of throughput was distinctly below linear (see Figure \ref{fig:perf-dsa-analysis:scaling}). Although not all observations were explainable, the results provided important insights into the behaviour of the \gls{dsa} and its potential application in multi-socket systems and \gls{hbm}, complementing existing analyses \cite{intel:analysis}. \par

Upon applying the cache developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}, we encountered challenges related to available memory bandwidth and the lack of feature support in the \glsentryshort{api} used to interact with the \gls{dsa}. While the \texttt{Cache} represents a substantial contribution to the field, its applicability is constrained to data that is infrequently mutated. Although support exists for entry invalidation, it is rather rudimentary, requiring manual invalidation and the developer to keep track of cached blocks and ensure they are not overlapping (see Section \ref{sec:design:restrictions}). To address this, a custom container data type could be developed to automatically trigger invalidation through the cache upon modification and adding age tags to the data, which consumer threads can pass on. This tagging can then be used to verify that a threads work was performed on current data. \par

In Section \ref{sec:eval:observations}, we observed adverse effects when prefetching with the cache during the parallel execution of memory-bound operations. This necessitated data distribution across multiple \glsentrylong{numa:node}s to circumvent bandwidth competition caused by parallel caching operations. Despite this limitation, we do not consider it a major fault of the \texttt{Cache}, as existing applications designed for \gls{numa} systems are likely already optimized in this regard. \par

As highlighted in Sections \ref{sec:state:dml} and \ref{sec:impl:application}, the \gls{api} utilized to interact with the \gls{dsa} currently lacks support for interrupt-based completion waiting and the use of \glsentrylong{dsa:dwq}. Future development efforts may focus on direct \gls{dsa} access, bypassing the \glsentrylong{intel:dml}, to leverage features of the \gls{dsa} not implemented in the library. Particularly, interrupt-based waiting would significantly enhance the usability of the \texttt{Cache}, currently only supporting busy-waiting. This lead us to extend the design by implement weak-waiting in Section \ref{sec:impl:application}, favouring cache misses instead of wasting resources during the wait. Additionally, access through a \glsentrylong{dsa:dwq} has the potential to reduce submission cost and thereby increase the \texttt{Caches'} effectiveness. \par

Although the preceding paragraphs and the results in Chapter \ref{chap:evaluation} might suggest that the \texttt{Cache} requires extensive refinement for production applications, we argue the opposite. Under favourable conditions we observed significant speed-up using the \texttt{Cache} for prefetching to \glsentrylong{hbm}, accelerating database queries. Given that these conditions align with those typically found in \gls{numa}-optimized applications, such a prerequisite is not unrealistic to expect. The utility of the \texttt{Cache} is not limited to prefetching alone; it offers a solution for replicating data to or from \gls{nvram} and might prove applicable to other use cases. Additional benchmarks on more complex queries for \gls{qdp} and a comparison between prefetching to \gls{hbm} and \enquote{HBM Cache Mode} (see Section \ref{sec:state:hbm}) could yield deeper insights into the \texttt{Caches'} performance. \par

In conclusion, the developed library together with our exploration of architecture and performance of the \gls{dsa} fulfil the stated goal of this work. We have achieved performance gains through offloading data movement for \gls{qdp}, thereby demonstrating the \gls{dsa}s potential to facilitate the exploitation of the properties offered by the various storage tiers in heterogeneous memory systems. \par

%%% Local Variables:
%%% TeX-master: "diplom"
%%% End: