\chapter{Conclusion And Outlook}
\label{chap:conclusion}

% Conclusions, questions, outlook

% This chapter is certainly the hardest one to write. It serves as a
% condensed summary of what has been learned. It may be peppered with
% backward references into the text, to give the lazy but interested
% reader (the usual case) one more chance to learn about the topic in
% more depth. Some good theses raise more problems than they solve.
% It is fine to admit and discuss this. Where appropriate, one can
% also write what one still intends to do in this matter, or give
% successors a few tips. However, one should not force questions that
% do not exist at any cost in order to suggest to the reader how
% far-sighted one is. This chapter must be short so that it gets
% read. It should also contain "Future Work".

This work set out to analyse the architecture and performance of the \glsentrylong{dsa} and to integrate it into \glsentrylong{qdp}. We characterized the hardware and software architecture of the \gls{dsa} in Section \ref{sec:state:dsa} and gave an overview of the available programming interface, \glsentrylong{intel:dml}, in Section \ref{sec:state:dml}. Our benchmarks were tailored to the planned application and covered copy performance from \glsentryshort{dram} to \gls{hbm} (Section \ref{subsec:perf:datacopy}), the cost of multithreaded work submission (Section \ref{subsec:perf:mtsubmit}), and an analysis of different submission methods and sizes (Section \ref{subsec:perf:submitmethod}). Notably, we observed an anomaly in inter-socket copy speeds and found that throughput scaling was distinctly below linear (see Figure \ref{fig:perf-dsa-analysis:scaling}).
Although not every observation could be explained, the results provide important insights into the behaviour of the \gls{dsa} and its potential application in multi-socket systems and with \gls{hbm}, complementing existing analyses \cite{intel:analysis}. \par

Upon applying the cache developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}, we encountered challenges related to the available memory bandwidth and to missing feature support in the \glsentryshort{api} used to interact with the \gls{dsa}. While the \texttt{Cache} represents a substantial contribution to the field, its applicability is constrained to data that is infrequently mutated. Entry invalidation is supported but rudimentary: the developer must invalidate entries manually, keep track of cached blocks, and ensure that they do not overlap (see Section \ref{sec:design:restrictions}). To address this, a custom container data type could be developed that automatically triggers invalidation through the cache upon modification and attaches age tags to the data, which consumer threads can pass on. These tags can then be used to verify that work was completed on the most current version. \par

In Section \ref{sec:eval:observations}, we observed adverse effects when prefetching with the cache during the parallel execution of memory-bound operations. This made it necessary to distribute data across multiple \glsentrylong{numa:node}s in order to avoid bandwidth competition caused by parallel caching operations. We do not consider this limitation a major fault of the \texttt{Cache}, as existing applications designed for \gls{numa} systems are likely already optimized in this regard. \par

As highlighted in Sections \ref{sec:state:dml} and \ref{sec:impl:application}, the \gls{api} used to interact with the \gls{dsa} currently lacks support for interrupt-based completion waiting and for the use of \glsentrylong{dsa:dwq}.
Future development efforts may focus on direct \gls{dsa} access, bypassing the \glsentrylong{intel:dml}, to leverage the complete feature set. In particular, interrupt-based waiting would significantly enhance the usability of the \texttt{Cache}, which currently only supports busy-waiting\todo{mention that busy waiting goes against the rationale of offloading data copy, only useful to reduce power consumption for sync-copy, cite dsaanalysis for this}. \todo{we could also mention dwq to reduce overhead of cache and therefore time spent in scanb} \par

Although the preceding paragraphs and the results in Chapter \ref{chap:evaluation} might suggest that the \texttt{Cache} requires extensive refinement before it can be used in production applications, we argue the opposite. Under favourable conditions, as assumed for \glsentryshort{numa}-aware applications, we observed a significant speed-up of database queries when using the \texttt{Cache} for prefetching to \glsentrylong{hbm}. Its utility is not limited to prefetching; it also offers a solution for handling transfers from \gls{remotemem} to faster local storage or for replicating data to \gls{nvram}. Further evaluation of the cache's effectiveness in these scenarios and additional benchmarks on more complex queries for \gls{qdp} may provide deeper insights into its applicability. A performance comparison between prefetching to \gls{hbm} using knowledge of the upcoming queries and the data they access, and \enquote{HBM Cache Mode} (see Section \ref{sec:state:hbm}) could also yield interesting insights. \par

In conclusion, the \texttt{Cache} and the sections on the \gls{dsa}'s architecture (Section \ref{sec:state:dsa}) and performance characteristics (Section \ref{sec:perf:bench}) together fulfil the stated goal of this work.
We have achieved performance gains through the \gls{dsa} in \gls{qdp}, thereby demonstrating its potential to facilitate the exploitation of the properties offered by the various storage tiers in heterogeneous memory systems. \par

%%% Local Variables:
%%% TeX-master: "diplom"
%%% End: