improve writing of chapter 7

Constantin Fürst, 3 months ago (commit 9488204aea)

Changed files: thesis/bachelor.pdf (BIN), thesis/content/70_conclusion.tex
% ... how far-sighted one is after all. This chapter must be short so
% that it gets read. It should also include "Future Work".
In this work, our aim was to analyse the architecture and performance of the \glsentrylong{dsa} and integrate it into \glsentrylong{qdp}. We characterized the hardware and software architecture of the \gls{dsa} in Section \ref{sec:state:dsa} and provided an overview of the available programming interface, \glsentrylong{intel:dml}, in Section \ref{sec:state:dml}. Our benchmarks were tailored to the planned application and included evaluations such as copy performance from \glsentryshort{dram} to \gls{hbm} (Section \ref{subsec:perf:datacopy}), the cost of multithreaded work submission (Section \ref{subsec:perf:mtsubmit}), and an analysis of different submission methods and sizes (Section \ref{subsec:perf:submitmethod}). Notably, we observed an anomaly in inter-socket copy speeds and found that the scaling of throughput was distinctly below linear (see Figure \ref{fig:perf-dsa-analysis:scaling}). Although not all observations were explainable, the results provided important insights into the behaviour of the \gls{dsa} and its potential application in multi-socket systems and \gls{hbm}, complementing existing analyses \cite{intel:analysis}. \par
Upon applying the cache developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}, we encountered challenges related to the available memory bandwidth and the lack of feature support in the \glsentryshort{api} used to interact with the \gls{dsa}. While the \texttt{Cache} represents a substantial contribution to the field, we review both its limitations and its effectiveness. At present, its applicability is constrained to data that is infrequently mutated. Although support exists for entry invalidation, no ordering guarantees are provided, necessitating manual invalidation by the programmer to ensure the validity of results (see Section \ref{subsec:design:restrictions}). To address this, a custom type could be developed that automatically triggers invalidation through the cache upon modification, utilizing timestamps or unique identifiers to verify that results are based on the current state. \par
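The self-invalidating type described above might be sketched as follows. This is a minimal illustration under stated assumptions: the name \texttt{Versioned}, the invalidation callback, and the version tag are our own illustrative choices and are not part of the actual \texttt{Cache} \glsentryshort{api}.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <utility>

// Sketch of a wrapper that notifies the cache on mutation and tags
// reads with a version, so consumers can detect stale results.
// All names here are hypothetical, not the thesis's actual API.
template <typename T>
class Versioned {
public:
    Versioned(T value, std::function<void(const T*)> on_invalidate)
        : value_(std::move(value)), on_invalidate_(std::move(on_invalidate)) {}

    // A read returns the value together with the version it was read at.
    struct Snapshot { T value; std::uint64_t version; };
    Snapshot read() const { return { value_, version_.load() }; }

    // A write first asks the cache to drop the stale entry, then
    // installs the new value and bumps the version counter.
    void write(T value) {
        on_invalidate_(&value_);   // hypothetical cache invalidation hook
        value_ = std::move(value);
        version_.fetch_add(1);
    }

    // A result tagged with a version is valid only if the source
    // has not been mutated since the result was produced.
    bool still_valid(std::uint64_t tagged_version) const {
        return tagged_version == version_.load();
    }

private:
    T value_;
    std::atomic<std::uint64_t> version_{0};
    std::function<void(const T*)> on_invalidate_;
};
```

A consumer would tag each derived result with the snapshot's version and recompute whenever \texttt{still\_valid} reports a mismatch.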
In Section \ref{sec:eval:observations}, we observed negative effects when prefetching with the cache during the parallel execution of memory-bound operations. For the simple query benchmarked (see Section \ref{sec:state:qdp}), these effects were pronounced: copy operations into the cache directly cut into the bandwidth required by the pipeline, slowing overall progress by more than the gains obtained from accessing data through \gls{hbm}. This made it necessary to distribute data over multiple \glsentrylong{numa:node}s so that caching does not reduce the bandwidth available to the pipelines. Existing applications developed for \gls{numa} systems can be assumed to already perform this distribution in order to utilize the system's available memory bandwidth, so we do not consider this a major fault of the \texttt{Cache}. \par
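The distribution described above can be sketched as a simple chunk-to-node placement. This is only an illustrative mapping, assuming round-robin placement across nodes; actual allocation on a given node would use a \gls{numa}-aware allocator such as libnuma's \texttt{numa\_alloc\_onnode}, which we omit here for portability.

```cpp
#include <cstddef>
#include <vector>

// Sketch: assign column chunks round-robin to NUMA nodes, so that scan
// pipelines and parallel caching operations draw on different memory
// controllers instead of competing for one node's bandwidth.
// Returns, for each chunk index, the node it should be placed on.
std::vector<int> distribute_chunks(std::size_t chunk_count, int node_count) {
    std::vector<int> placement(chunk_count);
    for (std::size_t i = 0; i < chunk_count; ++i)
        placement[i] = static_cast<int>(i % static_cast<std::size_t>(node_count));
    return placement;
}
```

With the placement computed, each chunk would then be allocated on its assigned node before the pipelines start.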
As noted in Sections \ref{sec:state:dml} and \ref{sec:impl:application}, the \gls{api} used to interact with the \gls{dsa} does not currently support interrupt-based completion waiting or the use of \glsentrylong{dsa:dwq}. Further development may therefore focus on accessing the \gls{dsa} directly, forgoing the \glsentrylong{intel:dml}, in order to take advantage of the complete feature set. Interrupt-based waiting in particular would greatly enhance the usability of the \texttt{Cache}, which in its current state only allows busy-waiting. We judge this to be unacceptable and therefore opted to implement weak access and waiting in Section \ref{sec:impl:application}. \par
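The difference between the two waiting strategies can be illustrated with a small sketch. The completion flag below stands in for a \gls{dsa} completion record, and the function names are our own; this is not the thesis's actual implementation.

```cpp
#include <atomic>
#include <thread>

// Strict busy-wait: spins until the operation completes, occupying a
// core for the whole duration. This is what the current API forces.
bool wait_busy(const std::atomic<bool>& done) {
    while (!done.load(std::memory_order_acquire))
        std::this_thread::yield();
    return true;
}

// "Weak" wait: polls a bounded number of times and gives up, letting
// the caller fall back to the original (uncached) data instead of
// blocking. Interrupt-based waiting would remove the need for either.
bool wait_weak(const std::atomic<bool>& done, int max_polls = 1) {
    for (int i = 0; i < max_polls; ++i)
        if (done.load(std::memory_order_acquire))
            return true;   // cached copy is ready
    return false;          // not ready yet; caller uses source data
}
```

The weak variant trades a possible cache miss for never stalling the pipeline, which matches the access semantics chosen in Section \ref{sec:impl:application}.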
The previous paragraphs and the results in Chapter \ref{chap:evaluation} could lead the reader to conclude that the \texttt{Cache} requires extensive work before it is usable in production applications. We posit the opposite. Under favourable conditions, which, as previously mentioned, may be assumed for \glsentryshort{numa}-aware applications, we observed noticeable speed-up by using the \texttt{Cache} to prefetch to \glsentrylong{hbm}, thereby accelerating database queries. Its use is not limited to prefetching, however: it also presents a solution for handling transfers from remote memory to faster local storage in the background, or for replicating data to Non-Volatile RAM. Evaluating its effectiveness in these scenarios and performing further benchmarks on more complex queries for \gls{qdp} might yield more concrete insights into these diverse fields of application. \par
In conclusion, the \texttt{Cache}, together with the sections on the \gls{dsa}'s architecture (\ref{sec:state:dsa}) and performance characteristics (\ref{sec:perf:bench}), fulfils the stated goal of this work. We have achieved performance gains through the \gls{dsa} in \gls{qdp}, thereby demonstrating its potential to facilitate the exploitation of the properties offered by the various storage tiers in heterogeneous memory systems. \par
%%% Local Variables:
%%% TeX-master: "diplom"
%%% End:
