% Bachelor's thesis: Data Movement in Heterogeneous Memories with Intel Data Streaming Accelerator
\chapter{Conclusion And Outlook}
\label{chap:conclusion}
% Conclusions, questions, outlook
% This chapter is certainly the most difficult one to write. It serves
% as a condensed summary of what has been learned. It may be peppered
% with back-references into the text to give the lazy but interested
% reader (the usual case) another chance to educate themselves more
% thoroughly. Some good theses raise more problems than they solve.
% One may freely admit and discuss this. If applicable, one can also
% write what one still intends to do on the matter, or give successors
% a few tips. But one should not force questions that are not actually
% there and suggest to the reader how far-sighted one is. This chapter
% must be short so that it gets read. Should also include "Future Work".
With this work, we set out to analyse the architecture and performance of the \glsentrylong{dsa} and to integrate it into \glsentrylong{qdp}. In Section \ref{sec:state:dsa} we characterized its hardware and software architecture, followed by an overview of an available programming interface, the \glsentrylong{intel:dml}, in Section \ref{sec:state:dml}. Our benchmarks were directed at the planned application and therefore covered copy performance from \glsentryshort{dram} to \gls{hbm} (Section \ref{subsec:perf:datacopy}), the cost of multithreaded work submission (Section \ref{subsec:perf:mtsubmit}) and an analysis of different submission methods and sizes (Section \ref{subsec:perf:submitmethod}). We discovered an anomaly in inter-socket copy speeds and found throughput scaling to be distinctly sublinear (Figure \ref{fig:perf-dsa-analysis:scaling}). Even though not all observations could be explained, the results give important insights into the behaviour of the \gls{dsa} and how it can be used in applications, complementing the existing analysis \cite{intel:analysis} with detail on multi-socket systems and \gls{hbm}. \par
Applying the cache developed over Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}, we encountered challenges concerning the available memory bandwidth and the lack of feature support in the \glsentryshort{api} used to interact with the \gls{dsa}. As the \texttt{Cache} is a substantial contribution of this work, we review both its shortcomings and its effectiveness below. \par
In its current state, the applicability of the \texttt{Cache} is limited to data that is mutated infrequently. Even though support for entry invalidation exists, no ordering guarantees are given and invalidation must be performed manually. This places the burden of ensuring result validity after invalidation on the programmer, as threads may still hold pointers to now-invalid data. As a remedy, a custom type could be developed that automatically invalidates itself through the cache upon modification. A timestamp or another form of unique identifier in this data type could be used to tag results, enabling checks of source-data validity. \par
In Section \ref{sec:eval:observations} we observed negative effects when prefetching with the cache during parallel execution of memory-bound operations. For the simple query benchmarked (see Section \ref{sec:state:qdp}), these effects were pronounced: copy operations into the cache directly cut into the bandwidth required by the pipeline, slowing overall progress by more than the gains of accessing data through \gls{hbm}. Avoiding this required distributing data over multiple \glsentrylong{numa:node}s so that caching does not reduce the bandwidth available to the pipelines. Existing applications developed for \gls{numa} systems can be assumed to already perform this distribution in order to utilize the system's full memory bandwidth, so we do not consider this a major fault of the \texttt{Cache}. \par
As noted in Sections \ref{sec:state:dml} and \ref{sec:impl:application}, the \gls{api} used to interact with the \gls{dsa} does not currently support interrupt-based completion waiting or the use of a \glsentrylong{dsa:dwq}. Further development may therefore access the \gls{dsa} directly, foregoing the \glsentrylong{intel:dml}, in order to take advantage of the complete feature set. Interrupt-based waiting in particular would greatly enhance the usability of the \texttt{Cache}, which currently only allows busy-waiting. As we judge busy-waiting to be unacceptable, we opted to implement weak access and waiting in Section \ref{sec:impl:application}. \par
The previous paragraphs and the results in Chapter \ref{chap:evaluation} could lead the reader to conclude that the \texttt{Cache} requires extensive work before it is usable in production applications. We posit the opposite. Under favourable conditions, which, as previously mentioned, may be assumed for \glsentryshort{numa}-aware applications, we observed a noticeable speed-up from using the \texttt{Cache} to prefetch into \glsentrylong{hbm} and thereby accelerate database queries. Its use is not limited to prefetching, however: it also presents a solution for transferring data from remote memory to faster local storage in the background, or for replicating data to Non-Volatile RAM. Evaluating its effectiveness in these scenarios and benchmarking more complex \gls{qdp} queries might yield more concrete insights into these diverse fields of application. \par
Finally, the \texttt{Cache}, together with the sections on the \gls{dsa}'s architecture (\ref{sec:state:dsa}) and its performance characteristics (\ref{sec:perf:bench}), completes the stated goal of this work. We achieved performance gains through the \gls{dsa} in \gls{qdp} and thereby demonstrated its potential to aid in exploiting the properties offered by the multitude of storage tiers in heterogeneous memory systems. \par
%%% Local Variables:
%%% TeX-master: "diplom"
%%% End: