\section{Application to \glsentrylong{qdp}}
\label{sec:impl:application}
Applying the \texttt{Cache} to \gls{qdp} is a straightforward process. We adapted the benchmarking code developed by Anna Bartuschka and André Berthold \cite{dimes-prefetching}, invoking \texttt{Cache::Access} for both prefetching and cache access. Due to the high number of small submissions, we decided to forgo splitting tasks across multiple \gls{dsa} and instead distribute the copy tasks per thread, in round-robin fashion, over the four on-socket \gls{dsa} engines. This reduces the delay from submission cost, which, as shown in Section \ref{subsec:perf:submitmethod}, rises for smaller tasks. The cache location is fixed to \gls{numa:node} 8, the \gls{hbm} accessor of \gls{numa:node} 0, to which the application is bound and on which it therefore runs exclusively. A sketch of this configuration follows below. \par
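The following is a minimal sketch of the described configuration, assuming the \texttt{Cache} accepts policy callbacks for placement and engine selection at initialization; the callback signatures, the engine identifiers, and \texttt{Cache::Init} are illustrative assumptions rather than the definitive interface. \par

\begin{lstlisting}[language=C++]
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Always cache in NUMA node 8, the HBM accessor of node 0,
// regardless of source node or data size.
int CachePolicy(const int numa_dst_node, const int numa_src_node,
                const size_t data_size) {
  return 8;
}

// Hand each copy task as a whole to one engine, cycling round-robin
// over the four on-socket DSA engines; no splitting across engines.
// The engine identifiers used here are assumptions.
std::vector<int> CopyPolicy(const int numa_dst_node, const int numa_src_node,
                            const size_t data_size) {
  static std::atomic<uint64_t> counter{0};
  static const int kOnSocketDsa[] = {0, 2, 4, 6};
  return {kOnSocketDsa[counter.fetch_add(1) % 4]};
}
\end{lstlisting}

With these callbacks registered at initialization, for example via \texttt{cache.Init(CachePolicy, CopyPolicy)}, every call to \texttt{Cache::Access}, whether for prefetching or for plain cache access, follows the described placement and distribution. \par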
During the performance analysis of the developed \texttt{Cache}, we discovered that \gls{intel:dml} does not utilize interrupt-based completion signaling (Section \ref{subsubsec:state:completion-signal}), but instead busy-waits on the completion descriptor being updated. As this busy waiting consumes CPU cycles, waiting on task completion is impractical, necessitating code modifications. We extended \texttt{CacheData} and \texttt{Cache} to support weak waiting. With a flag configurable in \texttt{Cache}, all instances of \texttt{CacheData} created via \texttt{Cache::Access} check only once whether the \gls{dsa} has completed processing the operation and, if not, return without updating the cache pointer. Consequently, calls to \texttt{CacheData::GetDataLocation} may return \texttt{nullptr} even after waiting, placing the responsibility on the user to access the data through its source location. For applications prioritizing latency, \texttt{Cache::Access} additionally offers weak access. When activated, the function returns only existing instances of \texttt{CacheData}, thereby avoiding work submission to the \gls{dsa} if the address has not been previously cached or was flushed since the last access. Using these two options, we can avoid work submission and busy waiting where access latency is paramount. \par
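The resulting access pattern for a latency-sensitive path is sketched below. The flag name \texttt{FLAG\_ACCESS\_WEAK}, the method name \texttt{WaitOnCompletion}, and the exact signatures are assumptions for illustration; the essential point is the \texttt{nullptr} check with fallback to the source location. \par

\begin{lstlisting}[language=C++]
#include <cstddef>
#include <cstdint>
#include <memory>

void ProcessChunk(uint8_t* location, size_t size);  // application-side work

void AccessLatencySensitive(Cache& cache, uint8_t* src, size_t size) {
  // Weak access: only returns an already existing CacheData instance
  // and never submits new work to the DSA.
  std::unique_ptr<CacheData> data = cache.Access(src, size, FLAG_ACCESS_WEAK);

  uint8_t* location = nullptr;
  if (data) {
    data->WaitOnCompletion();            // weak wait: checks completion once
    location = data->GetDataLocation();  // nullptr if the copy is unfinished
  }
  if (location == nullptr) {
    location = src;  // fall back to accessing the data at its source
  }
  ProcessChunk(location, size);
}
\end{lstlisting}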
In this chapter, we define our expectations for applying the developed \texttt{Cache} to \glsentrylong{qdp} and subsequently evaluate the observed results. The code used is described in more detail in Section \ref{sec:impl:application}. \par
\section{Benchmarked Task}
\label{sec:eval:bench}
Due to the higher bandwidth that \gls{hbm} provides to \(AGGREGATE\), the CPU waits less on data from main memory, improving effective processing times. This is evident from the overall time of \(AGGREGATE\) shortening in Figure \ref{fig:timing-comparison:upplimit}, compared to the baseline in Figure \ref{fig:timing-comparison:baseline}. Consequently, more threads were tasked with \(SCAN_a\), which still accesses data from \glsentryshort{dram}, and fewer performed \(AGGREGATE\), compared to the scenario with data residing wholly in \glsentryshort{dram}. This is why the \gls{hbm} configuration outperforms \gls{dram} not only for \(AGGREGATE\) but also for \(SCAN_a\). Raw timing values for both are given in the legends of Figure \ref{fig:timing-comparison}. \par
With the baseline and upper limit established as points of comparison, we will now evaluate the results obtained with prefetching enabled. \par