\section{Application to \glsentrylong{qdp}}
\label{sec:impl:application}
Applying the \texttt{Cache} to \gls{qdp} is a straightforward process. We adapted the benchmarking code developed by Anna Bartuschka and André Berthold \cite{dimes-prefetching}, invoking \texttt{Cache::Access} for both prefetching and cache access. Due to the high number of small submissions, we decided to forgo splitting tasks across multiple \gls{dsa} and instead distribute the copy tasks per thread in round-robin fashion. This reduces the delay incurred by submission cost, which, as shown in Section \ref{subsec:perf:submitmethod}, rises for smaller tasks. The cache location is fixed to \gls{numa:node} 8, the \gls{hbm} node associated with \gls{numa:node} 0, to which the application is bound and on which it therefore runs exclusively. \par
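The resulting interaction with the \texttt{Cache} in the benchmark follows the pattern sketched below. Only \texttt{Cache::Access} and \texttt{CacheData::GetDataLocation} stem from our interface; the cache instance \texttt{cache}, the helpers \texttt{chunk\_ptr} and \texttt{scan}, the values \texttt{chunk\_count} and \texttt{chunk\_size}, and the assumption that \texttt{Access} returns a smart pointer to \texttt{CacheData} whose \texttt{GetDataLocation} yields a \texttt{uint8\_t*} are illustrative placeholders. \par

\begin{lstlisting}[language=C++]
for (size_t i = 0; i < chunk_count; i++) {
    // prefetch the next chunk so that the DSA may copy it to HBM
    // in the background while the current chunk is being processed
    if (i + 1 < chunk_count) {
        cache.Access(chunk_ptr(i + 1), chunk_size);
    }

    // access the current chunk, ideally already resident in the cache
    auto data = cache.Access(chunk_ptr(i), chunk_size);
    uint8_t* location = data->GetDataLocation();

    // fall back to the source location when no cached copy is available
    scan(location != nullptr ? location : chunk_ptr(i), chunk_size);
}
\end{lstlisting}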
During the performance analysis of the developed \texttt{Cache}, we discovered that \gls{intel:dml} does not utilize interrupt-based completion signalling (Section \ref{subsubsec:state:completion-signal}), but instead busy-waits on the completion descriptor being updated. Given that this busy waiting burns CPU cycles, waiting on task completion is deemed impractical, necessitating code modifications. We extended \texttt{CacheData} and \texttt{Cache} to incorporate support for weak waiting. By introducing a flag configurable in \texttt{Cache}, all instances of \texttt{CacheData} created via \texttt{Cache::Access} will check only once whether the \gls{dsa} has completed processing the caching operation and, if not, return without updating the cache pointer. Consequently, calls to \texttt{CacheData::GetDataLocation} may return \texttt{nullptr} even after waiting, placing the responsibility on the user to access the data through its source location. For applications prioritizing latency, \texttt{Cache::Access} offers the option of weak access. When activated, the function returns only existing instances of \texttt{CacheData}, thereby avoiding work submission to the \gls{dsa} if the address has not been previously cached or was flushed since the last access. Using these two options, we can avoid work submission and busy waiting where access latency is paramount. \par
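Combining weak waiting and weak access, a latency-critical code path may then be structured as sketched below. The configuration call \texttt{SetWeakWait}, the wait call \texttt{WaitOnCompletion} and the final parameter of \texttt{Cache::Access} enabling weak access are illustrative names; only their semantics follow the description above. \par

\begin{lstlisting}[language=C++]
cache.SetWeakWait(true); // illustrative: have CacheData check only once

// weak access: return an entry only if one already exists in the cache,
// never submitting new work to the DSA
auto data = cache.Access(chunk_ptr(i), chunk_size, /*weak_access=*/true);

uint8_t* location = nullptr;
if (data) {
    data->WaitOnCompletion();           // checks once instead of busy-waiting
    location = data->GetDataLocation(); // nullptr if the DSA is not yet done
}

// in either miss case, read from the source location instead
scan(location != nullptr ? location : chunk_ptr(i), chunk_size);
\end{lstlisting}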
Additionally, we observed inefficiencies stemming from page fault handling. Task execution time increases when page faults are handled by the \gls{dsa}, leading to cache misses. Consequently, our execution time becomes bound to that of \gls{dram}, as misses prompt a fallback to the data's source location. When page faults are instead handled by the CPU during allocation, these misses are avoided; however, the execution time of the first data access through the \texttt{Cache} increases significantly due to the page fault handling. One potential solution is to bypass the system's memory management by allocating one large memory block and implementing a custom memory management scheme on top of it. As memory allocation is a complex topic, we opted to delegate this responsibility to the user by mandating the provision of new- and free-like functions, akin to the policy functions utilized for determining placement and task distribution. Consequently, the benchmark can pre-allocate the required memory blocks, trigger page mapping, and subsequently pass these regions to the \texttt{Cache}. With this, we also remove the dependency on \texttt{libnuma}, mentioned as a restriction in Section \ref{subsec:design:restrictions}. \par
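A minimal sketch of such user-provided functions is given below, assuming a simple bump allocator over one large block mapped with \texttt{MAP\_POPULATE}, which triggers page mapping at allocation time so that the \gls{dsa} never encounters a page fault when writing to cache memory. The pool type, the function names and the way they are handed to the \texttt{Cache} are illustrative; node placement is assumed to be handled externally, for example by launching the benchmark under \texttt{numactl}. \par

\begin{lstlisting}[language=C++]
#include <cstddef>
#include <sys/mman.h>

// illustrative pre-faulted pool: MAP_POPULATE maps all pages up front
struct PreFaultedPool {
    char*  base;
    size_t used = 0;

    explicit PreFaultedPool(const size_t bytes) {
        base = static_cast<char*>(
            mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0));
    }

    void* allocate(const size_t size) {
        void* const p = base + used;
        used += size;
        return p;
    }
};

static PreFaultedPool pool(1UL << 30); // sized for the benchmark data

void* CacheAlloc(const size_t size) { return pool.allocate(size); }
void  CacheFree(void*) { /* freed in bulk when the pool is destroyed */ }

// illustrative: handed to the cache like the existing policy functions
cache.Init(CachePlacementPolicy, CacheCopyPolicy, CacheAlloc, CacheFree);
\end{lstlisting}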