From eae4e5846d217e69e9bf9ad5a97281ac6aaff74c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Constantin=20F=C3=BCrst?=
Date: Fri, 9 Feb 2024 23:58:30 +0100
Subject: [PATCH] add even more todos

---
 thesis/bachelor.tex                  | 3 +++
 thesis/content/50_implementation.tex | 2 +-
 thesis/content/60_evaluation.tex     | 3 ++-
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/thesis/bachelor.tex b/thesis/bachelor.tex
index abf4429..292accc 100644
--- a/thesis/bachelor.tex
+++ b/thesis/bachelor.tex
@@ -99,8 +99,11 @@ plainpages=false,pdfpagelabels=true]{hyperref}
 \setglossarystyle{altlistgroup}
 \printglossaries
 
+\todo{write glossary entries}
+\cleardoublepage
 
 \printbibliography
+
 \iffalse
 % an aid for Kile autocompletion
 \bibliography{own.bib}
diff --git a/thesis/content/50_implementation.tex b/thesis/content/50_implementation.tex
index b2621dc..28061ea 100644
--- a/thesis/content/50_implementation.tex
+++ b/thesis/content/50_implementation.tex
@@ -106,7 +106,7 @@ This secondary value allows \(T_2\) to pass the wait, then perform the exchange
 \section{Application to \glsentrylong{qdp}}
 \label{sec:impl:application}
 
-Applying the \texttt{Cache} to \gls{qdp} is straightforward. We adapted the benchmarking code developed by Anna Bartuschka and André Berthold \cite{dimes-prefetching}, invoking \texttt{Cache::Access} for both prefetching and cache access. Due to the large number of small submissions, we decided to forgo splitting tasks across multiple \gls{dsa} and instead distribute each thread's copy tasks round-robin over all available ones. This reduces the delay incurred by submission cost, which, as shown in Section \ref{subsec:perf:submitmethod}, rises for smaller tasks. The cache location is fixed to \gls{numa:node} 8, the \gls{hbm} accessor of \gls{numa:node} 0, to which the application is bound and on which it therefore runs exclusively. \par
+Applying the \texttt{Cache} to \gls{qdp} is straightforward. We adapted the benchmarking code developed by Anna Bartuschka and André Berthold \cite{dimes-prefetching}, invoking \texttt{Cache::Access} for both prefetching and cache access. Due to the large number of small submissions, we decided to forgo splitting tasks across multiple \gls{dsa} and instead distribute each thread's copy tasks round-robin over all available ones \todo{this is false, we used only the 4 on-socket ones, update this or redo the subsequent benchmarks}. This reduces the delay incurred by submission cost, which, as shown in Section \ref{subsec:perf:submitmethod}, rises for smaller tasks. The cache location is fixed to \gls{numa:node} 8, the \gls{hbm} accessor of \gls{numa:node} 0, to which the application is bound and on which it therefore runs exclusively. \par
 
 During the performance analysis of the developed \texttt{Cache}, we discovered that \gls{intel:dml} does not utilize interrupt-based completion signaling (Section \ref{subsubsec:state:completion-signal}), but instead busy-waits on the completion descriptor being updated. As this busy-waiting consumes CPU cycles, blocking on task completion is impractical, necessitating code modifications. We extended \texttt{CacheData} and \texttt{Cache} to support weak waiting. By introducing a flag configurable in \texttt{Cache}, all instances of \texttt{CacheData} created via \texttt{Cache::Access} will check only once whether the \gls{dsa} has completed processing the \texttt{Cache} operation and, if not, return without updating the cache pointer.
 Consequently, calls to \texttt{CacheData::GetDataLocation} may return \texttt{nullptr} even after waiting, placing the responsibility on the user to access the data through its source location. For applications prioritizing latency, \texttt{Cache::Access} offers the option of weak access. When activated, the function returns only existing instances of \texttt{CacheData}, thereby avoiding work submission to the \gls{dsa} if the address has not been previously cached or was flushed since the last access. Using these two options, we can avoid work submission and busy waiting where access latency is paramount. \par
diff --git a/thesis/content/60_evaluation.tex b/thesis/content/60_evaluation.tex
index 803118b..b9f50c3 100644
--- a/thesis/content/60_evaluation.tex
+++ b/thesis/content/60_evaluation.tex
@@ -13,6 +13,7 @@ In this chapter we will define our expectations, applying the developed Cache
 to \glsentrylong{qdp}, and then evaluate the observed results. The code used is described in more detail in Section \ref{sec:impl:application}. \par
 
 \todo{improve the sections in this chapter}
+\todo{qdp bench code for now uses only the 4 on-node dsas, check (given time) whether round-robinning to all available makes a positive difference}
 
 \section{Benchmarked Task}
 \label{sec:eval:bench}
@@ -61,7 +62,7 @@ From Table \ref{table:qdp-baseline} it is obvious, that accessing column \texttt
 Due to the higher bandwidth that \gls{hbm} provides to \(AGGREGATE\), the CPU waits less on data from main memory, improving effective processing times. This is evident from the shorter overall time of \(AGGREGATE\) in Figure \ref{fig:timing-comparison:upplimit}, compared to the baseline in Figure \ref{fig:timing-comparison:baseline}. Therefore, more threads were tasked with \(SCAN_a\), which still accesses data from \glsentryshort{dram}, and fewer performed \(AGGREGATE\), compared to the scenario with the data wholly in \glsentryshort{dram}. That is why this placement outperforms \gls{dram} not only for \(AGGREGATE\) but also for \(SCAN_a\). The raw timing values for both can be found in the legends of Figure \ref{fig:timing-comparison}. \par
 
-With these comparison values in hand, we will now evaluate the results of enabling prefetching. \par
+With these comparison values in hand, we will now evaluate the results of enabling prefetching. \todo{this transition is not good} \par
 
 \begin{table}[h!tb]
 \centering
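
The weak-wait mechanism that the 50_implementation.tex hunk discusses can be illustrated with a short C++ sketch. This is a minimal sketch under stated assumptions, not the thesis implementation: only CacheData and CacheData::GetDataLocation are named in the text; WaitOnCompletionWeak, CheckDsaCompletedOnce, completed_, buffer_ and cache_ptr_ are placeholder names.

#include <atomic>
#include <cstdint>

// Sketch of the weak wait described above: check completion once instead
// of busy-waiting, and leave the cache pointer untouched when the DSA has
// not finished, so GetDataLocation() keeps returning nullptr.
class CacheData {
public:
    // Location of the cached copy, or nullptr while the copy is incomplete.
    uint8_t* GetDataLocation() const {
        return cache_ptr_.load(std::memory_order_acquire);
    }

    // Weak wait: a single completion check, no busy loop.
    void WaitOnCompletionWeak() {
        if (CheckDsaCompletedOnce()) {
            cache_ptr_.store(buffer_, std::memory_order_release);
        }
        // not yet complete: return without updating the cache pointer
    }

private:
    // Stand-in for a single query of the DML operation state.
    bool CheckDsaCompletedOnce() {
        return completed_.load(std::memory_order_acquire);
    }

    std::atomic<bool> completed_{false};  // set by the copy path on completion
    uint8_t* buffer_ = nullptr;           // destination buffer on the cache node
    std::atomic<uint8_t*> cache_ptr_{nullptr};
};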
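
Combining weak wait with weak access gives the latency-focused pattern the context paragraph ends on. Again only a sketch building on the CacheData sketch above: the Cache type, the return type of Cache::Access and the flag name FLAG_ACCESS_WEAK are assumptions, as the text names only Cache::Access itself.

#include <cstddef>
#include <cstdint>
#include <memory>

// Assumed interface for the sketch; FLAG_ACCESS_WEAK is an invented
// placeholder for the weak-access option.
constexpr int FLAG_ACCESS_WEAK = 1;

struct Cache {
    std::unique_ptr<CacheData> Access(uint8_t* src, size_t size, int flags);
};

// Latency-critical lookup: weak access returns only already-cached entries
// (no work submission), weak wait checks completion once (no busy wait),
// and the caller falls back to the source location otherwise.
uint8_t* ResolveLocation(Cache& cache, uint8_t* src, size_t size) {
    if (auto entry = cache.Access(src, size, FLAG_ACCESS_WEAK)) {
        entry->WaitOnCompletionWeak();
        if (uint8_t* cached = entry->GetDataLocation()) {
            return cached;  // serve the request from the cache location
        }
    }
    return src;  // not cached, flushed, or copy still in flight
}

This keeps the hot path free of both submission cost and polling, at the price of occasionally reading from the slower source location, which is the trade-off the paragraph describes.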