From 58f36279ecc810528b1aabde5bad5858ad13cd57 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Constantin=20F=C3=BCrst?= <c@fuersten.info>
Date: Mon, 5 Feb 2024 16:32:01 +0100
Subject: [PATCH] reformulate implementation chapter - our choice has changed
 and we now use push-pull and not smart-copy anymore for load balancer

---
 thesis/content/50_implementation.tex | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/thesis/content/50_implementation.tex b/thesis/content/50_implementation.tex
index 79a5974..41186dd 100644
--- a/thesis/content/50_implementation.tex
+++ b/thesis/content/50_implementation.tex
@@ -101,7 +101,7 @@ This secondary value allows \(T_2\) to pass the wait, then perform the exchange
 
 \section{Accelerator Usage}
 
-After \ref{sec:design:accel-usage} the implementation of \texttt{Cache} provided leaves it up to the user to choose a caching and copy method policy which is accomplished through submitting function pointers at initialization of the \texttt{Cache}. In \ref{sec:state:setup-and-config} we configured our system to have separate \gls{numa:node}s for accessing \gls{hbm} which are assigned a \gls{numa:node}-ID by adding eight to the \gls{numa:node}s ID of the \gls{numa:node} that physically contains the \gls{hbm}. Therefore, given \gls{numa:node} 3 accesses some datum, the most efficient placement for the copy would be on \gls{numa:node} \(3 + 8 = 11\). As the \texttt{Cache} is intended for multithreaded usage, conserving accelerator resources is important, so that concurrent cache requests complete quickly. To get high per-copy performance while maintaining low usage, the smart-copy method is selected as described in \ref{subsec:perf:datacopy} for larger copies, while small copies will be handled exclusively by the current node. This distinction is made due to the overhead of assigning the current thread to the selected nodes, which is required as \gls{intel:dml} assigns submissions only to the \gls{dsa} engine present on the node of the calling thread \cite[Section "NUMA support"]{intel:dmldoc}. No testing has taken place to evaluate this overhead and determine the most effective threshold.
+Following Section \ref{sec:design:accel-usage}, the provided implementation of \texttt{Cache} leaves the choice of caching and copy method policy to the user, who submits function pointers when initializing the \texttt{Cache}. In Section \ref{sec:state:setup-and-config} we configured our system to have separate \gls{numa:node}s for accessing \gls{hbm}, which are assigned a \gls{numa:node}-ID by adding eight to the ID of the \gls{numa:node} that physically contains the \gls{hbm}. Therefore, if \gls{numa:node} 3 accesses some datum, the most efficient placement for the copy is on \gls{numa:node} \(3 + 8 = 11\). As the \texttt{Cache} is intended for multithreaded usage, conserving accelerator resources is important so that concurrent cache requests complete quickly. To achieve high per-copy performance while keeping accelerator usage low, the Push-Pull method described in Section \ref{subsec:perf:datacopy} is selected for larger copies, while small copies are handled exclusively by the current node. We distinguish by datum size because, as observed in Section \ref{subsec:perf:submitmethod}, the effect of submission cost is amplified at lower transfer sizes. \par
 
 \section{Application to \glsentrylong{qdp}}