In the function header of Figure \ref{fig:dml-memcpy} we notice two differences compared with the standard memcpy. The first is the template parameter \texttt{path}, the second the additional parameter \texttt{int node}, which we will discuss later. With \texttt{path}, the executing device is selected: \texttt{dml::software} uses the CPU, \texttt{dml::hardware} uses the \gls{dsa}, and \texttt{dml::automatic} dynamically selects the device at runtime, preferring \gls{dsa} over CPU execution. \cite[Sec. Quick Start]{intel:dmldoc}\par
Choosing the engine which carries out the copy can be advantageous for performance, as we will see in Subsection \ref{subsec:perf:datacopy}. Since each engine is directly tied to a CPU node, as observed in Subsection \ref{subsection:dsa-hwarch}, the CPU node ID is equivalent to the ID of the \gls{dsa}. The library has limited NUMA support and only utilizes the \gls{dsa} device on the node to which the current thread is assigned; we must therefore bind the calling thread to the node on which the desired \gls{dsa} resides. This is the reason for adding the parameter \texttt{int node}, which is used in the first step of Figure \ref{fig:dml-memcpy}: we manually set the node assignment accordingly using \texttt{numa\_run\_on\_node(node)}, for which more information may be found in the respective manpage of libnuma \cite{man-libnuma}. \par
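A minimal sketch of this node-pinning step, assuming only the documented libnuma calls \texttt{numa\_available} and \texttt{numa\_run\_on\_node} \cite{man-libnuma}, might look as follows; the function name \texttt{pin\_to\_node} and the error handling are our own illustration:

\begin{lstlisting}[language=C++]
#include <numa.h>       // libnuma: numa_available(), numa_run_on_node()
#include <stdexcept>

// Bind the calling thread to the given NUMA node so that subsequent
// dml::hardware submissions are served by the DSA engine on that node.
void pin_to_node(const int node) {
    if (numa_available() < 0) {
        throw std::runtime_error("NUMA is not available on this system");
    }
    if (numa_run_on_node(node) != 0) {
        throw std::runtime_error("failed to bind thread to requested node");
    }
}
\end{lstlisting}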
\gls{intel:dml} operates on so-called data views, which we must create from the given pointers and size in order to indicate the data locations to the library. This is done using \texttt{dml::make\_view(uint8\_t* ptr, size\_t size)}, with which we create views for both source and destination, labeled \texttt{src\_view} and \texttt{dst\_view} in Figure \ref{fig:dml-memcpy}. \cite[Sec. High-level C++ API, Make view]{intel:dmldoc}\par
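Combining these steps, a hedged sketch of a wrapper in the spirit of Figure \ref{fig:dml-memcpy} could read as follows; the \texttt{dml::...} identifiers follow the documented high-level C++ API \cite[Sec. Quick Start]{intel:dmldoc}, while the function name, the choice of \texttt{dml::mem\_move} and the error handling are illustrative:

\begin{lstlisting}[language=C++]
#include <dml/dml.hpp>  // Intel DML high-level C++ API
#include <numa.h>       // numa_run_on_node()
#include <cstdint>
#include <cstddef>
#include <stdexcept>

// Copy size bytes from src to dst, executed by the device selected
// through path (dml::software, dml::hardware or dml::automatic) and,
// for hardware execution, by the DSA residing on the given node.
template <typename path>
void dml_memcpy(uint8_t* const dst, uint8_t* const src,
                const size_t size, const int node) {
    // Step 1: pin the calling thread so the submission reaches the
    // DSA engine on the desired node (see the libnuma sketch above).
    numa_run_on_node(node);

    // Step 2: wrap the raw pointers in data views for the library.
    auto src_view = dml::make_view(src, size);
    auto dst_view = dml::make_view(dst, size);

    // Step 3: submit the copy asynchronously and wait for completion.
    auto handler = dml::submit<path>(dml::mem_move, src_view, dst_view);
    auto result  = handler.get();

    if (result.status != dml::status_code::ok) {
        throw std::runtime_error("DML memory move failed");
    }
}
\end{lstlisting}

A call such as \texttt{dml\_memcpy<dml::hardware>(dst, src, size, 1)} would then execute the copy on the \gls{dsa} of \gls{numa:node} 1.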
In this section we will give a brief step-by-step list of setup instructions to configure the \gls{dsa}.
\item Inspect the now configured \gls{dsa} devices using \texttt{accel-config list} \cite[p. 7]{lenovo:dsa}; the output should match the desired configuration set in the configuration file used previously
\section{Accelerator Usage}
Following \ref{subsec:implementation:accel-usage}, the provided implementation of \texttt{Cache} leaves it up to the user to choose a caching and copy-method policy, which is accomplished by submitting function pointers at initialization of the \texttt{Cache}. In \ref{sec:state:setup-and-config} we configured our system to expose \gls{hbm} through separate \gls{numa:node}s, whose IDs are obtained by adding eight to the ID of the \gls{numa:node} that physically contains the \gls{hbm}. Therefore, given that \gls{numa:node} 3 accesses some datum, the most efficient placement for the cached copy would be on \gls{numa:node} \(3 + 8 = 11\). As the \texttt{Cache} is intended for multithreaded usage, conserving accelerator resources is important so that concurrent cache requests complete quickly. To achieve high per-copy performance while maintaining low resource usage, the smart-copy method described in \ref{subsec:perf:datacopy} is selected for larger copies, while small copies are handled exclusively by the current node. This distinction is made due to the overhead of assigning the current thread to the selected nodes, which is required because \gls{intel:dml} assigns submissions only to the \gls{dsa} engine present on the node of the calling thread \cite[Section "NUMA support"]{intel:dmldoc}. No testing has taken place to evaluate this overhead and determine the most effective threshold.
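To illustrate what such a policy pair could look like, consider the following sketch; the signatures, the size threshold, and the reading of smart-copy as engaging both the current node and the \gls{hbm} node are assumptions made for illustration and do not reproduce the actual interface of \texttt{Cache}:

\begin{lstlisting}[language=C++]
#include <cstddef>
#include <vector>

// Assumed cutoff below which the thread-reassignment overhead is not
// worthwhile; as noted above, this threshold has not been tuned.
constexpr size_t copy_size_threshold = 1 << 20; // 1 MiB, illustrative

// Caching policy: place the cached copy on the NUMA node exposing the
// HBM of the accessing node, which on our system is its ID plus eight.
int cache_placement_node(const int accessing_node) {
    return accessing_node + 8;
}

// Copy-method policy: small copies stay on the current node; larger
// copies use the smart-copy placement evaluated in the performance
// section (assumed here to also engage the HBM node's engine).
std::vector<int> copy_execution_nodes(const int current_node,
                                      const size_t size) {
    if (size < copy_size_threshold) {
        return { current_node };
    }
    return { current_node, cache_placement_node(current_node) };
}
\end{lstlisting}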