The \texttt{path} parameter allows the selection of the executing device, which can be either the CPU or the \gls{dsa}. Choosing the engine which carries out the copy might be advantageous for performance, as we can see in Subsection \ref{subsec:perf:datacopy}. With the engine directly tied to the processing node, as observed in Subsection \ref{subsection:dsa-hwarch}, the node ID is equivalent to the ID of the \gls{dsa}. \par
\gls{intel:dml} operates on data views, which we create from the given pointers to source and destination and size. This is done using \texttt{dml::make\_view(uint8\_t* ptr, size\_t size)}, visible in Figure \ref{fig:dml-memcpy}, where these views are labelled \texttt{src\_view} and \texttt{dst\_view}. \cite[Sec. High-level C++ API, Make view]{intel:dmldoc}\par
In Figure \ref{fig:dml-memcpy}, we submit a single descriptor using the asynchronous operation from \gls{intel:dml}. This uses the function \texttt{dml::submit<path>}, which takes an operation type and parameters specific to the selected type and returns a handler to the submitted task. For the copy operation, we pass the two views created previously. The provided handler can later be queried for the completion of the operation. After submission, we poll for the task completion with \texttt{handler.get()} and check whether the operation completed successfully. \par
A noteworthy addition to the submission call is the use of \texttt{.block\_on\_fault()}, enabling the \gls{dsa} to handle a page fault by coordinating with the operating system. It is important to note that this functionality only takes effect if the device is configured to accept the flag. \cite[Sec. High-level C++ API, How to Use the Library]{intel:dmldoc}\cite[Sec. High-level C++ API, Page Fault handling]{intel:dmldoc} \par
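To make the described sequence concrete, the following sketch condenses it into a single function. It follows the structure of Figure \ref{fig:dml-memcpy} but is not a reproduction of it: the function name \texttt{CopyWithDsa}, the use of \texttt{dml::mem\_copy} and the fixed choice of the hardware execution path are assumptions made for illustration. \par

\begin{verbatim}
#include <dml/dml.hpp>

// Illustrative only: copies size bytes from src to dst using the
// hardware execution path, i.e. the DSA of the submitting node.
bool CopyWithDsa(uint8_t* dst, uint8_t* src, size_t size) {
    auto src_view = dml::make_view(src, size);
    auto dst_view = dml::make_view(dst, size);

    // submit asynchronously; block_on_fault lets the DSA resolve page
    // faults together with the operating system, if the device allows it
    auto handler = dml::submit<dml::hardware>(
        dml::mem_copy.block_on_fault(), src_view, dst_view);

    // handler.get() polls until the descriptor has completed
    auto result = handler.get();
    return result.status == dml::status_code::ok;
}
\end{verbatim}

Analogous calls with \texttt{dml::software} or \texttt{dml::automatic} as the path select the CPU or let the library choose the executing device, respectively. \par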
\section{System Setup and Configuration}\label{sec:state:setup-and-config}
\subsection{CacheData Atomicity}
The choice made in \ref{subsec:design:cache-entry-reuse} necessitates thread-safe shared access to the same resource. The C++ standard library provides \texttt{std::shared\_ptr<T>}, a reference-counted pointer that is thread-safe for the required operations \cite{cppreference:shared-ptr}, making it a suitable candidate for this task. Although an implementation using it was explored, it presented its own set of challenges. \par
As we aim to minimize the time spent in a locked region, only the task is added to the \gls{numa:node}'s cache state while the lock is held, with the submission taking place outside the locked region. We assume the handlers of \gls{intel:dml} to be unsafe for access from multiple threads. To achieve the safety of \texttt{CacheData::WaitOnCompletion} outlined in \ref{subsec:design:cache-entry-reuse}, threads therefore need to coordinate which one performs the actual waiting. To avoid queuing multiple copies of the same task, the task must be added before submission. This temporarily results in a \texttt{CacheData} instance with an invalid cache pointer and no handlers to wait for, presenting an edge case to be considered. \par
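The coordination problem can be illustrated by the following sketch, in which at most one thread claims the handlers and performs the actual waiting, while all others spin until the result is published. This is merely one possible scheme under simplifying assumptions, not the implementation itself: the \gls{intel:dml} handlers are replaced by generic callables, and the publication of the cache pointer is omitted. \par

\begin{verbatim}
#include <atomic>
#include <cstdint>
#include <functional>
#include <vector>

// Stand-in for the DML handlers: each callable blocks until its task is done.
using WaitFn = std::function<void()>;

struct CacheDataSketch {
    std::atomic<uint8_t*> cache{nullptr};                 // valid only after completion
    std::atomic<std::vector<WaitFn>*> handlers{nullptr};  // null until submission finished

    void WaitOnCompletion() {
        while (cache.load() == nullptr) {
            // try to claim the handlers; at most one thread obtains a non-null pointer
            std::vector<WaitFn>* own = handlers.exchange(nullptr);

            if (own == nullptr) {
                // either submission has not happened yet (the edge case described
                // above) or another thread is already waiting; retry until the
                // cache pointer has been published
                continue;
            }

            for (auto& wait : *own) {
                wait();                                    // this thread performs the waiting
            }

            delete own;
            // ... publish the cache pointer here so that the other threads return
            break;
        }
    }
};
\end{verbatim}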
Using \texttt{std::shared\_ptr<T>} also introduces uncertainty, relying on the implementation to be performant. The standard does not specify whether a lock-free algorithm is to be used, and \cite{shared-ptr-perf} suggests abysmal performance for some implementations, although the full article is in Korean. No further research was found on this topic. \par
Therefore, the decision was made to implement atomic reference counting for \texttt{CacheData}. This involves providing a custom constructor and destructor wherein a shared atomic integer, reachable through a plain pointer, is incremented or decremented using the atomic fetch-add and fetch-sub operations \cite{cppreference:atomic-operations} to modify the reference count. When the count drops to zero, the destructor knows that it was called for the last reference and therefore performs the actual destruction. \par
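A minimal sketch of such a reference count is given below. The member names, the chosen memory orderings and the placement of the actual destruction are illustrative assumptions; only the fetch-add and fetch-sub mechanism itself follows the description above. \par

\begin{verbatim}
#include <atomic>
#include <cstdint>

class RefCountedSketch {
    // shared between all copies of the same entry, reachable via a plain pointer
    std::atomic<int32_t>* active_;

public:
    RefCountedSketch() : active_(new std::atomic<int32_t>(1)) {}

    RefCountedSketch(const RefCountedSketch& other) : active_(other.active_) {
        // register the new reference
        active_->fetch_add(1, std::memory_order_relaxed);
    }

    ~RefCountedSketch() {
        // fetch_sub returns the previous value: 1 means this was the last reference
        if (active_->fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete active_;
            // ... perform the actual destruction of the cached data here
        }
    }
};
\end{verbatim}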