Browse Source

work all todos for chapters 6 and 7

master
Constantin Fürst 10 months ago
parent
commit
3f48127442
  1. BIN
      thesis/bachelor.pdf
  2. 20
      thesis/content/60_evaluation.tex
  3. 2
      thesis/content/70_conclusion.tex

BIN
thesis/bachelor.pdf

20
thesis/content/60_evaluation.tex

@ -10,16 +10,14 @@
% bezüglich der Ergebnisse zu erläutern und anschließend eventuell % bezüglich der Ergebnisse zu erläutern und anschließend eventuell
% festgestellte Abweichungen zu erklären. % festgestellte Abweichungen zu erklären.
In this chapter, we establish anticipated outcomes for incorporating the developed Cache into \glsentrylong{qdp}, followed by a comprehensive assessment of the achieved results. The specifics of the benchmark are elaborated upon in Section \ref{sec:impl:application}. We conclude with a discussion of the choices made regarding the benchmarking methodology and their influence on our results. \todo{extend last sentence with what we actually do in 6.4} \par
In this chapter, we establish anticipated outcomes for incorporating the developed \texttt{Cache} into \glsentrylong{qdp}, followed by a comprehensive assessment of the achieved results. The specifics of the benchmark are elaborated upon in Section \ref{sec:impl:application}. We conclude with a discussion of the \texttt{Cache} and the results observed. \par
\section{Benchmarked Task} \section{Benchmarked Task}
\label{sec:eval:bench} \label{sec:eval:bench}
The benchmark involves the execution of a simple query, as depicted in Figure \ref{fig:qdp-simple-query}. We will henceforth denote \(SCAN_a\) as the pipeline responsible for scanning and subsequently filtering column \texttt{a}, \(SCAN_b\) as the pipeline tasked with prefetching column \texttt{b} and \(AGGREGATE\) as the projection and final summation step. The column size utilized is set at 4 GiB. The workload is distributed across multiple groups, with each group spawning threads for every pipeline step. To ensure equitable comparison, each tested configuration employs 64 threads for the initial stage (\(SCAN_a\) and \(SCAN_b\)) and 32 subsequently (\(AGGREGATE\)), while being constrained to execute on \gls{numa:node} 0 through pinning. For configurations without prefetching, \(SCAN_b\) is omitted. We measure total and per-pipeline duration and cache hit percentage for prefetching. \par
The benchmark involves the execution of a simple query, as depicted in Figure \ref{fig:qdp-simple-query}. We will henceforth denote \(SCAN_a\) as the pipeline responsible for scanning and subsequently filtering column \texttt{a}, \(SCAN_b\) as the pipeline tasked with prefetching column \texttt{b} and \(AGGREGATE\) as the projection and final summation step. The column size utilized is set at 4 GiB. The workload is distributed across multiple groups, with each group spawning threads for every pipeline step. To ensure equitable comparison, each tested configuration employs 64 threads for the initial stage (\(SCAN_a\) and \(SCAN_b\)) and 32 subsequently (\(AGGREGATE\)), while being constrained to execute on \gls{numa:node} 0 through pinning. For configurations without prefetching, \(SCAN_b\) is omitted. We measure total and per-pipeline duration and cache hit percentage for prefetching for 5 iterations with 5 previous warm-up runs, and form the average. \par
The pipelines \(SCAN_a\) and \(SCAN_b\) execute concurrently, completing their tasks before signalling \(AGGREGATE\) for finalization. In a bid to enhance the cache hit rate, we opted to relax this constraint, allowing \(SCAN_b\) to operate independently, while only synchronizing \(SCAN_a\) with \(AGGREGATE\). Consequently, work is submitted to the \gls{dsa} as frequently as possible, aiming to complete caching operations for a chunk of \texttt{b} before \(SCAN_a\) finalizes processing the corresponding part of \texttt{a}. \todo{mention that this has one of two side effects: queue overflow or requiring larger task sizes to prevent the overflow} \par
\todo{possibly describe iteration count, warmup and averaging over iterations here}
The pipelines \(SCAN_a\) and \(SCAN_b\) execute concurrently, completing their tasks before signalling \(AGGREGATE\) for finalization. In a bid to enhance the cache hit rate, we opted to relax this constraint, allowing \(SCAN_b\) to operate independently, while only synchronizing \(SCAN_a\) with \(AGGREGATE\). Consequently, work is submitted to the \gls{dsa} as frequently as possible, aiming to complete caching operations for a chunk of \texttt{b} before \(SCAN_a\) finalizes processing the corresponding part of \texttt{a}. This burst-submission could cause the \gls{dsa}s work queue to overrun, leading us to increase the size of the blocks for the benchmarks utilizing \gls{qdp} to avoid this. \par
\section{Expectations} \section{Expectations}
\label{sec:eval:expectations} \label{sec:eval:expectations}
@ -29,9 +27,7 @@ The simple query presents a challenging scenario for the \texttt{Cache}. The exe
\section{Observations} \section{Observations}
\label{sec:eval:observations} \label{sec:eval:observations}
In this section, we will present our findings from integrating the \texttt{Cache} developed in Chapters \ref{chap:design} and \ref{chap:implementation} into \gls{qdp}. We commence by presenting results obtained without prefetching, which serve as a reference for evaluating the effectiveness of our \texttt{Cache}. For all results presented, the amount of threads per pipeline and the amount of groups influence performance \cite{dimes-prefetching}, which however is not inside the scope of this work. Therefore, we only present the best benchmarked results for each configuration. \par
\todo{wording of last sentence is a bit odd}
In this section, we will present our findings from integrating the \texttt{Cache} developed in Chapters \ref{chap:design} and \ref{chap:implementation} into \gls{qdp}. We commence by presenting results obtained without prefetching, which serve as a reference for evaluating the effectiveness of our \texttt{Cache}. For all results presented, the amount of threads per pipeline and the amount of groups influences performance \cite{dimes-prefetching}, which however is out of scope for this work. Therefore, results shown are for the best configurations measured. \par
\subsection{Benchmarks without Prefetching} \subsection{Benchmarks without Prefetching}
@ -80,9 +76,7 @@ To address the challenges posed by sharing memory bandwidth between both \(SCAN\
\label{table:qdp-speedup} \label{table:qdp-speedup}
\end{table} \end{table}
The slowdown below our baseline when utilizing the \texttt{Cache} may be surprising at first glance. However, this result becomes reasonable when we consider that in this scenario, the \gls{dsa}s executing the caching tasks compete for bandwidth with the \(SCAN_a\) pipeline threads, and there is additional overhead from the \texttt{Cache}. Distributing the columns across different \gls{numa:node}s then results in a noticeable performance increase compared to our baseline, although it does not reach the upper boundary set by simulating perfect prefetching. This confirms our assumption that the \(SCAN_a\) pipeline itself is bandwidth-bound, as without this contention, we observe an increase in cache hit rate and decrease in processing time. We will now examine the performance in more detail with per-pipeline timings. \par
\todo{add reference to the table for this paragraph}
We now examine Table \ref{table:qdp-speedup}, where a slowdown is shown for prefetching. This drop-off below our baseline when utilizing the \texttt{Cache} may be surprising at first glance. However, it becomes reasonable when we consider that in this scenario, the \gls{dsa}s executing the caching tasks compete for bandwidth with the \(SCAN_a\) pipeline threads, while also adding additional overhead from the \texttt{Cache} and work submission. The second measured configuration for \gls{qdp} is shown as \enquote{Prefetching, Distributed Columns} in Table \ref{table:qdp-speedup}. For this method, distributing the columns across different \gls{numa:node}s results in a noticeable performance increase compared to our baseline, although not reaching the upper boundary set by simulating perfect prefetching (called \enquote{HBM (Upper Limit)} in the same Table). This confirms our assumption that the \(SCAN_a\) pipeline itself is bandwidth-bound, as without this contention, we observe an increase in cache hit rate and decrease in processing time. We will now examine the performance in more detail with per-pipeline timings. \par
\begin{figure}[!t] \begin{figure}[!t]
\centering \centering
@ -103,9 +97,7 @@ The slowdown below our baseline when utilizing the \texttt{Cache} may be surpris
\label{fig:timing-results} \label{fig:timing-results}
\end{figure} \end{figure}
In Figure \ref{fig:timing-results:prefetch}, the competition for bandwidth between \(SCAN_a\) and \(SCAN_b\) is evident, with \(SCAN_a\) showing significantly longer execution times. This prolonged duration of execution leads to extended overlaps between groups still processing \(SCAN_a\) and those engaged in \(AGGREGATE\). Consequently, despite the relatively high cache hit rate, minimal speed-up is observed for \(AGGREGATE\) compared to the baseline depicted in Figure \ref{fig:timing-comparison:baseline}. The extended runtime can be attributed to the prolonged duration of \(SCAN_a\). \par
\todo{possibly mention that scanb/scana parallel have the same length as scanb is not directly affected by the bandwidth contention - it only depicts the overhead of caching and work submission - as the dsa itself performs the copy}
In Figure \ref{fig:timing-results:prefetch}, the competition for bandwidth between \(SCAN_a\) and \(SCAN_b\) is evident, with \(SCAN_a\) showing significantly longer execution times. \(SCAN_b\) is nearly unaffected, as it only performs work submission through the \texttt{Cache} and therefore does not access system memory directly. This prolonged duration of execution in \(SCAN_a\) leads to extended overlaps between groups still processing the scan and those engaged in \(AGGREGATE\). Consequently, despite the relatively high cache hit rate (see Table \ref{table:qdp-speedup}), minimal speed-up is observed for \(AGGREGATE\) compared to the baseline depicted in Figure \ref{fig:timing-comparison:baseline}. The extended runtime can be attributed to the prolonged duration of \(SCAN_a\). \par
Regarding the benchmark depicted in Figure \ref{fig:timing-results:prefetch}, where we distributed columns \texttt{a} and \texttt{b} across two nodes, the parallel execution of prefetching tasks on \gls{dsa} does not directly impede the bandwidth available to \(SCAN_a\). However, there is a discernible overhead associated with cache utilization, as evident in the time spent in \(SCAN_b\). Consequently, both \(SCAN_a\) and \(AGGREGATE\) operations experience slightly longer execution times than the theoretical peak our upper-limit in Figure \ref{fig:timing-comparison:upplimit} exhibits. \par Regarding the benchmark depicted in Figure \ref{fig:timing-results:prefetch}, where we distributed columns \texttt{a} and \texttt{b} across two nodes, the parallel execution of prefetching tasks on \gls{dsa} does not directly impede the bandwidth available to \(SCAN_a\). However, there is a discernible overhead associated with cache utilization, as evident in the time spent in \(SCAN_b\). Consequently, both \(SCAN_a\) and \(AGGREGATE\) operations experience slightly longer execution times than the theoretical peak our upper-limit in Figure \ref{fig:timing-comparison:upplimit} exhibits. \par

2
thesis/content/70_conclusion.tex

@ -18,7 +18,7 @@
In this work, our aim was to analyse the architecture and performance of the \glsentrylong{dsa} and integrate it into \glsentrylong{qdp}. We characterized the hardware and software architecture of the \gls{dsa} in Section \ref{sec:state:dsa} and provided an overview of the available programming interface, \glsentrylong{intel:dml}, in Section \ref{sec:state:dml}. Our benchmarks were tailored to the planned application and included evaluations such as copy performance from \glsentryshort{dram} to \gls{hbm} (Section \ref{subsec:perf:datacopy}), the cost of multithreaded work submission (Section \ref{subsec:perf:mtsubmit}), and an analysis of different submission methods and sizes (Section \ref{subsec:perf:submitmethod}). Notably, we observed an anomaly in inter-socket copy speeds and found that the scaling of throughput was distinctly below linear (see Figure \ref{fig:perf-dsa-analysis:scaling}). Although not all observations were explainable, the results provided important insights into the behaviour of the \gls{dsa} and its potential application in multi-socket systems and \gls{hbm}, complementing existing analyses \cite{intel:analysis}. \par In this work, our aim was to analyse the architecture and performance of the \glsentrylong{dsa} and integrate it into \glsentrylong{qdp}. We characterized the hardware and software architecture of the \gls{dsa} in Section \ref{sec:state:dsa} and provided an overview of the available programming interface, \glsentrylong{intel:dml}, in Section \ref{sec:state:dml}. Our benchmarks were tailored to the planned application and included evaluations such as copy performance from \glsentryshort{dram} to \gls{hbm} (Section \ref{subsec:perf:datacopy}), the cost of multithreaded work submission (Section \ref{subsec:perf:mtsubmit}), and an analysis of different submission methods and sizes (Section \ref{subsec:perf:submitmethod}). Notably, we observed an anomaly in inter-socket copy speeds and found that the scaling of throughput was distinctly below linear (see Figure \ref{fig:perf-dsa-analysis:scaling}). Although not all observations were explainable, the results provided important insights into the behaviour of the \gls{dsa} and its potential application in multi-socket systems and \gls{hbm}, complementing existing analyses \cite{intel:analysis}. \par
Upon applying the cache developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}, we encountered challenges related to available memory bandwidth and the lack of feature support in the \glsentryshort{api} used to interact with the \gls{dsa}. While the \texttt{Cache} represents a substantial contribution to the field, we acknowledge its limitations and effectiveness \todo{weird wording, one would acknowledge negatives but not positives too}. At present, its applicability is constrained to data that is infrequently mutated. Although support exists for entry invalidation, no ordering guarantees are provided, necessitating manual invalidation efforts by the programmer to ensure result validity \todo{efforts are not spent invalidating but ensuring validity, effort is required to remember everything in cache and then invalidate} (see Section \ref{subsec:design:restrictions}). To address this, a custom \todo{potentiall call this a cache-accessor-type and elaborate slightly on how it works, holds the data pointer internally and only allows modification through its functions which then invalidate as last step} type could be developed to automatically trigger invalidation through the cache upon modification, utilizing timestamps or unique identifiers to verify that results are based on the current state \todo{spongy wording}. \par
Upon applying the cache developed in Chapters \ref{chap:design} and \ref{chap:implementation} to \gls{qdp}, we encountered challenges related to available memory bandwidth and the lack of feature support in the \glsentryshort{api} used to interact with the \gls{dsa}. While the \texttt{Cache} represents a substantial contribution to the field, its applicability is constrained to data that is infrequently mutated. Although support exists for entry invalidation, it is rather rudimentary, requiring manual invalidation and the developer to keep track of cached blocks and ensure they are not overlapping (see Section \ref{subsec:design:restrictions}). To address this, a custom container data type could be developed to automatically trigger invalidation through the cache upon modification and adding age tags to the data, which consumer threads can pass on. This tagging can then be used to verify that work was completed with the most current version. \par
In Section \ref{sec:eval:observations}, we observed adverse effects when prefetching with the cache during the parallel execution of memory-bound operations. This necessitated data distribution across multiple \glsentrylong{numa:node}s to circumvent bandwidth competition caused by parallel caching operations. Despite this limitation, we do not consider it a major fault of the \texttt{Cache}, as existing applications designed for \gls{numa} systems are likely already optimized in this regard. \par In Section \ref{sec:eval:observations}, we observed adverse effects when prefetching with the cache during the parallel execution of memory-bound operations. This necessitated data distribution across multiple \glsentrylong{numa:node}s to circumvent bandwidth competition caused by parallel caching operations. Despite this limitation, we do not consider it a major fault of the \texttt{Cache}, as existing applications designed for \gls{numa} systems are likely already optimized in this regard. \par

Loading…
Cancel
Save