From 3f5f5f267d6c570828c191c89da91d54ca8ed449 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Constantin=20F=C3=BCrst?=
Date: Tue, 9 Jan 2024 13:45:20 +0100
Subject: [PATCH] note down bullet points for the content of chapter 3
 (performance)

---
 thesis/content/30_performance.tex | 36 ++++++++++++++++++++++++++++---
 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/thesis/content/30_performance.tex b/thesis/content/30_performance.tex
index 0dae308..508d9d0 100644
--- a/thesis/content/30_performance.tex
+++ b/thesis/content/30_performance.tex
@@ -25,31 +25,61 @@
 \section{Benchmarking Methodology}
 
 \begin{itemize}
-    \item
+    \item 
 \end{itemize}
 
 \section{Submission Method}
 
 \begin{itemize}
     \item submit cost analysis: best method and for a subset the point at which submit cost < time savings
+    \item display the full opt-submitmethod graph
+    \item maybe remeasure with a higher number of small copies? results look somewhat odd for 1k and 4k
+    \item display the stacked bar of submit and complete time for single@1k, single@4k, single@1mib for HW-path and SW-path
+    \item display the stacked bar of submit and complete time for batch50@1k, batch50@4k, batch50@1mib for HW-path and SW-path
+    \item show batch because we care about the minimum task set size for a single producer (multi-submit would be used for different task sets)
+    \item conclude at which point using the DSA makes sense (see the submission sketch below)
 \end{itemize}
 
 \section{Multithreaded Submission}
 
 \begin{itemize}
     \item effect of mt-submit, low because \gls{dsa:swq} implicitly synchronized, bandwidth is shared
+    \item show results for all available core counts
+    \item only display the 1-engine tests
+    \item show combined total throughput
+    \item conclude that, due to the implicit synchronization, the synchronization cost also affects single-threaded submission (1t) and the thread count therefore makes no difference; bandwidth is shared and there are no fairness guarantees (see the multi-threaded sketch below)
+\end{itemize}
+
+\section{Multiple Engines in a Group}
+
+\begin{itemize}
+    \item assumed from the architecture specification that multiple engines lead to greater performance
+    \item the reason is that page faults and access latency can be overlapped with preparing the next operation
+    \item in the given scenario we observe the opposite: a slight performance decrease
+    \item show multi-submit 50 for both 1e and 4e
+    \item maybe remeasure with each submission accessing a different memory region?
+    \item conclusion?
 \end{itemize}
 
 \section{Data Movement from DDR to HBM}
 
 \begin{itemize}
-    \item
+    \item present two copy methods: smart and brute force (see the split-copy sketch below)
+    \item show graph for DDR->HBM intra-node, DDR->HBM intra-socket, DDR->HBM inter-socket
+    \item conclude which option makes more sense (smart)
+    \item because brute force costs 4x or 2x the utilization for only 1.5x or 1.25x speedup respectively
+    \item maybe benchmark smart-copy inter-socket in parallel with two smart-copies intra-socket vs. the same task with brute force
 \end{itemize}
 
 \section{Analysis}
 
 \begin{itemize}
-    \item
+    \item summarize the conclusions and define the point at which the DSA makes sense
+    \item minimum transfer size for batch/non-batch operation
+    \item effect of mt-submit -> no fairness guarantees
+    \item usage of multiple engines -> no effect
+    \item smart copy method as the middle ground between peak throughput and utilization
+    \item lower utilization of the DSA is good when it will be shared between threads/processes
 \end{itemize}
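
Submission sketch: the "Submission Method" notes split each measurement into submit time and completion time on the HW-path (DSA) and SW-path (CPU). The following is only a minimal sketch of that split, assuming the Intel DML C++ API as the submission interface; the buffer size, the single-shot measurement, and the choice of dml::mem_move are placeholders, not the thesis benchmark setup.

// Minimal sketch of separating submit time from completion time for a single
// copy descriptor. Assumes the Intel DML C++ API; the buffer size is a
// placeholder, and a real benchmark would repeat and aggregate measurements.
#include <dml/dml.hpp>

#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    constexpr std::size_t size = 1u << 20;           // 1 MiB, placeholder
    std::vector<std::uint8_t> src(size, 0x2a);
    std::vector<std::uint8_t> dst(size, 0x00);

    using clock = std::chrono::steady_clock;

    // Submit on the hardware path (DSA); dml::software would run the same
    // operation on the CPU and serves as the SW-path baseline.
    auto const submit_begin = clock::now();
    auto handler = dml::submit<dml::hardware>(
        dml::mem_move,
        dml::make_view(src.data(), src.size()),
        dml::make_view(dst.data(), dst.size()));
    auto const submit_end = clock::now();

    // Completion: block until the descriptor has been processed.
    auto result = handler.get();
    auto const complete_end = clock::now();

    if (result.status != dml::status_code::ok) {
        std::cerr << "mem_move failed\n";
        return 1;
    }

    auto const micros = [](clock::time_point a, clock::time_point b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    std::cout << "submit:   " << micros(submit_begin, submit_end) << " us\n"
              << "complete: " << micros(submit_end, complete_end) << " us\n";
    return 0;
}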
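
Multi-threaded sketch: for the "Multithreaded Submission" notes, the following shows several threads issuing copies against the shared work queue with no software-side locking, which is the implicit synchronization the notes refer to (descriptor submission to a shared work queue is serialized by the hardware itself). Again a sketch assuming Intel DML; thread count and per-thread buffer size are placeholders.

// Sketch of multithreaded submission: every thread issues its own copy to the
// shared work queue without explicit coordination. Assumes Intel DML; thread
// count and per-thread buffer size are placeholders.
#include <dml/dml.hpp>

#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    constexpr std::size_t thread_count = 4;      // placeholder
    constexpr std::size_t size = 1u << 20;       // 1 MiB per thread, placeholder

    std::vector<std::thread> workers;
    workers.reserve(thread_count);

    for (std::size_t i = 0; i < thread_count; ++i) {
        workers.emplace_back([] {
            std::vector<std::uint8_t> src(size, 0x2a);
            std::vector<std::uint8_t> dst(size, 0x00);

            // Synchronous submit-and-wait; submission to the shared work
            // queue needs no locking on the software side.
            auto result = dml::execute<dml::hardware>(
                dml::mem_move,
                dml::make_view(src.data(), src.size()),
                dml::make_view(dst.data(), dst.size()));

            if (result.status != dml::status_code::ok) {
                std::cerr << "copy failed in worker thread\n";
            }
        });
    }

    for (auto& worker : workers) {
        worker.join();
    }
    return 0;
}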
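
Split-copy sketch: the "Data Movement from DDR to HBM" notes compare a smart and a brute-force copy; the sketch below is neither of those methods, only an illustration of the underlying mechanism of splitting one transfer into independently submitted descriptors and waiting on all of them. It assumes Intel DML; the total size is a placeholder, the destination is plain heap memory rather than HBM (placing it on an HBM NUMA node, e.g. via libnuma, is omitted), and which DSA device or engine serves each half is left to the system configuration and NUMA placement rather than controlled here.

// Illustration of splitting one transfer into two independently submitted
// descriptors that can be in flight at the same time. Not the smart or
// brute-force copy from the notes above; sizes and placement are placeholders.
#include <dml/dml.hpp>

#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    constexpr std::size_t size = 1u << 26;       // 64 MiB, placeholder
    std::vector<std::uint8_t> src(size, 0x2a);
    std::vector<std::uint8_t> dst(size, 0x00);

    const std::size_t half = size / 2;

    // Submit both halves before waiting on either.
    auto first = dml::submit<dml::hardware>(
        dml::mem_move,
        dml::make_view(src.data(), half),
        dml::make_view(dst.data(), half));
    auto second = dml::submit<dml::hardware>(
        dml::mem_move,
        dml::make_view(src.data() + half, size - half),
        dml::make_view(dst.data() + half, size - half));

    auto const first_result = first.get();
    auto const second_result = second.get();

    bool const ok = first_result.status == dml::status_code::ok &&
                    second_result.status == dml::status_code::ok;
    std::cout << (ok ? "copy completed" : "copy failed") << '\n';
    return ok ? 0 : 1;
}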