% Bachelor's thesis with associated TeX files and code snippets.
% Topic: Data Movement in Heterogeneous Memories with Intel Data Streaming Accelerator
\chapter{Performance Microbenchmarks}
\label{sec:perf}
% This is the central chapter of the thesis. It presents the goal as well as
% your own ideas, assessments, and design decisions. It can be worthwhile
% to work through several alternatives and then explicitly justify why
% a particular one was chosen. This chapter should be sketched, at least
% in keywords, as soon as the first design decisions are made.
% In a normally progressing thesis, however, it will keep changing.
% The chapter must not become too detailed, or the reader will get bored.
% It is very important to find the right level of abstraction. While
% writing, keep the reusability of the text in mind.
% If a publication based on the thesis is planned, parts of this chapter
% can be reused for it. The chapter will usually have at least 8 pages;
% more than 20 may indicate that the level of abstraction was missed.
\begin{itemize}
\item
\end{itemize}
\section{Benchmarking Methodology}
\begin{itemize}
\item
\end{itemize}
\section{Submission Method}
\begin{itemize}
\item submit cost analysis: identify the best submission method and, for a subset of configurations, the point at which the submit cost drops below the time saved
\item display the full opt-submitmethod graph
\item consider remeasuring with a larger number of small copies, since the results for 1k and 4k look somewhat irregular
\item display the stacked bars of submit and completion time for single@1k, single@4k, and single@1mib on the HW path and the SW path
\item display the stacked bars of submit and completion time for batch50@1k, batch50@4k, and batch50@1mib on the HW path and the SW path
\item show the batch results because we care about the minimum task set size for a single producer (multi-submit would be used for different task sets)
\item conclude at which point using the DSA makes sense
\end{itemize}
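% Illustrative sketch only: the symbols below are assumed notation, not taken from the measurements.
The break-even point named in the list above can be written down as a simple inequality. The notation is an assumption made here for illustration: $t_{\mathrm{sub}}(s)$ and $t_{\mathrm{cmp}}(s)$ stand for the submit and completion time of a copy of size $s$ on the HW path, and $t_{\mathrm{sw}}(s)$ for the total time of the same copy on the SW path. Under that reading, offloading pays off once
\[
  t_{\mathrm{sub}}(s) < t_{\mathrm{sw}}(s) - t_{\mathrm{cmp}}(s),
\]
that is, once the submit cost is smaller than the time saved. The minimum worthwhile transfer size would then be
\[
  s^{*} = \min \{\, s : t_{\mathrm{sub}}(s) + t_{\mathrm{cmp}}(s) < t_{\mathrm{sw}}(s) \,\}.
\]
The same condition can be evaluated separately for single and for batch submission.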
\section{Multithreaded Submission}
\begin{itemize}
\item effect of multithreaded submission (mt-submit): low, because the \gls{dsa:swq} is implicitly synchronized and its bandwidth is shared
\item show results for all available core counts
\item only display the 1-engine tests
\item show the combined total throughput
\item conclude that, because the cost of the implicit synchronization also applies to a single thread (1t), the thread count makes no difference; bandwidth is shared, with no guarantees on fairness
\end{itemize}
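% Illustrative sketch only: $n$, $b_i$, $t_i$ and $B_{\mathrm{total}}$ are assumed notation.
One possible way to define the combined total throughput, assuming hypothetically that each of $n$ concurrently running threads $i$ copies $b_i$ bytes in time $t_i$, is
\[
  B_{\mathrm{total}}(n) = \sum_{i=1}^{n} \frac{b_i}{t_i}.
\]
If the bandwidth is shared as described in the list above, one would expect $B_{\mathrm{total}}(n) \approx B_{\mathrm{total}}(1)$ regardless of $n$, while the absence of fairness guarantees means that the individual shares $b_i / t_i$ need not be equal.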
\section{Multiple Engines in a Group}
\begin{itemize}
\item assumption from the architecture specification: multiple engines lead to greater performance
\item reason: page faults and access latency are overlapped with preparing the next operation
\item in the given scenario we observe the opposite, a slight performance decrease
\item show multisubmit 50 for both 1 engine (1e) and 4 engines (4e)
\item consider remeasuring with each submission accessing a different memory region
\item conclusion?
\end{itemize}
\section{Data Movement from DDR to HBM}
\begin{itemize}
\item present two copy methods: smart and brute force
\item show graphs for DDR$\rightarrow$HBM intranode, DDR$\rightarrow$HBM intrasocket, and DDR$\rightarrow$HBM intersocket
\item conclude which option makes more sense (smart)
\item reason: brute force needs $4\times$ or $2\times$ the utilization for only a $1.5\times$ or $1.25\times$ speedup, respectively
\item possibly benchmark one intersocket smart copy running in parallel with two intrasocket smart copies versus the same task with brute force
\end{itemize}
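% Worked arithmetic on the factors quoted in the list; the efficiency measure $\eta$ is an assumed definition.
The trade-off behind this recommendation can be made explicit with a simple ratio. As an assumed measure (not defined elsewhere in this chapter), let $\eta$ be the speedup divided by the factor by which utilization grows, both taken relative to the smart copy, so that the smart copy itself has $\eta = 1$. With the factors quoted in the list above this gives
\[
  \eta_{4\times} = \frac{1.5}{4} = 0.375, \qquad
  \eta_{2\times} = \frac{1.25}{2} = 0.625 .
\]
In both cases the additional utilization yields a less than proportional speedup, which is the sense in which the smart method appears to be the better choice.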
\section{Analysis}
\begin{itemize}
\item summarize the conclusions and define the point at which using the DSA makes sense
\item minimum transfer size for batch and non-batch operation
\item effect of multithreaded submission: no fairness guarantees
\item usage of multiple engines: no effect
\item smart copy method as the middle ground between peak throughput and utilization
\item lower utilization of the DSA is beneficial when it is shared between threads or processes
\end{itemize}
\cleardoublepage
%%% Local Variables:
%%% TeX-master: "diplom"
%%% End: