@ -27,6 +27,8 @@ With this difficult scenario, we expect to spend time analysing runtime behaviou
Consider using parts of the flamegraph. Same speed as DRAM, even though the allocation is performed inside the timed region and not before it. Mention that DML performs busy waiting (cite dsa-paper 4.4 for the use of interrupts mentioned in arch) and the optimization with weak wait. Mention the weak access optimization for the prefetching scenario.
Scan A is memory-bound, so copying B from DRAM to HBM cuts directly into the bandwidth available to A when both are located on the same node. Performance is better when A and B are stored on different nodes.
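
The bandwidth argument can be illustrated with a rough back-of-the-envelope model. This is only a sketch under simplifying assumptions: every node offers the same read bandwidth, the scan of A and the copy of B run concurrently and are purely bandwidth-bound, and the write traffic to HBM is ignored. The names `Estimate` and `estimate_scan_time` and the parameters are hypothetical and not taken from the benchmark code.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical toy model, not a measurement: estimated time until both
// the scan of A and the copy of B have read all their source bytes.
struct Estimate {
    double same_node_s;       // A and B reside on the same node
    double different_nodes_s; // A and B reside on different nodes
};

// bytes_a: bytes read by the scan of A
// bytes_b: bytes read from DRAM by the copy of B
// node_bw: assumed read bandwidth of a single node in bytes per second
Estimate estimate_scan_time(double bytes_a, double bytes_b, double node_bw) {
    assert(node_bw > 0.0);
    // Same node: both read streams share one node's bandwidth,
    // so the transfers effectively add up.
    const double same = (bytes_a + bytes_b) / node_bw;
    // Different nodes: each stream has its own bandwidth,
    // so the larger transfer determines the total time.
    const double different = std::max(bytes_a, bytes_b) / node_bw;
    return {same, different};
}
```

Under these assumptions, scanning 8 GB of A while copying 4 GB of B at an assumed 100 GB/s per node gives roughly 0.12 s on the same node versus 0.08 s on different nodes. The real gap depends on factors the model leaves out, such as interconnect limits between nodes.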