diff --git a/benchmarks/benchmark-plan.md b/benchmarks/benchmark-plan.md
index 4950ad1..1278a15 100644
--- a/benchmarks/benchmark-plan.md
+++ b/benchmarks/benchmark-plan.md
@@ -1,33 +1,57 @@
 # peak-perf
+
 - measure ddr to ddr
 - measure ddr to hbm
+
 All for 1KiB, 4KiB, 1MiB, 1GiB
+
 All for HW and also SW path
+
 All for intra-node, inter-node and inter-socket
+
 --> conclude how much overhead DSA engine has
+
 --> conclude size after which using HW makes sense
 this point is reached when submit overhead for hw execution is smaller than entire copy time for sw execution
 
+
 # submit // done
+
 - single submit-and-wait
 - multi submit
 - batch submit
+
 All with both 1 and 4 engines per WQ
+
 All for 1KiB, 4KiB, 1MiB, 1GiB
+
 All only on DDR and intra-node
+
 --> conclude which work submission strategy is best for which size
+
 --> conclude whether multiple engines significantly improve batch perf
 
+
 # mtsubmit // done
+
 - multiple threads submit to the same WQ
 - use 1,2,4,8,12 threads
+
 All using DDR and 1MiB
+
 All for 1 vs 4 engines
+
 All on DDR and intra-node
+
 --> conclude how bad mt submit hurts performance
+
 --> conclude whether multiple engines help mt submit
 
+
 # cross-copy // done
+
 - compare which is faster: xcopy, copy from source node, copy from dst node
+
 All for both inter-node and inter-socket copy using DDR and 1MiB on 4E
+
 --> conclude where a copy thread should live
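
The break-even criterion in the peak-perf section (HW makes sense once its submit overhead is smaller than the entire SW copy time) could be evaluated with a small helper like the one below. This is a minimal sketch and not part of the patch: the function name `break_even_size` and its parameters are hypothetical, and it assumes the per-size SW copy times and the HW submit overhead have already been measured in nanoseconds.

```python
from typing import Optional, Sequence

KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

# Sizes named in the benchmark plan.
PLANNED_SIZES = [1 * KIB, 4 * KIB, 1 * MIB, 1 * GIB]


def break_even_size(
    sizes: Sequence[int],
    sw_copy_time_ns: Sequence[float],
    hw_submit_overhead_ns: float,
) -> Optional[int]:
    """Return the smallest size at which HW execution starts to pay off.

    Hypothetical helper: applies the plan's criterion that the HW submit
    overhead must be smaller than the whole SW copy time for that size.
    Returns None if no measured size satisfies the criterion.
    """
    if len(sizes) != len(sw_copy_time_ns):
        raise ValueError("sizes and sw_copy_time_ns must have equal length")

    # Walk the sizes in ascending order so the first hit is the smallest one.
    for size, sw_ns in sorted(zip(sizes, sw_copy_time_ns)):
        if hw_submit_overhead_ns < sw_ns:
            return size
    return None
```

Called with `PLANNED_SIZES` and the measured timings, it returns the first planned size at which the HW path should be preferred, or `None` if the submit overhead is never amortized within the measured range.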