diff --git a/benchmarks/benchmark-plan.md b/benchmarks/benchmark-plan.md
index 26cea13..4950ad1 100644
--- a/benchmarks/benchmark-plan.md
+++ b/benchmarks/benchmark-plan.md
@@ -1,12 +1,9 @@
-# peak performance
-- meassure ddr to ddr, intra-node
-- meassure ddr to hbm, intra-node
-- meassure ddr to ddr, inter-node
-- meassure ddr to hbm, inter-node
-- meassure ddr to ddr, inter-socket
-- meassure ddr to hbm, inter-socket
+# peak-perf
+- meassure ddr to ddr
+- meassure ddr to hbm
 All for 1KiB, 4KiB, 1MiB, 1GiB
 All for HW and also SW path
+All for intra-node, inter-node and inter-socket
 --> conclude how much overhead DSA engine has
 --> conclude size after which using HW makes sense
 this point is reached when submit overhead for
@@ -17,17 +14,19 @@ All for HW and also SW path
 - multi submit
 - batch submit
 All with both 1 and 4 engines per WQ
-All for 1KiB, 4KiB, 1MiB, 1GiB but only ddr-ddr intra node
+All for 1KiB, 4KiB, 1MiB, 1GiB
+All only on DDR and intra-node
 --> conclude which work submission strategy is best for which size
 --> conclude whether multiple engines significantly improve batch perf
-# MT submit
+# mtsubmit // done
 - multiple threads submit to the same WQ
 - use 1,2,4,8,12 threads
-All for 1KiB, 4KiB, 1MiB, 1GiB but only ddr-ddr intra node
+All using DDR and 1MiB
 All for 1 vs 4 engines
+All on DDR and intra-node
 --> conclude how bad mt submit hurts performance
 --> conclude whether multiple engines help mt submit
-# cross copy // done
+# cross-copy // done
 - compare which is faster: xcopy, copy from source node, copy from dst node
 All for both inter-node and inter-socket copy using DDR and 1MiB on 4E
 --> conclude where a copy thread should live