|
@ -0,0 +1,34 @@ |
|
|
|
|
|
# peak performance |
|
|
|
|
|
- measure DDR to DDR, intra-node |
|
|
|
|
|
- measure DDR to HBM, intra-node |
|
|
|
|
|
- measure DDR to DDR, inter-node |
|
|
|
|
|
- measure DDR to HBM, inter-node |
|
|
|
|
|
- measure DDR to DDR, inter-socket |
|
|
|
|
|
- measure DDR to HBM, inter-socket |
|
|
|
|
|
All for 1KiB, 4KiB, 1MiB, 1GiB |
|
|
|
|
|
All for both the HW and the SW path |
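The measurement matrix described above can be enumerated mechanically, which helps confirm that no combination is skipped. The following is a minimal sketch; the label strings and the function name `peak_performance_matrix` are illustrative assumptions and do not come from any benchmark tool.

```python
from itertools import product

# Labels are taken from the plan; the identifiers themselves are hypothetical.
MEMORY_PAIRS = ["ddr-to-ddr", "ddr-to-hbm"]
LOCALITIES = ["intra-node", "inter-node", "inter-socket"]
SIZES = {"1KiB": 1 << 10, "4KiB": 4 << 10, "1MiB": 1 << 20, "1GiB": 1 << 30}
PATHS = ["hw", "sw"]


def peak_performance_matrix():
    """Return one dict per configuration of the peak performance plan."""
    return [
        {
            "memory": memory,
            "locality": locality,
            "size_label": size_label,
            "size_bytes": SIZES[size_label],
            "path": path,
        }
        for memory, locality, size_label, path in product(
            MEMORY_PAIRS, LOCALITIES, SIZES, PATHS
        )
    ]
```

With two memory pairs, three localities, four sizes, and two paths, the list holds 2 x 3 x 4 x 2 = 48 configurations.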
|
|
|
|
|
--> conclude how much overhead the DSA engine has |
|
|
|
|
|
--> conclude the transfer size above which using HW makes sense |
|
|
|
|
|
this point is reached when the submit overhead for HW execution is smaller than the entire copy time for SW execution |
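The break-even criterion above can be evaluated directly on the measured values. The sketch below assumes two hypothetical inputs, both dicts keyed by transfer size in bytes: `sw_copy_time` (full SW copy time) and `hw_submit_overhead` (submit overhead on the HW path), in the same time unit. It returns the smallest measured size from which the criterion holds for that size and every larger one.

```python
def break_even_size(sw_copy_time, hw_submit_overhead):
    """Smallest measured size from which HW submit overhead < SW copy time.

    Both arguments map transfer size in bytes to a time in the same unit.
    Returns None if the criterion does not hold at the largest size.
    """
    result = None
    # Walk from the largest size downwards and stop at the first violation,
    # so an isolated win at a small size is not reported as the break-even.
    for size in sorted(sw_copy_time, reverse=True):
        if hw_submit_overhead[size] < sw_copy_time[size]:
            result = size
        else:
            break
    return result
```

Because only four sizes are measured, the result is a coarse bound: the true break-even lies somewhere between the returned size and the next smaller measured size.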
|
|
|
|
|
# submit // done |
|
|
|
|
|
- single submit-and-wait |
|
|
|
|
|
- multi submit |
|
|
|
|
|
- batch submit |
|
|
|
|
|
All with both 1 and 4 engines per WQ |
|
|
|
|
|
All for 1KiB, 4KiB, 1MiB, 1GiB, but only DDR to DDR, intra-node |
|
|
|
|
|
--> conclude which work submission strategy is best for which size |
|
|
|
|
|
--> conclude whether multiple engines significantly improve batch performance |
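Both conclusions can be derived from a table of measured throughputs. The sketch below assumes a hypothetical result layout, a dict mapping `(strategy, size_bytes)` to throughput where higher is better; the function names are illustrative only.

```python
def best_strategy_per_size(throughput):
    """Map each transfer size to the submission strategy with the highest throughput."""
    best = {}
    for (strategy, size), value in throughput.items():
        if size not in best or value > best[size][1]:
            best[size] = (strategy, value)
    return {size: strategy for size, (strategy, _) in sorted(best.items())}


def engine_speedup(throughput_one_engine, throughput_four_engines):
    """Ratio of 4-engine to 1-engine throughput; values near 1.0 mean little gain."""
    return throughput_four_engines / throughput_one_engine
```

What counts as a "significant" improvement is left open by the plan, so the threshold applied to `engine_speedup` is a judgment to make once the measurements are available.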
|
|
|
|
|
# MT submit |
|
|
|
|
|
- multiple threads submit to the same WQ |
|
|
|
|
|
- use 1, 2, 4, 8, 12 threads |
|
|
|
|
|
All for 1KiB, 4KiB, 1MiB, 1GiB, but only DDR to DDR, intra-node |
|
|
|
|
|
All for 1 vs 4 engines |
|
|
|
|
|
--> conclude how badly MT submit hurts performance |
|
|
|
|
|
--> conclude whether multiple engines help MT submit |
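One way to quantify the cost of multi-threaded submission is to normalize each result to the single-thread baseline. The sketch below assumes the measurements are aggregate throughput across all submitting threads, keyed by thread count; that metric and the function name are assumptions, not something the plan specifies.

```python
def relative_throughput(throughput_by_threads):
    """Normalize aggregate throughput to the 1-thread result.

    Values below 1.0 mean that submitting from that many threads to the
    same WQ lowered the total throughput compared to a single thread.
    """
    baseline = throughput_by_threads[1]
    return {
        threads: value / baseline
        for threads, value in sorted(throughput_by_threads.items())
    }
```

Running it once on the 1-engine results and once on the 4-engine results gives two curves over the thread counts; if the 4-engine curve stays closer to 1.0, multiple engines help MT submit.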
|
|
|
|
|
# cross copy // done |
|
|
|
|
|
- compare which is fastest: xcopy, copy from the source node, or copy from the destination node |
|
|
|
|
|
All for both inter-node and inter-socket copy, using DDR, 1MiB transfers, and 4 engines |
|
|
|
|
|
--> conclude where a copy thread should live |
|
|
|
|
|
|