# peak-perf

- measure ddr to ddr
- measure ddr to hbm

All for 1KiB, 4KiB, 1MiB, 1GiB

All for HW and also SW path

All for intra-node, inter-node and inter-socket

--> conclude how much overhead the DSA engine has

--> conclude size after which using HW makes sense
    this point is reached when the submit overhead for
    hw execution is smaller than the entire copy time
    for sw execution

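The break-even rule above can be sketched as a small helper. This is only a minimal sketch in Python, assuming the per-size timings have already been collected; the function name `break_even_size`, the dictionary layout and the nanosecond unit are hypothetical and not taken from the benchmark code.

```python
from typing import Mapping, Optional

KiB = 1024
MiB = 1024 * KiB
GiB = 1024 * MiB

# Transfer sizes named in the plan above.
SIZES = (1 * KiB, 4 * KiB, 1 * MiB, 1 * GiB)


def break_even_size(
    hw_submit_overhead_ns: Mapping[int, float],
    sw_copy_time_ns: Mapping[int, float],
) -> Optional[int]:
    """Return the smallest measured size at which the HW submit overhead
    is smaller than the entire SW copy time, or None if no size qualifies.

    Both mappings are keyed by transfer size in bytes (assumed layout).
    """
    for size in sorted(hw_submit_overhead_ns):
        if size not in sw_copy_time_ns:
            continue  # skip sizes that were measured on only one path
        if hw_submit_overhead_ns[size] < sw_copy_time_ns[size]:
            return size
    return None
```

The helper only reports the first measured size that satisfies the rule, so with just four sizes the real crossover lies somewhere between that size and the next smaller one.
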
# submit // done

- single submit-and-wait
- multi submit
- batch submit

All with both 1 and 4 engines per WQ

All for 1KiB, 4KiB, 1MiB, 1GiB

All only on DDR and intra-node

--> conclude which work submission strategy is best for which size

--> conclude whether multiple engines significantly improve batch perf

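The measurement matrix and the first conclusion above could be evaluated along the following lines. This is a sketch under assumptions: the names `configurations` and `best_strategy_per_size` are hypothetical, and results are assumed to be stored as one throughput value per (strategy, engines per WQ, size) combination, with higher being better.

```python
from itertools import product
from typing import Dict, Iterator, Mapping, Tuple

KiB, MiB, GiB = 1024, 1024**2, 1024**3

# Values taken from the plan above.
STRATEGIES = ("single submit-and-wait", "multi submit", "batch submit")
ENGINES_PER_WQ = (1, 4)
SIZES = (1 * KiB, 4 * KiB, 1 * MiB, 1 * GiB)

Config = Tuple[str, int, int]  # (strategy, engines per WQ, size in bytes)


def configurations() -> Iterator[Config]:
    """Yield every (strategy, engines, size) combination from the plan."""
    return product(STRATEGIES, ENGINES_PER_WQ, SIZES)


def best_strategy_per_size(
    throughput: Mapping[Config, float], engines: int
) -> Dict[int, str]:
    """For a fixed engine count, map each size to the strategy with the
    highest measured throughput. Sizes without any result are omitted."""
    best: Dict[int, str] = {}
    for size in SIZES:
        candidates = {
            s: throughput[(s, engines, size)]
            for s in STRATEGIES
            if (s, engines, size) in throughput
        }
        if candidates:
            best[size] = max(candidates, key=candidates.get)
    return best
```

Calling `best_strategy_per_size` once with `engines=1` and once with `engines=4` and comparing the batch submit entries gives a first hint for the second conclusion.
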
# mtsubmit // done

- multiple threads submit to the same WQ
- use 1,2,4,8,12 threads

All using DDR and 1MiB

All for 1 vs 4 engines

All on DDR and intra-node

--> conclude how badly mt submit hurts performance

--> conclude whether multiple engines help mt submit

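One way to put numbers on the two conclusions above is to normalise each result against a baseline. The sketch below assumes one aggregate throughput value per thread count; the helper names `relative_to_single_thread` and `engine_gain` are hypothetical.

```python
from typing import Dict, Mapping

# Thread counts named in the plan above.
THREAD_COUNTS = (1, 2, 4, 8, 12)


def relative_to_single_thread(throughput: Mapping[int, float]) -> Dict[int, float]:
    """Ratio of each thread count's throughput to the 1-thread baseline.

    Values below 1.0 mean that multi-threaded submission lost throughput.
    """
    baseline = throughput[1]
    return {t: throughput[t] / baseline for t in THREAD_COUNTS if t in throughput}


def engine_gain(
    one_engine: Mapping[int, float], four_engines: Mapping[int, float]
) -> Dict[int, float]:
    """Ratio of 4-engine to 1-engine throughput for each thread count
    that was measured in both configurations."""
    return {
        t: four_engines[t] / one_engine[t]
        for t in THREAD_COUNTS
        if t in one_engine and t in four_engines
    }
```

An `engine_gain` value close to 1.0 at higher thread counts would suggest that the extra engines do not help mt submit.
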
# cross-copy // done

- compare which is faster: xcopy, copy from source node, copy from dst node

All for both inter-node and inter-socket copy using DDR and 1MiB on 4E

--> conclude where a copy thread should live

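The comparison above reduces to picking, per topology, the placement with the lowest copy time. A minimal sketch, assuming one measured copy time per (topology, placement) pair; the function name `fastest_placement` and the key layout are hypothetical, and the placement labels are copied from the list above.

```python
from typing import Dict, Mapping, Tuple

# Labels taken verbatim from the plan above.
PLACEMENTS = ("xcopy", "copy from source node", "copy from dst node")
TOPOLOGIES = ("inter-node", "inter-socket")


def fastest_placement(
    copy_time_ns: Mapping[Tuple[str, str], float]
) -> Dict[str, str]:
    """Map each topology to the placement with the lowest measured copy time.

    Keys of copy_time_ns are (topology, placement) pairs (assumed layout).
    Topologies without any result are omitted.
    """
    result: Dict[str, str] = {}
    for topo in TOPOLOGIES:
        times = {
            p: copy_time_ns[(topo, p)]
            for p in PLACEMENTS
            if (topo, p) in copy_time_ns
        }
        if times:
            result[topo] = min(times, key=times.get)
    return result
```

If both topologies return the same placement, that placement is a reasonable default for where the copy thread should live.
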