Benchmark optimization (1)
Source code: whetstone.c
Base compiler flags: -std=c89 -DDP -DROLL -lm
The build gives no warnings and no errors.
Let's begin.
GCC first:
1. Simply build with the base flags and run:
Rolled Double Precision 703148 Kflops ; 2048 Reps
2. 703148 Kflops is too slow, so we add the flag -O4 to optimize the loops (GCC treats any level above -O3 as -O3), then compile again and run:
Rolled Double Precision 4177105 Kflops ; 2048 Reps
Better now: about 5.9 times faster.
Next we add these flags:
gcc -std=c89 -DDP -DROLL -O4 -ffast-math -funroll-all-loops -mavx whetstone.c -fopenmp -lm -o b.out
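-ffast-math makes floating-point code faster but sacrifices accuracy, because the compiler no longer has to follow strict IEEE floating-point rules.
-mavx lets the compiler use AVX instructions.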
5340310 Kflops now!
ICC next:
1. Simply build with the base flags and run:
Rolled Double Precision 4636137 Kflops ; 2048 Reps
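This looks good at first, but adding the flag -O3 does not make the program any faster (ICC already optimizes at -O2 by default), so we consider parallel methods.
The flag -xHost, which targets the instruction set of the machine doing the compiling, improves the score by about 14%.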
2. Parallel methods:
First we have to run vtune_amplifier_xe. It is installed in /opt/intel/vtune_amplifier_xe_xxx/bin64 (xxx stands for the version of vtune_amplifier_xe). Run /opt/intel/vtune_amplifier_xe_xxx/bin64/amplxe-gui and the program window appears.
Before profiling, run this command as root so that the profiler is allowed to attach to running processes:
root# echo 0 > /proc/sys/kernel/yama/ptrace_scope
Then follow the tutorial hotspots_amplxe_lin.pdf.
The result shows the hotspots:
[screenshot: hotspot list]
It also shows the CPU utilization:
[screenshot: CPU utilization]
Poor! Now we have to consider parallelizing the program.