Benchmark optimization (1)


Source code: whetstone.c

Base compiler flags: -std=c89 -DDP -DROLL -lm

With these flags the code compiles with no warnings and no errors.

Let's get started.

GCC first:

1. Simply compile with the base flags and run:

Rolled Double  Precision 703148 Kflops ; 2048 Reps

2. 703148 Kflops is too slow, so we add the -O4 flag to optimize the loops (GCC treats any level above -O3 the same as -O3), then compile again and run it:

Rolled Double  Precision 4177105 Kflops ; 2048 Reps

Better now!

Next we add a few more flags:

gcc -std=c89 -DDP  -DROLL -O4 -ffast-math -funroll-all-loops -mavx whetstone.c -fopenmp -lm -o b.out

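-ffast-math makes floating-point code faster but sacrifices accuracy, because the compiler no longer has to follow strict IEEE rules and may, for example, reorder arithmetic.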

-mavx allows the compiler to use AVX instructions.

5340310 Kflops now!


ICC next:

1. Simply compile with the base flags and run:

Rolled Double  Precision 4636137 Kflops ; 2048 Reps

This looks good at first. But adding -O3 does not make the program any faster (icc already optimizes at -O2 by default), so we start thinking about parallelization.

The -xHost flag improves the result by about 14%.

2. Parallel methods:

First of all we have to run vtune_amplifier_xe. It is installed in /opt/intel/vtune_amplifier_xe_xxx/bin64; run /opt/intel/vtune_amplifier_xe_xxx/bin64/amplxe-gui and the application window will appear. (PS: xxx stands for the version of vtune_amplifier_xe.)

Run this command as root so that VTune is allowed to attach to running processes:

root# echo 0 > /proc/sys/kernel/yama/ptrace_scope

Then follow the tutorial hotspots_amplxe_lin.pdf.

VTune shows these hotspots:

[Screenshot: DeepinScreenshot20160229191414]

It also shows the CPU utilization:

[Screenshot: DeepinScreenshot20160229191713]

Poor! Now we have to consider parallelizing it.

