March log (updated continuously)


A record of my daily life in March 2016, starting from the 24th.

24th: Kept reading the scikit-learn Python library, continued with introductory machine learning and some linear algebra, and watched part of the Coursera material on gradient descent and the normal equation.

25th: Computer organization; wrote a blog post (unfinished).

26th: Assembly language + machine learning.

27th: Machine learning, vocabulary, watched sentdex's image-processing videos, wrote the assembly report.

Categories: Uncategorized

Benchmark optimization (1)


    source code: whetstone.c

base compiler flags: -std=c89 -DDP  -DROLL -lm

no warnings, no errors

Let's begin.

    GCC First:

1. Simply compile and run:

    Rolled Double  Precision 703148 Kflops ; 2048 Reps

2. 703148 Kflops is too slow, so we add the -O4 flag to optimize the loops, compile again, and run it:

    Rolled Double  Precision 4177105 Kflops ; 2048 Reps

    better now!

Now let's move on to these flags:

    gcc -std=c89 -DDP  -DROLL -O4 -ffast-math -funroll-all-loops -mavx whetstone.c -fopenmp -lm -o b.out

-ffast-math makes the code faster but sacrifices some floating-point accuracy.

-mavx tells the compiler to use the AVX instruction set.
5340310 Kflops now!
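
To make the flags a bit more concrete, here is a rough, hypothetical sketch (this loop is not taken from whetstone.c) of the kind of code they target: -ffast-math lets the compiler reassociate the floating-point additions in the reduction, which is what allows -mavx to vectorize it, and -funroll-all-loops unrolls whatever remains.

/* Hypothetical reduction loop, not part of whetstone.c.
 * With -ffast-math the compiler may reorder the additions,
 * which lets -mavx vectorize the loop; -funroll-all-loops
 * then unrolls the remaining loop body. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}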

     

ICC next:

1. Simply compile and run:

    Rolled Double  Precision 4636137 Kflops ; 2048 Reps

This looks good at first, but adding the -O3 flag doesn't make the program any faster, so we turn to parallel methods.

The -xHost flag improves performance by about 14%.

2. Parallel methods:

First of all we have to run vtune_amplifier_xe. The software lives in /opt/intel/vtune_amplifier_xe_xxx/bin64; run /opt/intel/vtune_amplifier_xe_xxx/bin64/amplxe-gui and the GUI window appears. (ps: xxx is the version of vtune_amplifier_xe)

Run this command (as root):

    root# echo 0 > /proc/sys/kernel/yama/ptrace_scope

Then follow the tutorial: hotspots_amplxe_lin.pdf

It shows these hotspots:

[screenshot: VTune hotspots view]

It also shows the CPU utilization:

[screenshot: VTune CPU utilization view]

Poor! Now we have to consider parallelizing it.
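
The real whetstone loops have dependencies between iterations, so the code below is only a minimal sketch of the OpenMP direction (the -fopenmp flag was already on the gcc command line above), assuming a hypothetical hot loop whose iterations are independent:

/* Hypothetical independent loop, not the actual whetstone hotspot.
 * Compile with -fopenmp so the pragma takes effect; the iterations
 * are then split across the available threads. */
void axpy(double a, const double *x, double *y, int n)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}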

    Categories: Programming

CUDA learning (2) – a simple parallel CUDA program


Now we use the function add, with this code:
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

    add() runs on the device, so a, b and c must point to device memory

    but we can allocate memory on the GPU

    we can use cudaMalloc(), cudaFree(), cudaMemcpy() to handle device memory

Now here comes a simple program:

#include <stdio.h>

__global__ void add(int *a, int *b, int *c)
{
    *c = *a + *b;
}

int main(void)
{
    int a, b, c;              // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    a = 2;
    b = 7;
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

    So how do we run code in parallel on the device?

Change “add<<<1,1>>>()” to “add<<<N,1>>>()”: instead of executing add() once, we execute it N times in parallel (N means N blocks).

    Vector Addition on the Device

Terminology: each parallel invocation of add() is referred to as a block. Each invocation can refer to its block index using blockIdx.x.

    then we change the add function:

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

     

So a, b and c become three arrays, and something has to change in main() accordingly.

Maybe we can use a function random_ints() to fill them.
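
A minimal sketch of the updated main(), assuming N blocks and a hypothetical random_ints() helper that simply fills an array; the value of N, the body of random_ints(), and the final printf are placeholders of mine, not from the slides:

#include <stdio.h>
#include <stdlib.h>

#define N 512   /* number of blocks; 512 is an arbitrary choice here */

/* hypothetical helper: fill an array with random integers */
void random_ints(int *x, int n)
{
    int i;
    for (i = 0; i < n; i++)
        x[i] = rand() % 100;
}

__global__ void add(int *a, int *b, int *c)
{
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

int main(void)
{
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Allocate and fill host copies of a, b, c
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Print one element just to check (placeholder)
    printf("c[0] = %d (expected %d)\n", c[0], a[0] + b[0]);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}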

Categories: Programming

CUDA learning (1) – starting from hello-world


    learn from http://www.nvidia.com/docs/io/116711/sc11-cuda-c-basics.pdf

First come some concepts:

    • Heterogeneous Computing
    • Blocks
    • Threads
    • Indexing
    • Shared memory
    • __syncthreads()
    • Asynchronous operation
    • Handling errors
    • Managing devices

Now we have to understand what “Heterogeneous Computing” is. From the 8th page we can easily see that it involves device code and host code, parallel code and serial code. The next slide is more intuitive: data lives on both the CPU and the GPU sides, and the parallel code is executed on the GPU.

     

(ps: GigaThread™: this engine provides up to 10x faster context switching compared to previous-generation architectures, concurrent kernel execution, and improved thread block scheduling.)

So we add the device code, and then the “hello world” program looks like this:

#include <stdio.h>  // do not forget this!

__global__ void mykernel(void) {
}

int main(void)
{
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

__global__ is a CUDA C/C++ keyword that indicates the function:

    • Runs on the device
    • Is called from host code

And nvcc separates source code into host and device components (where is the nvcc command? see the ps below).

Device functions (e.g. mykernel()) are processed by the NVIDIA compiler, and host functions (e.g. main()) are processed by the standard host compiler.

(ps: for Arch Linux users, you can install the NVIDIA toolkit with “pacman -S cuda”, and you may have to restart before you can use nvcc.)
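
Once nvcc is available, building and running the hello-world program above is as simple as this (assuming the source is saved as hello.cu):

nvcc hello.cu -o hello
./hello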

The function mykernel does nothing here; we'll explain what “<<<1,1>>>” does in a moment.

Now I have to finish the parallelism learning I left unfinished before:

    reference book: CSAPP

(ps: nothing worth writing here yet; maybe I'll add something in subsequent sections.)

     

     

Categories: Programming