CUDA learning(2)–simple parallelism cuda program


    Now we use the function add(), defined as follows:

    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;
    }

    add() runs on the device, so a, b and c must point to device memory.

    That means we need to allocate memory on the GPU.

    We can use cudaMalloc(), cudaFree() and cudaMemcpy() to handle device memory; they are the device counterparts of C's malloc(), free() and memcpy().
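To see the three calls working together before any kernel is involved, here is a minimal sketch that copies one int to the device and straight back. The CHECK macro and the round_trip() function are names made up for this illustration; they are not part of the CUDA API.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): stop with a message if a
// CUDA runtime call returns anything other than cudaSuccess.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Hypothetical function: copy one int to device memory and back again,
// returning the value that was read back.
int round_trip(int value) {
    int result = 0;
    int *d_value = NULL;

    CHECK(cudaMalloc((void **)&d_value, sizeof(int)));          // allocate on the GPU
    CHECK(cudaMemcpy(d_value, &value, sizeof(int),
                     cudaMemcpyHostToDevice));                  // host -> device
    CHECK(cudaMemcpy(&result, d_value, sizeof(int),
                     cudaMemcpyDeviceToHost));                  // device -> host
    CHECK(cudaFree(d_value));                                   // release GPU memory
    return result;
}

int main(void) {
    printf("%d\n", round_trip(42));
    return 0;
}
```

Note that a device pointer such as d_value must not be dereferenced in host code; it is only passed to the runtime calls and to kernels.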

    Here is a simple complete program:

    #include <stdio.h>

    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;
    }

    int main(void) {
        int a, b, c;              // host copies of a, b, c
        int *d_a, *d_b, *d_c;     // device copies of a, b, c
        int size = sizeof(int);

        // Allocate space for device copies of a, b, c
        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // Set up input values
        a = 2;
        b = 7;

        // Copy inputs to device
        cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

        // Launch add() kernel on GPU
        add<<<1,1>>>(d_a, d_b, d_c);

        // Copy result back to host
        cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

        // Cleanup
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

    So how do we run code in parallel on the device?

    Change “add<<< 1, 1 >>>();” to “add<<< N, 1 >>>();”. Instead of executing add() once, this executes it N times in parallel, where N is the number of blocks.

    Vector Addition on the Device

    Terminology: each parallel invocation of add() is referred to as a block, and the set of blocks is referred to as a grid. Each invocation can refer to its block index using blockIdx.x.
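A small sketch can make blockIdx.x concrete: each block writes its own index into one slot of an array, and the host prints the array afterwards. The names record_block_index() and collect_block_indices(), and the choice of 8 blocks, are assumptions made only for this illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define NUM_BLOCKS 8   // arbitrary choice for this illustration

// Each block stores its own index in the slot that belongs to it.
__global__ void record_block_index(int *out) {
    out[blockIdx.x] = blockIdx.x;
}

// Hypothetical helper: launch n blocks of one thread each and copy
// the recorded indices back into the host array out_host.
void collect_block_indices(int *out_host, int n) {
    int *d_out = NULL;
    cudaMalloc((void **)&d_out, n * sizeof(int));

    record_block_index<<<n, 1>>>(d_out);

    // This copy runs on the same default stream as the kernel,
    // so it starts only after the kernel has finished.
    cudaMemcpy(out_host, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}

int main(void) {
    int indices[NUM_BLOCKS];
    collect_block_indices(indices, NUM_BLOCKS);
    for (int i = 0; i < NUM_BLOCKS; ++i)
        printf("slot %d was written by block %d\n", i, indices[i]);
    return 0;
}
```

The blocks may run in any order, but because each one writes only to its own slot, the contents of the array do not depend on that order.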

    Then we change the add function so that each block handles a different index:

    __global__ void add(int *a, int *b, int *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }


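    Now a, b and c are three arrays, so main() has to change as well: it must allocate N elements for each array and copy whole arrays to and from the device.

    To fill the input arrays we can use a helper function such as random_ints().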
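Putting these pieces together, the host side might look roughly like the sketch below. The body of random_ints() is a guess, since the text only names the function; N = 512 and the vector_add_mismatches() wrapper are also assumptions for this example. Error checking is left out to keep it short.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 512   // number of elements, and also the number of blocks (assumed value)

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

// Assumed implementation: fill x with n small pseudo-random integers.
void random_ints(int *x, int n) {
    for (int i = 0; i < n; ++i)
        x[i] = rand() % 100;
}

// Hypothetical wrapper: run the addition on the GPU and return how many
// elements differ from the same addition done on the CPU.
int vector_add_mismatches(void) {
    int size = N * sizeof(int);
    int *a = (int *)malloc(size);       // host copies of a, b, c
    int *b = (int *)malloc(size);
    int *c = (int *)malloc(size);
    int *d_a, *d_b, *d_c;               // device copies of a, b, c

    random_ints(a, N);
    random_ints(b, N);

    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy the whole input arrays to the device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() with N blocks of one thread each
    add<<<N, 1>>>(d_a, d_b, d_c);

    // Copy the whole result array back to the host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (c[i] != a[i] + b[i])
            ++mismatches;

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return mismatches;
}

int main(void) {
    printf("mismatches: %d\n", vector_add_mismatches());
    return 0;
}
```

Compared with the single-integer version, the only changes are the array allocations, the larger copy sizes, and the N in the launch configuration.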

    Categories: Programming

    CUDA learning(1)–start from hello-world


    learn from

    First, some concepts:

    • Heterogeneous Computing
    • Blocks
    • Threads
    • Indexing
    • Shared memory
    • __syncthreads()
    • Asynchronous operation
    • Handling errors
    • Managing devices

    Now we have to understand what “Heterogeneous Computing” is. From the 8th page, we can easily see that it involves device code, host code, parallel code and serial code. The next part is more intuitive: data moves between the CPU and the GPU, and the parallel code is executed on the GPU.


    (PS: GigaThread™: this engine provides up to 10x faster context switching compared to previous-generation architectures, concurrent kernel execution, and improved thread block scheduling.)

    So we add the device code, and the “hello world” program looks like this:

    #include <stdio.h> // do not forget this!

    __global__ void mykernel(void) {
    }

    int main(void) {
        mykernel<<<1,1>>>();
        printf("Hello World!\n");
        return 0;
    }


    __global__ is a CUDA C/C++ keyword. It indicates a function that:

    • Runs on the device
    • Is called from host code

    nvcc separates the source code into host and device components (where is the nvcc command?).

    Device functions (e.g. mykernel()) are processed by the NVIDIA compiler, and host functions (e.g. main()) are processed by the standard host compiler.

    (PS: Arch Linux users can install the NVIDIA CUDA toolkit with “pacman -S cuda”, and you may have to restart before you can use nvcc.)

    The function mykernel() does nothing here, and we'll explain what “<<<1,1>>>” does in a moment.
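Because mykernel() produces no output, nothing in the program shows whether the launch actually worked. One way to check, sketched below, uses the runtime calls cudaGetLastError() and cudaDeviceSynchronize(); the launch_and_wait() wrapper is a name invented for this example.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void mykernel(void) {
}

// Hypothetical wrapper: launch the empty kernel and report any error.
cudaError_t launch_and_wait(void) {
    mykernel<<<1,1>>>();

    // Errors detected at launch time (e.g. a bad configuration)
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        return err;

    // Kernel launches are asynchronous: wait for the kernel to finish
    // and pick up any error raised while it was running.
    return cudaDeviceSynchronize();
}

int main(void) {
    cudaError_t err = launch_and_wait();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Hello World!\n");
    return 0;
}
```

On a machine without a usable GPU or driver, this version reports an error instead of printing the greeting, whereas the plain program prints “Hello World!” either way.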

    Now I have to finish the parallelism study that I left off earlier:

    Reference book: CSAPP

    (PS: I have found nothing worth writing about yet; maybe I'll write something in subsequent sections.)



    Categories: Programming