CUDA learning(2) – a simple parallel CUDA program

Now we use a function add(), with this code:
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

add() runs on the device, so a, b, and c must point to device memory.

But we can allocate memory on the GPU from host code.

We use cudaMalloc(), cudaFree(), and cudaMemcpy() to handle device memory, analogous to malloc(), free(), and memcpy() in ordinary C.

Now comes a simple program:

#include <stdio.h>

__global__ void add(int *a, int *b, int *c)
{
    *c = *a + *b;
}

int main(void)
{
    int a, b, c;             // host copies of a, b, c
    int *d_a, *d_b, *d_c;    // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Set up input values
    a = 2;
    b = 7;

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    printf("%d + %d = %d\n", a, b, c);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
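One caveat worth adding here (the slides cover "Handling errors" later): CUDA runtime calls return a cudaError_t, and a failed kernel launch is easy to miss because the launch is asynchronous. A minimal sketch of checking the launch above, using the standard cudaGetLastError() and cudaGetErrorString() runtime calls:

```cuda
// Right after the launch:
add<<<1,1>>>(d_a, d_b, d_c);
cudaError_t err = cudaGetLastError();   // error from the launch itself
if (err != cudaSuccess)
    fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
```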

So how do we run code in parallel on the device?

Change “add<<< 1, 1 >>>();” to “add<<< N, 1 >>>();”. Instead of executing add() once, this executes it N times in parallel; here N is the number of blocks.

Vector Addition on the Device

Terminology: each parallel invocation of add() is referred to as a block. Each invocation can refer to its block index using blockIdx.x.

Then we change the add() function:

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

 

So a, b, and c become three arrays, and something has to change in main() as well.

We can use a helper function random_ints() to fill them.
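Putting it together, the vector-addition version of the whole program might look like this (a sketch: random_ints() is not part of CUDA or the C library, so a minimal stand-in using rand() is defined here as an assumption):

```cuda
#include <stdio.h>
#include <stdlib.h>

#define N 512

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

/* assumed helper: fill an array with random ints */
void random_ints(int *x, int n) {
    for (int i = 0; i < n; i++)
        x[i] = rand() % 100;
}

int main(void) {
    int *a, *b, *c;          // host copies of a, b, c
    int *d_a, *d_b, *d_c;    // device copies of a, b, c
    int size = N * sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Allocate host copies and set up input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    printf("c[0] = %d + %d = %d\n", a[0], b[0], c[0]);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```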

CUDA learning(1) – start from hello world

Learning from http://www.nvidia.com/docs/io/116711/sc11-cuda-c-basics.pdf

First come some concepts:

  • Heterogeneous Computing
  • Blocks
  • Threads
  • Indexing
  • Shared memory
  • __syncthreads()
  • Asynchronous operation
  • Handling errors
  • Managing devices

Now we have to understand what “Heterogeneous Computing” is. From page 8 of the slides, we can see that it involves device code and host code, parallel code and serial code. The next slides make it more intuitive: data moves between CPU and GPU memory, and the parallel code is executed on the GPU.

 

(ps: GigaThread™: this engine provides up to 10x faster context switching compared to previous-generation architectures, concurrent kernel execution, and improved thread block scheduling.)

So we add the device code, and then the “hello world” program looks like this:

#include <stdio.h> // do not forget this!

__global__ void mykernel(void) {
}

int main(void)
{
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}
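Assuming we save this as hello.cu (the filename is our choice), it can be compiled and run with nvcc like this:

```shell
nvcc hello.cu -o hello   # nvcc compiles both host and device code
./hello                  # prints: Hello World!
```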

__global__ is a CUDA C/C++ keyword that indicates a function:

  • Runs on the device
  • Is called from host code

And nvcc separates source code into host and device components (nvcc is NVIDIA’s CUDA compiler driver).

Device functions (e.g. mykernel()) are processed by the NVIDIA compiler, and host functions (e.g. main()) are processed by the standard host compiler.

(ps: for Arch Linux users, you can install the NVIDIA toolkit with “pacman -S cuda”,

and you may have to restart before nvcc is available)

The function mykernel() does nothing here, and we’ll explain what “<<<1,1>>>” does in a moment.

Now I have to get back to the parallelism learning I left off before.

Reference book: CSAPP

(ps: found nothing worth writing about for now; maybe I’ll add something in subsequent sections)