Monday, April 20, 2009

OpenCL and CUDA

NVIDIA's CUDA is, in my humble opinion, the easiest way to program a GPU today. This little article guides you through writing your first application for NVIDIA graphics processors. I assume that you have a good knowledge of C programming.



Step 1: Get CUDA
Download the CUDA API (toolkit and SDK) from NVIDIA

Step 2: Compile all the projects in the ~/NVIDIA_CUDA_SDK/projects directory
Run a make from the SDK directory

Step 3: Test the installation
Execute the deviceQuery binary built in the previous step.

You should get output like this:

There is 1 device supporting CUDA

Device 0: "Tesla C870"
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 1610350592 bytes
Number of multiprocessors: 16
Number of cores: 128
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.35 GHz
Concurrent copy and execution: No

Test PASSED
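
The deviceQuery sample simply reads these values through the CUDA runtime API. As a minimal sketch of how you could query the same information yourself (this is my own example, not the SDK source):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);                    // how many CUDA devices are present
    printf("There are %d devices supporting CUDA\n", count);

    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);         // fill in the properties of device i
        printf("Device %d: \"%s\"\n", i, prop.name);
        printf("  Total amount of global memory: %lu bytes\n", (unsigned long)prop.totalGlobalMem);
        printf("  Number of multiprocessors: %d\n", prop.multiProcessorCount);
        printf("  Shared memory per block: %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);
        printf("  Warp size: %d\n", prop.warpSize);
        printf("  Maximum threads per block: %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}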


Step 4: Write your code
  • You need to write code for the CPU (the host program) and code for the GPU (the kernel).
  • Write a file named "firstcuda.cu", for example:

#include <stdio.h>
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

  • Above is the device part; now you should write the host (CPU) part.
// main routine that executes on the host
int main(void)
{
    float *a_h, *a_d;   // Pointers to host & device arrays
    const int N = 10;   // Number of elements in arrays
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);        // Allocate array on host
    cudaMalloc((void **) &a_d, size);   // Allocate array on device
    // Initialize host array and copy it to CUDA device
    for (int i=0; i<N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    // Do calculation on device:
    int block_size = 3;
    int n_blocks = N/block_size + (N % block_size == 0 ? 0 : 1);
    square_array <<< n_blocks, block_size >>> (a_d, N);
    // Retrieve result from device and store it in host array
    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    // Print results
    for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
    // Cleanup
    free(a_h); cudaFree(a_d);
}


Step 5: Compile and execute
$ nvcc -o firstcuda firstcuda.cu
$ ./firstcuda


0 0.000000
1 1.000000
2 4.000000
3 9.000000
4 16.000000
5 25.000000
6 36.000000
7 49.000000
8 64.000000
9 81.000000
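
If the program prints only zeros instead, the kernel probably failed to launch (for example, because block_size exceeds the maximum threads per block reported by deviceQuery). A minimal check you can add right after the launch in firstcuda.cu (my addition, not part of the listing above):

// right after the kernel launch
square_array <<< n_blocks, block_size >>> (a_d, N);
cudaError_t err = cudaGetLastError();   // reports launch/configuration failures
if (err != cudaSuccess)
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));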


And about OpenCL?
OpenCL is now under the responsibility of the Khronos Group. It was initially developed by Apple, and its purpose is to be a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. NVIDIA promises GPU compatibility.

A little conversion of the CUDA code above to OpenCL is presented below. In OpenCL terms, a CUDA thread corresponds to a work-item and a thread block to a work-group:


// kernel code
__kernel void square_array(__global float *a, const int N)
{
    // work-item index: the equivalent of the CUDA thread index
    int idx = get_global_id(0);
    if (idx < N) a[idx] = a[idx] * a[idx];
}



The host program builds a program object that contains the compute kernels and any internal functions; it also creates the context, command queue, and memory objects, and launches the kernel.
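
The host code below passes a string named program_source to clCreateProgramWithSource but never defines it. A minimal approach (my own addition, assumed for this example) is to embed the kernel source as a global C string, placed right after the includes in the host file:

// OpenCL kernel source embedded in the host program as a string
// (assumed helper for this example; it mirrors the kernel shown above)
const char *program_source =
    "__kernel void square_array(__global float *a, const int N)  \n"
    "{                                                           \n"
    "    int idx = get_global_id(0);                             \n"
    "    if (idx < N) a[idx] = a[idx] * a[idx];                  \n"
    "}                                                           \n";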



#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

// main routine that executes on the host
int main(void)
{
    float *a_h;          // Pointer to host array
    const int N = 10;    // Number of elements in arrays
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);   // Allocate array on host

    // Initialize host array; it will be copied to the OpenCL device buffer
    for (int i=0; i<N; i++) a_h[i] = (float)i;

    // create the OpenCL context on a GPU device
    cl_context context = clCreateContextFromType(NULL, // context properties
        CL_DEVICE_TYPE_GPU,
        NULL,    // error callback
        NULL,    // user data
        NULL);   // error code

    // get the list of GPU devices associated with the context
    size_t cb;
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
    cl_device_id *devices = (cl_device_id *)malloc(cb);
    clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

    // create a command-queue
    cl_command_queue cmd_queue = clCreateCommandQueue(context,
        devices[0],
        0,       // default properties
        NULL);   // error code

    // allocate the buffer memory object, initialized from the host array
    cl_mem memobjs[1];
    memobjs[0] = clCreateBuffer(context,
        CL_MEM_READ_WRITE |     // the kernel both reads and writes the buffer
        CL_MEM_COPY_HOST_PTR,   // copy a_h into the buffer at creation
        sizeof(cl_float)*N,     // size
        a_h,                    // host pointer
        NULL);                  // error code

    // create the program from the kernel source string
    cl_program program = clCreateProgramWithSource(context,
        1,                  // string count
        &program_source,    // program strings (defined near the top of the file)
        NULL,               // string lengths (NULL-terminated)
        NULL);              // error code

    // build the program
    cl_int err = clBuildProgram(program,
        0,       // num devices in device list
        NULL,    // device list
        NULL,    // options
        NULL,    // notifier callback function ptr
        NULL);   // user data

    // create the kernel
    cl_kernel kernel = clCreateKernel(program, "square_array", NULL);

    // set the "a" vector argument
    err = clSetKernelArg(kernel,
        0,                     // argument index
        sizeof(cl_mem),        // argument data size
        (void *)&memobjs[0]);  // argument data

    // set the "N" argument
    err = clSetKernelArg(kernel, 1, sizeof(cl_int), (void *)&N);

    // set work-item dimensions
    size_t global_work_size[1] = { N };

    // execute the kernel
    err = clEnqueueNDRangeKernel(cmd_queue, kernel,
        1,       // work dimensions
        NULL,    // global work offset (must be NULL)
        global_work_size,
        NULL,    // automatic local work size
        0,       // no events to wait on
        NULL,    // event wait list
        NULL);   // event for this kernel

    // read the output array back to the host
    err = clEnqueueReadBuffer(cmd_queue, memobjs[0],
        CL_TRUE,             // blocking read
        0,                   // offset
        N*sizeof(cl_float),  // size
        a_h,                 // host pointer
        0, NULL, NULL);      // events

    // Print results
    for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);

    // Cleanup
    clReleaseMemObject(memobjs[0]);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(cmd_queue);
    clReleaseContext(context);
    free(devices);
    free(a_h);
    return 0;
}




So, that is it. We don't have any OpenCL compiler yet, but it will arrive soon.

2 comments:

Rudy CHiP said...

I am glad to find a blog that focuses on CUDA and OpenCL. I plan to continue my studies at the master's level, and my research area is image processing/computer vision-based porn filtering in video sequences. Besides detection accuracy, the processing must be fast (in real time). So, I intend to use CUDA or OpenCL, but I am a novice. Could you give me some advice about that, especially regarding system requirements?

Wendell Rodrigues said...

Well, ATI and NVIDIA already have APIs for developing with OpenCL. If you need more information on how to install and use these APIs, you can contact me.