Tuesday, April 21, 2009

First OpenCL demo

This demo of OpenCL on an NVIDIA GPU was based on an early, unreleased OpenCL API/driver interface. The demo ran an n-body simulation: massive particles moving under the influence of physical forces, a problem known to be computationally intensive. It shows how OpenCL can deliver high-performance computing using the parallel cores of the CUDA architecture on NVIDIA GPUs, and it illustrates the idea that the core computational code can be written once in OpenCL and will scale to whatever number of cores is available. The demo shown at Siggraph used the 32 parallel cores of the NVIDIA Quadro FX 570M GPU it ran on.

The first public demonstration of OpenCL running on a GPU was done by NVIDIA on December 12, 2008 at Siggraph Asia.


Monday, April 20, 2009

OpenCL and CUDA

NVIDIA's CUDA is, in my humble opinion, the easiest way to program GPUs today. This little article guides you through writing your first application for NVIDIA graphics processors. I assume you have a good knowledge of C programming.



Step 1: Get CUDA
Download the CUDA toolkit and SDK from NVIDIA

Step 2: Compile all the projects in the ~/NVIDIA_CUDA_SDK/projects directory
run make

Step 3: Test the installation
execute deviceQuery

You should get output like this:

There is 1 device supporting CUDA

Device 0: "Tesla C870"
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 1610350592 bytes
Number of multiprocessors: 16
Number of cores: 128
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.35 GHz
Concurrent copy and execution: No

Test PASSED


Step 4: Writing your code
  • you write code for the CPU (the host program) and code for the GPU (the kernel)
  • write a file named "firstcuda.cu", for example:

#include <stdio.h>
#include <cuda.h>

__global__ void square_array(float *a, int N)
{
    // one thread per array element
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

  • above is the device part; now write the host (CPU) part.
// main routine that executes on the host
int main(void)
{
    float *a_h, *a_d;  // pointers to host & device arrays
    const int N = 10;  // number of elements in arrays
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);       // allocate array on host
    cudaMalloc((void **) &a_d, size);  // allocate array on device

    // Initialize host array and copy it to CUDA device
    for (int i = 0; i < N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

    // Do calculation on device:
    int block_size = 3;
    int n_blocks = N/block_size + (N % block_size == 0 ? 0 : 1);
    square_array <<< n_blocks, block_size >>> (a_d, N);

    // Retrieve result from device and store it in host array
    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

    // Print results
    for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);

    // Cleanup
    free(a_h); cudaFree(a_d);
    return 0;
}


Step 5: Compile and execute
$ nvcc -o firstcuda firstcuda.cu
$ ./firstcuda


0 0.000000
1 1.000000
2 4.000000
3 9.000000
4 16.000000
5 25.000000
6 36.000000
7 49.000000
8 64.000000
9 81.000000


And about OpenCL?
OpenCL is under the responsibility of the Khronos Group and was initially developed by Apple. Its purpose is to be a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. NVIDIA promises GPU support.

A small conversion of the CUDA code above to OpenCL is presented below:


__kernel void // kernel code
square_array(const int N,  // scalar args take no address-space qualifier
             __global float *a)
{
    // work-item index (the CUDA thread index)
    int idx = get_global_id(0);
    if (idx < N) a[idx] = a[idx] * a[idx];
}



The host program creates the OpenCL context, command queue, memory objects, and kernel, then launches the kernel and reads back the result.



#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

// kernel source as a string (the __kernel code above)
const char *program_source =
"__kernel void square_array(const int N, __global float *a) \n"
"{                                                          \n"
"    int idx = get_global_id(0);                            \n"
"    if (idx < N) a[idx] = a[idx]*a[idx];                   \n"
"}                                                          \n";

// main routine that executes on the host
int main(void)
{
    float *a_h;        // pointer to host array
    const int N = 10;  // number of elements in arrays
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);  // allocate array on host

    // Initialize host array
    for (int i = 0; i < N; i++) a_h[i] = (float)i;

    // create the OpenCL context on a GPU device
    cl_context context = clCreateContextFromType(
        0,                  // context properties (must be 0 here)
        CL_DEVICE_TYPE_GPU,
        NULL,               // error callback
        NULL,               // user data
        NULL);              // error code

    // get the list of GPU devices associated with the context
    size_t cb;
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
    cl_device_id *devices = malloc(cb);
    clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

    // create a command-queue
    cl_command_queue cmd_queue = clCreateCommandQueue(context,
        devices[0],
        0,      // default properties
        NULL);  // error code

    cl_mem memobjs[1];
    // allocate the buffer memory object, initialized from the host array
    // (READ_WRITE because the kernel writes the result back into it)
    memobjs[0] = clCreateBuffer(context,
        CL_MEM_READ_WRITE |  // flags
        CL_MEM_COPY_HOST_PTR,
        sizeof(cl_float)*N,  // size
        a_h,                 // host pointer
        NULL);               // error code

    // create the program
    cl_program program = clCreateProgramWithSource(context,
        1,                // string count
        &program_source,  // program strings
        NULL,             // string lengths (NULL-terminated)
        NULL);            // error code
    // build the program
    cl_int err = clBuildProgram(program,
        0,      // num devices in device list
        NULL,   // device list
        NULL,   // options
        NULL,   // notifier callback function ptr
        NULL);  // user data
    // create the kernel
    cl_kernel kernel = clCreateKernel(program, "square_array", NULL);

    // set the kernel arguments: N first, then the "a" buffer
    err = clSetKernelArg(kernel,
        0,                    // argument index
        sizeof(cl_int),       // argument data size
        (void *)&N);          // argument data
    err = clSetKernelArg(kernel,
        1,                    // argument index
        sizeof(cl_mem),       // argument data size
        (void *)&memobjs[0]); // argument data

    size_t global_work_size[1] = { N };  // set work-item dimensions
    // execute kernel
    err = clEnqueueNDRangeKernel(cmd_queue, kernel,
        1,     // work dimensions
        NULL,  // must be NULL (work offset)
        global_work_size,
        NULL,  // automatic local work size
        0,     // no events to wait on
        NULL,  // event list
        NULL); // event for this kernel

    // read output array
    err = clEnqueueReadBuffer(cmd_queue, memobjs[0],
        CL_TRUE,             // blocking
        0,                   // offset
        N*sizeof(cl_float),  // size
        a_h,                 // host pointer
        0, NULL, NULL);      // events

    // Print results
    for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);

    // Cleanup
    clReleaseMemObject(memobjs[0]);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(cmd_queue);
    clReleaseContext(context);
    free(devices);
    free(a_h);
    return 0;
}




So, that's it. We don't have a publicly available OpenCL compiler yet, but it should arrive soon.

Paper about Conjugate Gradient

I submitted a paper to Compumag 2009. The paper presents the main aspects of the conjugate gradient iterative method on the GPU.


Paper (pdf)

Friday, April 17, 2009

Sparse Matrix Vector Product

This week, I gave a presentation explaining some sparse matrix-vector product algorithms using CUDA.

PDF presentation.