Step 1: Get CUDA
Download API
Step 2: Compile all the projects in ~/NVIDIA_CUDA_SDK/projects directory
do a make
Step 3: For testing only
execute deviceQuery
you can get a result like this:
There is 1 device supporting CUDA
Device 0: "Tesla C870"
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 1610350592 bytes
Number of multiprocessors: 16
Number of cores: 128
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.35 GHz
Concurrent copy and execution: No
Test PASSED
Step 4: Writing your code
- you should make a code for CPU(host program) and a code for GPU (kernel)
- write a file named "firstcuda.cu", for example:
#include <stdio.h>
#include <cuda.h>
__global__ square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) a[idx]=a[idx]*a[idx];
}
- above is the device part, now you should write the host CPU part.
// main routine that executes on the host
int main(void)
{
float *a_h, *a_d; // Pointer to host & device arrays
const int N = 10; // Number of elements in arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block_size = 3;
int n_blocks = N/block_size + (N % block_size == 0 ? 0:1);
square_array <<< n_blocks, block_size >>> (a_d, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
free(a_h); cudaFree(a_d);
}
Step 5: Compile and execute
$ nvcc -o firstcuda firstcuda.cu
$ ./firstcuda
0 0.000000
1 1.000000
2 4.000000
3 9.000000
4 16.000000
5 25.000000
6 36.000000
7 49.000000
8 64.000000
9 81.000000
And about OpenCL?
OpenCL is under Khronos Technology responsability and it was initially developed by Apple and its purpose is to be is a framework for writing programs that execute across heterogeneous platforms consisting of CPU, GPU and other processors. NVIDIA promises GPU compatibility.
A little conversion from CUDA code to OpenCL is presented below:
__kernel void //kernel code
square_array(__global const int N,
__global float * a)
{
// work item index, cuda thread index
int idx = get_global_id(0);
if (idx < N) a[idx] = a[idx]*a[idx];
}
The host program contains a collection of compute kernels and internal functions.
#include <stdio.h>
#include <cl.h>
// main routine that executes on the host
int main(void)
{
float *a_h; // Pointer to host & device arrays
const int N = 10; // Number of elements in arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size); // Allocate array on host
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
// create the OpenCL context on a GPU device
cl_context context = clCreateContextFromType(0, // (must be 0)
CL_DEVICE_TYPE_ GPU, NULL, // error callback
NULL, // user data
NULL); // error code
// get the list of GPU devices associated with context
size_t cb;
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
cl_device_id *devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);
// create a command-queue
cl_cmd_queue cmd_queue = clCreateCommandQueue(context,
devices[0], 0, // default options
NULL); // error code
cl_mem memobjs[1];
// allocate input buffer memory objects
memobjs[0] = clCreateBuffer(context,
CL_MEM_READ_ONLY | // flags
CL_MEM_COPY_HOST_PTR,
sizeof(cl_float)*n, // size
a_h, // host pointer
NULL); // error code
// create the program
cl_program program = clCreateProgramWithSource(
context,
1, // string count
&program_source, // program strings
NULL, // string lengths
NULL); // error code
// build the program
cl_int err = clBuildProgram(program,
0, // num devices in device list
NULL, // device list
NULL, // options
NULL, // notifier callback function ptr
NULL); // error code
// create the kernel
cl_kernel kernel = clCreateKernel(program, “square_array”, NULL);
// set “a” vector argument
err = clSetKernelArg(kernel,
0, // argument index
(void *)&memobjs[0], // argument data
sizeof(cl_mem)); // argument data size
size_t global_work_size[1] = n; // set work-item dimensions
// execute kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel,
1, // Work dimensions
NULL, // must be NULL (work offset)
global_work_size,
NULL, // automatic local work size
0, // no events to wait on
NULL, // event list
NULL); // event for this kernel
// read output array
err = clEnqueueReadBuffer( context, memobjs[0],
CL_TRUE, // blocking
0, // offset
n*sizeof(cl_float), // size
a_h, // pointer
0, NULL, NULL); // events
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
free(a_h);
}
So, that is it. We don't have any OpenCL compiler yet, but it will arrive soon.
2 comments:
I am glad find out the blog which focus about CUDA and OpenCL. I plan to continue my study at master degree, and my research area is image processing/computer vision- based porn filtering in video sequence. Beside accuracy detection, the processing must be fast ( in real time ). So, I intend to use CUDA or OpenCL, but I am a novice. Could you give me some advice about that, especially in system requirements.
Well, ATI and Nvidia have already an API to develop OpenCL. If u need more information how to install and use these APIs, you can contact me.
Post a Comment