Tuesday, December 8, 2009

OpenCL Kernel for Scalar Product with Atomic Operations

The final sum of the dotproduct example is implemented on CPU. This is a solution of Scalar Product (DotProduct) without final reduction on the host side. This example uses atomic operations.

* sDOT OpenCL Kernel Function for Level 1 BLAS Dot Product dot<-xy * Author; Wendell Rodrigues
* INRIA-Lille :: DaRT Team
__kernel void sDOT(
__global const unsigned int N,
__global const float* X,
__global const float* Y,
__global float* DOT,
__global int* FLAG,
__local float* sdata
// get index into global data array
unsigned int tid = get_local_id(0);
unsigned int i = get_global_id(0);

sdata[tid] = (i<N) ? X[i]*Y[i] : 0;

if (i==0) {


// do reduction in shared mem
for(unsigned int s=1; s < get_local_size(0); s *= 2)
int index = 2 * s * tid;

if (index < get_local_size(0))
sdata[index] += sdata[index + s];


// write result for this block to global mem
if (tid == 0) {
while (atom_cmpxchg(FLAG,0,1)==1);
DOT[0] += sdata[0];


Tuesday, November 10, 2009

Modeling Challenges on OpenCL Code Generation

This week, I presented an overview of the integration of OpenCL and Gaspard2. In order to overcome the many challenges of model conception and transformations, we need study a new MoC (other than Array-OL), coalescent memory allocation and task distribute. The slides have two good examples of OpenCL applications and some questions about model conception and code generation.


OpenWF - The Standard for building composited windowing systems

From Khronos Group:

Embedded devices are increasingly expected to offer sophisticated user interfaces that combine rich graphics with multimedia content. Graphics and display hardware technologies have evolved to achieve these visuals with significantly higher efficiency than traditional CPUs, delivering greater performance, decreasing memory bandwidth usage and increasing battery life. Making use of this variety of hardware introduces fragmentation as software needs to be adapted to each hardware configuration.

A platform’s Hardware Abstraction Layer (HAL) for display and graphics technology allows the applications and middleware layers above to be deployed across a range of hardware without costly porting activities. OpenGL is an example of a graphics HAL that allows portable software to take advantage of a wide range of 3D hardware accelerators.

Windowing systems allow screens to be shared by multiple applications, ensuring that the graphics provided for each application’s window is sensibly merged onto the screen. This requires the graphics and display drivers to respect the intentions of the windowing system, which commonly means considerable OS-specific porting work on the part of the device manufacturer when moving to new hardware.

The OpenWF APIs provide an OS-independent and hardware-neutral foundation for building compositing systems, particularly suited to implementing windowing systems. OpenWF acts as a HAL to achieve composition of content and configuration of display devices. The interfaces are designed for use by a single user which could be a central windowing system or, in an application-specific system, may be the application itself.


Friday, October 23, 2009

Conjugate Gradient and OpenCL

I've just finished a conjugate gradient implementation for OpenCL. It has not performance yet, but I'm working on this to fix the bugs and/or optimize the code.

Here you are a PDF that makes an overview on the subject.

Friday, October 16, 2009

Nvidia's Next Generation: Fermi - key architectural highlights

Third Generation Streaming Multiprocessor (SM)
  • 32 CUDA cores per SM, 4x over GT200
  • 8x the peak double precision floating point performance over GT200
  • Dual Warp Scheduler simultaneously schedules and dispatches instructions from two independent warps
  • 64 KB of RAM with a configurable partitioning of shared memory and L1 cache

Second Generation Parallel Thread Execution ISA
  • Unified Address Space with Full C++ Support
  • Optimized for OpenCL and DirectCompute
  • Full IEEE 754-2008 32-bit and 64-bit precision
  • Full 32-bit integer path with 64-bit extensions
  • Memory access instructions to support transition to 64-bit addressing
  • Improved Performance through Predication

Improved Memory Subsystem
  • NVIDIA Parallel DataCache™ hierarchy with Configurable L1 and Unified L2
  • Caches
  • First GPU with ECC memory support
  • Greatly improved atomic memory operation performance

NVIDIA GigaThread™ Engine
  • 10x faster application context switching
  • Concurrent kernel execution
  • Out of Order thread block execution
  • Dual overlapped memory transfer engines
more information: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Thursday, October 15, 2009

ATI Stream Software Development Kit (SDK) v2.0 Beta Program

What’s New in v2.0-beta4

  • First beta release of ATI Stream SDK with OpenCL™ GPU support.
  • ATI Stream SDK v2.0 OpenCL™ is certified OpenCL™ 1.0 conformant by Khronos.
  • Added Microsoft® Windows® 7 support.
  • Added native Microsoft® Windows® 64-bit support.
  • Float comparisons in kernels no longer produce a runtime error.
  • Various other issues from previous v2.0 beta releases have been resolved.
More information: http://developer.amd.com/GPU/ATISTREAMSDKBETAPROGRAM/Pages/default.aspx

Thursday, October 8, 2009

OpenCL BLAS - Makefile for MAC

Thanks to Mario Rometsch for a version of OpenCL BLAS Makefile for MacOS. You can download it on the SourceForge.

OpenCL BLAS Makefile for MacOS

Monday, September 21, 2009

BLAS Library for OpenCL

I use the conjugate gradient solver without preconditioners to solve a linear system Ax=b, where A is a sparse matrix. This method is iterative and uses some BLAS functions like Dot Product, Scalar Product, xAXPY and xGEMV (SpMV for sparse matrix).I've started to develop these functions for the OpenCL language and I've decided to share them.

Right now, the following BLAS level 1 functions are available:
sDOT :: single precision dot product or scalar product (dot<-xy)
:: single precision vector 2-norm
sSCAL :: single precision product of vector by scalar (x<-ax)
:: single precision AXPY (y<-ax + y) You can download the OpenCL code which was tested on NVIDIA Tesla C870 and GPU Computing SDK 2.3

SourceForge Project

Please join up with your contribution!

Update: OpenCL BLAS now is a discontinued project.

Monday, September 14, 2009

ATI Stream Software Development Kit (SDK) v2.0 Beta Program With OpenCL™ 1.0 Support

With ATI Stream SDK, AMD/ATI provides a way to program OpenCL on its cards. I didn't download it yet, but you can get more information on:


and on the OpenCL/ATI forum:

I'm going to test it and I will post here an overview.

Friday, September 11, 2009

GPU and Matlab

If you like programming Matlab-like environment, I suggest the freeware GPUMat from GP-you Group. You can explore the power of GPUs, BLAS and FFT libraries on NVIDIA cards. You can get more information on:
GP-you Group

Thursday, August 27, 2009

Old Tesla C870

I have two cards in my desktop computer. Unfortunately, to execute OpenCL code it needs a 185.18.08 driver version. My video card is a Quadro NVS 280 and its driver is a 173 version. I've installed the 185 version and I needed execute this script:

modprobe nvidia

if [ "$?" -eq 0 ]; then

# Count the number of NVIDIA controllers found.
N3D=`/sbin/lspci | grep -i NVIDIA | grep "3D controller" | wc -l`
NVGA=`/sbin/lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l`

N=`expr $N3D + $NVGA - 1`
for i in `seq 0 $N`; do
mknod -m 666 /dev/nvidia$i c 195 $i;

mknod -m 666 /dev/nvidiactl c 195 255

exit 1
However my video card doesn't work. I will post a solution, if I find it.

Tuesday, June 9, 2009

June, 4. Journée Jeunes Chercheurs sur les Multiprocesseurs et Multicoeurs (Overview)

Developing on GPU is a "hot" theme in Parallel Programming World. Here, I show you the main topics on this subject presented on Young Researchers on Multiprocessors and Multicores Journey in June, 4 at Paris.

  1. Sylvain Contassot-Vivier,"Iterative Asynchronous Algorithms on GPU Cluster"
  2. Thomas Jost, "Adaptation of Iterative Asynchronous Algorithms on GPU Cluster"
  3. Matthieu Ospici, "GPU Exploring and Sharing on Clusters of Hybrid Computation"
  4. Florent Calvayrac, "Precision and Performance Comparative on GPU Cluster for Different Algorithms for Physical-Chemical Numerical Computation"

Iterative Asynchronous Algorithms on GPU Cluster
Mr. Contassot-Vivier spoke about GPU Cluster and Asynchronous Algorithms. The GPELEC cluster is a 16 node cluster of GPUs and designed for computer science experimentation. It has been granted and bought by SUPÉLEC. Each node is a PC hosting a dual-core CPU and a GPU card: a nVIDIA GeForce 8800 GT, with 512MiB of RAM (on the GPU card). The 16 nodes are interconnected across a devoted Gigabit Ethernet switch. An Infiniband network is also available on half of the GPELEC cluster(on 8 nodes). Some Wattmeters have been installed on the GPELEC cluster (nodes and switches) in order to measure and analyse the energetic consumption, function of the computations run. Development environment available on GPELEC are mainly the gcc suite and its OpenMP library, OpenMPI and the CUDA environment of nVIDIA (nvcc compiler).

The objective of GPELEC platform was to quickly provide an experimental GPU cluster to researchers of SUPÉLEC and AlGorille in order to experiment scientific programming on GPU ("GPGPU"), and to track computing and energetic performances. In 2008 GPELEC has allowed to experiment the compatibility of MPI and CUDA frameworks, and to develop some fast Monte-Carlo simulations for an option pricing problem. Others developments and experimentations are planned in 2009 in collaboration with EDF researchers, and with our colleagues from CERMICS and MathFi INRIA team.

Adaptation of Iterative Asynchronous Algorithms on GPU Cluster
Their experiments are binded to the performance increase making better memory access. They got high performance using cache and memory alignments. They said that it was obtained a gain compared to CNC.

Precision and Performance Comparative on GPU Cluster for Different Algorithms for Physical-Chemical Numerical Computation
The great part of methods to compute on GPU is based on Direct and Iterative Methods to solve Linear Equation Systems. This speech treats of benchmark in already known methods.

If you want more information, contact me or the authors directly.

Monday, June 1, 2009

Overview of OpenCL and Code Generation

My research project comprehends the code generation for heterogenous parallel platforms. More precisely, GPU architectures. This is a little presentation that treats of OpenCL and its aspects and how to model data and task parallelism to generate a optimized code.


Wednesday, May 13, 2009

OpenCL 1.0 Conformance Candidate Release

We are pleased to announce the release of our OpenCL 1.0 Conformance Candidate to GPU Computing registered developers. You now have access to the OpenCL drivers we submitted this week to the Khronos OpenCL working group.

The release also includes several OpenCL SDK code samples and additional documentation to help you get started programming with OpenCL.

Please submit bug reports (and feature/extension requests) using the "Bug Report" link in top left when your are logged in. You may also ask questions and discuss this release and other OpenCL-related topics in the OpenCL developer forums, here:


It is recommended that you follow the installation instructions in the Release Notes for your platform. A driver update may be required, as noted in the release notes.

Please review the release notes carefully after installation, as this will allow for a much smoother introduction to the release. While this release can be used on a wide variety of NVIDIA products, only a subset were tested for this release.

High demand for these files may temporarily overwhelm our servers, so please be patient and try again tomorrow if your download does not successfully complete.

NOTICE: This release is made available to you under the terms and conditions of the end user license agreement (EULA) distributed with this release. If you do not accept the EULA, you do not have rights to use the files included in this release and must delete all copies of all files associated with this release immediately.

Source: Nvidia Online Update

Tuesday, April 21, 2009

First OpenCL demo

This demo of OpenCL on NVIDIA GPU was based on early non-released OpenCL API/driver interface. The nbody simulation shown was a simulation of massive particles under the influence of physical forces. Nbody simulation is known to be computationally intensive. The demo shows how OpenCL can deliver high performance computing using the parallel cores of the CUDA architecture on NVIDIA GPUs. This demo also illustrates the idea that core computational code can be written in OpenCL and will scale to whatever number of cores are available. The demo shown in Siggraph used the 32 parallel cores available on the NVIDIA Quadro FX 570M GPU that it ran on.

The first public demonstration of OpenCL running on a GPU was done by NVIDIA on December 12, 2008 at Siggraph Asia.


Monday, April 20, 2009

OpenCL and CUDA

NVIDIA's CUDA is, in my humble opinion, the most easy way to program GPU today. This little article guides you to write your first application on graphics processors from NVIDIA. I suppose that you have a good knowledge in C programming.

Step 1: Get CUDA
Download API

Step 2: Compile all the projects in ~/NVIDIA_CUDA_SDK/projects directory
do a make

Step 3: For testing only
execute deviceQuery

you can get a result like this:

There is 1 device supporting CUDA

Device 0: "Tesla C870"
Major revision number: 1
Minor revision number: 0
Total amount of global memory: 1610350592 bytes
Number of multiprocessors: 16
Number of cores: 128
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.35 GHz
Concurrent copy and execution: No


Step 4: Writing your code
  • you should make a code for CPU(host program) and a code for GPU (kernel)
  • write a file named "firstcuda.cu", for example:

#include <stdio.h>
#include <cuda.h>

__global__ square_array(float *a, int N)

idx = blockIdx.x * blockDim.x + threadIdx.x;
(idx < N) a[idx]=a[idx]*a[idx];

  • above is the device part, now you should write the host CPU part.
// main routine that executes on the host
int main(void)

*a_h, *a_d; // Pointer to host & device arrays
const int N = 10; // Number of elements in arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block_size = 3;
n_blocks = N/block_size + (N % block_size == 0 ? 0:1);
square_array <<< n_blocks, block_size >>> (a_d, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
free(a_h); cudaFree(a_d);

Step 5: Compile and execute
$ nvcc -o firstcuda firstcuda.cu
$ ./firstcuda

0 0.000000
1 1.000000
2 4.000000
3 9.000000
4 16.000000
5 25.000000
6 36.000000
7 49.000000
8 64.000000
9 81.000000

And about OpenCL?
OpenCL is under Khronos Technology responsability and it was initially developed by Apple and its purpose is to be is a framework for writing programs that execute across heterogeneous platforms consisting of CPU, GPU and other processors. NVIDIA promises GPU compatibility.

A little conversion from CUDA code to OpenCL is presented below:

__kernel void //kernel code
square_array(__global const int N,
__global float * a)
// work item index, cuda thread index
int idx = get_global_id(0);
if (idx < N) a[idx] = a[idx]*a[idx];

The host program contains a collection of compute kernels and internal functions.

#include <stdio.h>
#include <cl.h>

// main routine that executes on the host
int main(void)
float *a_h; // Pointer to host & device arrays
const int N = 10; // Number of elements in arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size); // Allocate array on host

// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;

// create the OpenCL context on a GPU device
cl_context context = clCreateContextFromType(0, // (must be 0)
CL_DEVICE_TYPE_ GPU, NULL, // error callback
NULL, // user data
NULL); // error code

// get the list of GPU devices associated with context
size_t cb;
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
cl_device_id *devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

// create a command-queue
cl_cmd_queue cmd_queue = clCreateCommandQueue(context,
devices[0], 0, // default options
NULL); // error code

cl_mem memobjs[1];
// allocate input buffer memory objects
memobjs[0] = clCreateBuffer(context,
CL_MEM_READ_ONLY | // flags
sizeof(cl_float)*n, // size
a_h, // host pointer
NULL); // error code

// create the program
cl_program program = clCreateProgramWithSource(
1, // string count
&program_source, // program strings
NULL, // string lengths
NULL); // error code
// build the program
cl_int err = clBuildProgram(program,
0, // num devices in device list
NULL, // device list
NULL, // options
NULL, // notifier callback function ptr
NULL); // error code
// create the kernel
cl_kernel kernel = clCreateKernel(program, “square_array”, NULL);

// set “a” vector argument
err = clSetKernelArg(kernel,
0, // argument index
(void *)&memobjs[0], // argument data
sizeof(cl_mem)); // argument data size

size_t global_work_size[1] = n; // set work-item dimensions
// execute kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel,
1, // Work dimensions
NULL, // must be NULL (work offset)
NULL, // automatic local work size
0, // no events to wait on
NULL, // event list
NULL); // event for this kernel

// read output array
err = clEnqueueReadBuffer( context, memobjs[0],
CL_TRUE, // blocking
0, // offset
n*sizeof(cl_float), // size
a_h, // pointer
0, NULL, NULL); // events

// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);

// Cleanup

So, that is it. We don't have any OpenCL compiler yet, but it will arrive soon.

Paper about Conjugate Gradient

I submitted a paper to Compumag 2009. In this paper, it is shown the main aspects of the conjugate gradient iterative method on GPU.

Paper (pdf)

Friday, April 17, 2009

Sparse Matrix Vector Product

This week, I made a presentation which explained some Sparse Matrix Vector product algorithms using CUDA.

PDF presentation.