Monday, September 21, 2009

BLAS Library for OpenCL

I use the conjugate gradient (CG) solver, without preconditioners, to solve a linear system Ax=b, where A is a sparse matrix. This iterative method relies on a few BLAS functions: dot product, scalar product, xAXPY, and xGEMV (SpMV for sparse matrices). I have started to implement these functions in OpenCL and have decided to share them.

Right now, the following BLAS level 1 functions are available:
sDOT :: single precision dot product or scalar product (dot<-xy)
sNRM2 :: single precision vector 2-norm
sSCAL :: single precision product of vector by scalar (x<-ax)
sAXPY :: single precision AXPY (y<-ax + y)

You can download the OpenCL code, which was tested on an NVIDIA Tesla C870 with GPU Computing SDK 2.3.

SourceForge Project

Contributions are welcome; please join the project!

Update: OpenCL BLAS is now a discontinued project.


Anonymous said...

Thank you, the code is great!
To increase performance further, may I suggest replacing the blockSize tests in the OpenCL code with preprocessor tests?

Wendell Rodrigues said...

Good idea. I will make the changes. ;) Thanks.

Anonymous said...

clBuildProgram has a handy argument "options" where you can pass the preprocessor defines, instead of doing string manipulation on code.

Do you plan on writing/releasing the whole sparse matrix CG solver code?

Wendell Rodrigues said...

Hi, I have CUDA CG code that works. If you want it, I'll send it to you.

Martin said...

Ehm. Does this even work? I tried running it on two vectors with 75 elements each (75...1) and did not get the result I expected.

Wendell Rodrigues said...

Sorry about this. It's beta code, and there are problems I still need to fix (you can help me fix them, if you want). How many threads are you using, and what is the size of your grid? I'm having trouble with the parallel reduction in the dot product and the norm. I have one OpenCL conjugate gradient version where these bugs are fixed, but its performance is slow and the final reduction is done on the host side.

Martin said...

Had a look at the source and it's clear that the dot product won't work as it is since all thread blocks write the local result to the same location in global memory (DOT[0]). I believe some form of loop would be required that reduces the number of work blocks as the calculation progresses.

Unfortunately, I don't have time to look more into this at the moment. Do you have a timeline for the development of this library?

Wendell Rodrigues said...

Synchronization between two threads from different work-groups is impossible. The code is correct, but it's missing the final sum on the host side (and the thread and grid size constraints). I'll fix it next week and add more functions as well.

Anonymous said...

We have a library of iterative solvers written in CUDA.

The interesting feature is that it has some preconditioners implemented.

Could it be useful for your effort? Porting it to OpenCL would be very interesting for us as well!

Well... let me know if it may be interesting.

my e-mail is:

Greetings from Spain

der said...


I already wrote you an email.
I'm also planning to write an OpenCL CG solver and would like to ask if you could send me your code.
Perhaps I can help test and develop it.

Greetings from Germany,
Jens.

Anonymous said...

Hey, I came across your blog. Check out the Libra SDK, which can be downloaded from a company called GPU Systems.

They support BLAS (dense & sparse), FFT, complex numbers, single & double precision, etc. Some cool building blocks for advanced algorithms and solvers...

It's basically MATLAB in a C++ environment for CPUs & GPUs. Very cool stuff...

Anonymous said...

Hi, I have a sparse linear system with a dimension of more than 1 million (a Laplacian matrix). I need a fast solver for this. Which solver can I use?