Right now, the following BLAS level 1 functions are available:

sDOT :: single precision dot product or scalar product (dot<-xy)

sNRM2 :: single precision vector 2-norm

sSCAL :: single precision product of vector by scalar (x<-ax)

sAXPY :: single precision AXPY (y<-ax + y) You can download the OpenCL code which was tested on NVIDIA Tesla C870 and GPU Computing SDK 2.3

SourceForge Project

Please join up with your contribution!

Update: OpenCL BLAS now is a discontinued project.

## 12 comments:

Thank you, code is great!

To even increase performance, I suggest replace blockSize testing in OpenCL code with preprocessor tests?

Good idea. I will make the changes. ;) Thanks.

clBuildProgram has a handy argument "options" where you can pass the preprocessor defines, instead of doing string manipulation on code.

Do you plan on writing/releasing the whole sparse matrix CG solver code?

Hi, I've CUDA CG code that works...if you want it...I send it to u.

Ehm. Does this even work? Tried running it on two vectors with 75 elements in each (75...1) and did not get the result I expected.

Sorry about this...It's a beta code. There are problems that I need to fix (you can help me to fix it, if you want). How many threads and what is the size of your grid? I've troubles with the parallel reduction in dot product and norm. I've one OpenCL Conjugate Gradient version where these bugs were fixed. But its performance is slow and the final reduction is made on the host side.

Had a look at the source and it's clear that the dot product won't work as it is since all thread blocks write the local result to the same location in global memory (DOT[0]). I believe some form of loop would be required that reduces the number of work blocks as the calculation progresses.

Unfortunately, I don't have time too look more into this at the moment. Do you have a timeline for the development of this library?

the syncro between two threads from different groups is impossible. the code is correct, but it's missing the final sum in the host side (and the threads and grid size contraints). I'll fix it next week and I'll add more functions as well.

Hi,

we have a library of iterative solvers written in CUDA.

The interesting feature is that it has some preconditioners implemented.

can it be useful for your effort? Porting it to openCL would be very interesting also for us!

well ... let me know if it may be interesting

my e-mail is:

rrossi@cimne.upc.edu

greetings from spain

Riccardo

Hi,

I already wrote you an email.

I'm also planning to write an opencl cg-solver and would like to ask you if you could send me your code.

Perhaps I can help to test and develop it.

Greeting from germany,

Jens. (trommler@temf.tu-darmstadt.de)

Hey, crossed your blogg - check out Libra SDK, can be downloaded from a company called GPU Systems.

They support BLAS(dense&sparse), fft, complex numbers, single&double precision etc...some cool building blocks for advanced algorithms&solvers...

Its basically matlab in a C++ environment for CPUs & GPUs...very cool stuff...

Hi I have a sparse linear system with dimension of more than 1 million (Laplacian matrix). I need a fast solver to solve this. Which solver can I use?

Post a Comment