Dilbert

Tuesday, December 8, 2009

OpenCL Kernel for Scalar Product with Atomic Operations

The final sum of the dotproduct example is implemented on CPU. This is a solution of Scalar Product (DotProduct) without final reduction on the host side. This example uses atomic operations.


/*
* sDOT OpenCL Kernel Function for Level 1 BLAS Dot Product dot<-xy * Author; Wendell Rodrigues
* INRIA-Lille :: DaRT Team
*/
__kernel void sDOT(
__global const unsigned int N,
__global const float* X,
__global const float* Y,
__global float* DOT,
__global int* FLAG,
__local float* sdata
)
{
// get index into global data array
unsigned int tid = get_local_id(0);
unsigned int i = get_global_id(0);

sdata[tid] = (i<N) ? X[i]*Y[i] : 0;

if (i==0) {
DOT[0]=0;
*FLAG=0;
}

barrier(CLK_LOCAL_MEM_FENCE);


// do reduction in shared mem
for(unsigned int s=1; s < get_local_size(0); s *= 2)
{
int index = 2 * s * tid;

if (index < get_local_size(0))
{
sdata[index] += sdata[index + s];
}

barrier(CLK_LOCAL_MEM_FENCE);
}

// write result for this block to global mem
if (tid == 0) {
while (atom_cmpxchg(FLAG,0,1)==1);
DOT[0] += sdata[0];
atom_cmpxchg(FLAG,1,0);
}

}

5 comments:

Cravo e Canela said...

Obrigada pela visita!!
Entao vc consegue visualizar neh? hahaha
Pois entao, depois que o Antonio ficou dodoi as mulheres se acharam no direito de tentar ser minhas amigas,pior de infancia. E isso eh para servir em quem se identificar.Cheirim Frances..:)

黃昏 said...
This comment has been removed by a blog administrator.
chat said...

若對自己誠實,日積月累,就無法對別人不忠了。........................................

Frederic said...

Hi!

This Compare and swap method looks really great but are you sure it works?

I always get an out of resource error...

greetings

Anonymous said...

I think this code is incorrect in some cases and sub-optimal. Sub-optimal, because the actual reduction interleaves threads in a wave front/warp that work and wait instead of grouping them together (see the NVIDIA SDK reduction example for a better loop).
Incorrect, because it lacks a read and a write memory fence ([read|write]_mem_fence() funcs) in the reduction loop. The unlocking atomic does not solve this and can be a normal write when properly fenced. The barrier inserted after the FLAG init also does not guarantee proper init for every thread.
Don't know about the resource error Frederic mentions.
(I know this entry is old, but it tops google for certain queries.)