/*

* sDOT OpenCL Kernel Function for Level 1 BLAS Dot Product dot<-xy * Author; Wendell Rodrigues

* INRIA-Lille :: DaRT Team

*/

__kernel void sDOT(

__global const unsigned int N,

__global const float* X,

__global const float* Y,

__global float* DOT,

__global int* FLAG,

__local float* sdata

)

{

// get index into global data array

unsigned int tid = get_local_id(0);

unsigned int i = get_global_id(0);

sdata[tid] = (i<N) ? X[i]*Y[i] : 0;

if (i==0) {

DOT[0]=0;

*FLAG=0;

}

barrier(CLK_LOCAL_MEM_FENCE);

// do reduction in shared mem

for(unsigned int s=1; s < get_local_size(0); s *= 2)

{

int index = 2 * s * tid;

if (index < get_local_size(0))

{

sdata[index] += sdata[index + s];

}

barrier(CLK_LOCAL_MEM_FENCE);

}

// write result for this block to global mem

if (tid == 0) {

while (atom_cmpxchg(FLAG,0,1)==1);

DOT[0] += sdata[0];

atom_cmpxchg(FLAG,1,0);

}

}

## Tuesday, December 8, 2009

### OpenCL Kernel for Scalar Product with Atomic Operations

The final sum of the dotproduct example is implemented on CPU. This is a solution of Scalar Product (DotProduct) without final reduction on the host side. This example uses atomic operations.

Subscribe to:
Post Comments (Atom)

## 5 comments:

Obrigada pela visita!!

Entao vc consegue visualizar neh? hahaha

Pois entao, depois que o Antonio ficou dodoi as mulheres se acharam no direito de tentar ser minhas amigas,pior de infancia. E isso eh para servir em quem se identificar.Cheirim Frances..:)

若對自己誠實，日積月累，就無法對別人不忠了。........................................

Hi!

This Compare and swap method looks really great but are you sure it works?

I always get an out of resource error...

greetings

I think this code is incorrect in some cases and sub-optimal. Sub-optimal, because the actual reduction interleaves threads in a wave front/warp that work and wait instead of grouping them together (see the NVIDIA SDK reduction example for a better loop).

Incorrect, because it lacks a read and a write memory fence ([read|write]_mem_fence() funcs) in the reduction loop. The unlocking atomic does not solve this and can be a normal write when properly fenced. The barrier inserted after the FLAG init also does not guarantee proper init for every thread.

Don't know about the resource error Frederic mentions.

(I know this entry is old, but it tops google for certain queries.)

Post a Comment