tag:blogger.com,1999:blog-10567803175548903502024-03-13T23:16:33.123+01:00GPU/OpenCL ModelingDiscussion about parallel applications modeling and developing.Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.comBlogger35125tag:blogger.com,1999:blog-1056780317554890350.post-65754731387314684382011-07-06T15:55:00.006+02:002011-07-06T16:03:24.361+02:002 PhD studentships in Programming and Design of Low-power Heterogeneous Multi-Core Systems<span class="Apple-style-span" style="font-family: Times; font-size: medium; "><pre style="word-wrap: break-word; white-space: pre-wrap; ">The University of Edinburgh and ARM invite applications for two PhD studentships in the general area of heterogeneous multi-cores. These positions are supported by ARM as part of the ARM Research Centre of Excellence at U.Edinburgh. </pre><pre style="word-wrap: break-word; white-space: pre-wrap; "><a href="http://www.ed.ac.uk/schools-departments/informatics/news-events/recentnews/fastercomputers">http://www.ed.ac.uk/schools-departments/informatics/news-events/recentnews/fastercomputers</a> </pre><pre style="word-wrap: break-word; white-space: pre-wrap; ">These fully funded PhD positions are open to students of <b>all</b> nationalities. 
</pre><pre style="word-wrap: break-word; white-space: pre-wrap; "><b>The centre covers a wide range of topics, including (but not limited to):</b></pre><pre style="word-wrap: break-word; white-space: pre-wrap; "><ul><li>Mapping and scheduling for low power on Heterogeneous Multi-Processing</li><li>Programming for next generation CPU and GPGPU systems</li><li>Software and hardware to make multi-processing accessible to programmers</li><li>Parallelism discovery and automatic parallelisation of sequential programs</li><li>Compilation for low power</li><li>Compilers and runtime for next generation ARM architectures</li><li>Data-centre scale parallelism</li><li>Hardware assistance in detection of non-data race free concurrent programs</li><li>Security between cloud and terminal</li><li>High performance low power micro-architecture</li></ul></pre><pre style="word-wrap: break-word; white-space: pre-wrap; ">Suitable candidates will have a strong first degree in Computer Science and a strong interest in parallel programming, optimizing compilers, runtime systems, computer architecture, security or machine learning. The exact topic of the PhD depends on the candidate's interests. We are looking for the brightest minds to pursue research in a cutting-edge arena.</pre><pre style="word-wrap: break-word; white-space: pre-wrap; ">The start date is flexible.</pre><pre style="word-wrap: break-word; white-space: pre-wrap; ">Candidates are encouraged to contact Michael O'Boyle (mob@inf.ed.ac.uk) to informally discuss the project further. 
More information on our work can be found at: <a href="http://www.icsa.informatics.ed.ac.uk/compilers">http://www.icsa.informatics.ed.ac.uk/compilers</a> </pre><pre style="word-wrap: break-word; white-space: pre-wrap; ">Formal application will then be through the School's normal PhD application process: <a href="http://www.ed.ac.uk/schools-departments/informatics/postgraduate/degrees/phd/">http://www.ed.ac.uk/schools-departments/informatics/postgraduate/degrees/phd/</a></pre></span>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0tag:blogger.com,1999:blog-1056780317554890350.post-69690769314054260322011-06-21T11:36:00.007+02:002011-06-21T12:11:40.767+02:00PhD Proposal: University of Perpignan<b>Title:</b> Real-time implementations of power flow algorithms in a multiprocessor context.<br /><div><br /><b>Keywords:</b> Power-Flow analysis, electrical network, multiprocessor, High Performance Computing under constraint, OpenCL<br /><br /><b>Supervisor/contact:</b> David Defour (david.defour'AT'univ-perp.fr)<br /><br /><b>Location:</b> Perpignan/France<br /><br /><b>Scientific context:</b> Electrical network simulation currently attracts considerable effort from the international community. Today, networks are statically sized for the expected and projected demand and are assumed to be fully accessible. The electrical network of the future, especially if one takes into account the energy needs of electric vehicles, can no longer follow this model of full accessibility: indeed, vehicles' consumption will be highly correlated (peak hours - commuting between home and office in the mornings and evenings) and networks would have to be drastically resized to take these problems into account. </div><div><br /></div><div>"Load flow" refers to the calculation of the distribution system according to the loads. 
Although there are many projects and software packages for load flow simulation, none of them offers the performance needed for acceptable responsiveness. For example, in a given area, if 10 000 vehicles connect within 30 minutes, each vehicle would require a response within 20 ms. This is not possible due to budget constraints as it would require the use of a supercomputer or a grid computer in every area that has to be considered (car park, city, region, ...). </div><div><br /></div><div><b>Goal:</b> The purpose of this thesis is to propose mechanisms that will lead to several implementations of the same algorithm targeting various objectives in the context of multicore architectures. As far as multicore architectures are concerned, improving performance is the main aim when developing applications such as the one presented above. </div><div><br /></div><div>However, new challenges need to be addressed in order to achieve performance for a reasonable investment. These challenges can be represented by the trade-off between two conflicting goals. On one side, opacity is required to hide unnecessary details about the hardware, the memory hierarchy and the communication network from the programmer. On the other side, visibility of the fundamental elements is necessary to let the programmer harness the power of today's multicore architectures. </div><div><br /></div><div>Additional constraints also have to be taken into account. Among them, power consumption is nearly as important as performance, and each usually affects the other. In this context, several implementations of the kernel have to be considered depending on the targeted hardware or the workload of the architecture. This is known as kernel versioning. 
</div><div><br /></div><div>The objective of this thesis is to define a framework in which the usual criteria that lead to multi-versioning can be extended with, for example: </div><div><b><i>Hardware constraints:</i></b> We propose to consider the usual constraints based on hardware characteristics, such as memory, network interconnect and the communication requirements of the application. </div><div><b><i>Context dependent constraints:</i></b> For a given task, several algorithms can be considered depending on the size of the data or the objectives to achieve (power, latency or bandwidth).</div><div><b><i>Regularity constraints:</i></b> Regularity is essential when considering parallel applications. Depending on the architecture, we have to consider regularity in data structures, data values, execution or control. </div><div><b><i>Reliability constraints:</i></b> Depending on the hardware, we may have a memory hierarchy with or without ECC. We may want to handle hardware errors that happen during computation by duplicating computations either in time or in space, according to the architecture. </div><div><b><i>Accuracy constraints:</i></b> Multicore architectures make use of many floating-point units. These units implement the IEEE 754 standard, whose 2008 revision introduces various representation formats, ranging from 16 to 128 bits. This encourages the adoption of mixed precision in floating-point based algorithms. </div><div><br /></div><div>Proposed implementations will be tested and validated by an industrial partner of this project. </div>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0tag:blogger.com,1999:blog-1056780317554890350.post-83287614426724767482011-06-20T17:29:00.021+02:002011-06-20T18:42:21.562+02:00Printing OpenCL binary...I found this snippet (originally in C++) somewhere, though I can't remember where. 
After a few changes to make it standard C, here is how to write out your OpenCL program binary:<div><br /></div><br /><br /><script type="syntaxhighlighter" class="brush:c"><![CDATA[<br />#include <stdio.h><br />#include <stdlib.h><br />#include <CL/cl.h><br /><br />/* Prints the binary built for each device of cpProgram (a cl_program defined elsewhere). */<br />void printBinaries()<br />{<br /> cl_uint program_num_devices = 0;<br /> clGetProgramInfo(cpProgram,CL_PROGRAM_NUM_DEVICES,sizeof(cl_uint),&program_num_devices,NULL);<br /><br /> if (program_num_devices == 0)<br /> {<br /> fprintf(stderr, "no valid binary was found\n");<br /> return;<br /> }<br /><br /> size_t binaries_sizes[program_num_devices];<br /><br /> clGetProgramInfo(cpProgram,CL_PROGRAM_BINARY_SIZES,program_num_devices*sizeof(size_t),binaries_sizes,NULL);<br /><br /> char **binaries = (char**) malloc(sizeof(char*)*program_num_devices);<br /><br /> /* one extra byte per binary for the terminating '\0' */<br /> for (size_t i = 0; i < program_num_devices; i++)<br /> binaries[i] = (char*) malloc(sizeof(char)*(binaries_sizes[i]+1));<br /><br /> /* CL_PROGRAM_BINARIES fills an array of pointers, so the size is counted in pointers */<br /> clGetProgramInfo(cpProgram, CL_PROGRAM_BINARIES, program_num_devices*sizeof(char*), binaries, NULL);<br /><br /> for (size_t i = 0; i < program_num_devices; i++)<br /> {<br /> binaries[i][binaries_sizes[i]] = '\0';<br /> printf("Program %zu:\n",i);<br /> printf("%s\n",binaries[i]);<br /> }<br /><br /> for (size_t i = 0; i < program_num_devices; i++)<br /> free(binaries[i]);<br /><br /> free(binaries);<br />}<br />]]></script><br /><br />Just replace the cpProgram variable with your own. 
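Note that printing with %s only shows the whole binary when it is text, as with the PTX output in this post; a binary that contains zero bytes would be cut at the first one. A minimal sketch of an alternative, assuming a hypothetical helper named write_binary (it is not part of OpenCL), writes the raw bytes with fwrite instead:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper, not part of the OpenCL API:
 * writes `size` raw bytes from `data` to `out`, zero bytes included.
 * Returns 0 on success and -1 on any error. */
int write_binary(FILE *out, const unsigned char *data, size_t size)
{
    if (out == NULL || (data == NULL && size > 0))
        return -1;

    /* fwrite copies exactly `size` bytes and does not stop at '\0' */
    if (size > 0 && fwrite(data, 1, size, out) != size)
        return -1;

    /* flush so that buffered write errors are reported to the caller */
    return fflush(out) == 0 ? 0 : -1;
}
```

Inside the printing loop it could be called as write_binary(stdout, (const unsigned char*) binaries[i], binaries_sizes[i]), or with a FILE* opened in "wb" mode to save each binary to disk.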
See below an example of result (NVidia PTX code):<br /><br /><script type="syntaxhighlighter" class="brush:c"><![CDATA[<br />Program 0:<br />//<br />// Generated by NVIDIA LLVM Compiler 3.2<br />//<br /><br />.version 2.2<br />.target sm_13, texmode_independent<br /><br /><br />.const .align 8 .b8 def___internal_i2opi_d[144] = { 0x08, 0x5D, 0x8D, 0x1F, 0xB1,<br />0x5F, 0xFB, 0x6B, 0xEA, 0x92, 0x52, 0x8A, 0xF7, 0x39, 0x07, 0x3D, 0x7B, 0xF1, 0xE5, <br />0xEB, 0xC7, 0xBA, 0x27, 0x75, 0x2D, 0xEA, 0x5F, 0x9E, 0x66, 0x3F, 0x46, 0x4F, 0xB7, <br />0x09, 0xCB, 0x27, 0xCF, 0x7E, 0x36, 0x6D, 0x1F, 0x6D, 0x0A, 0x5A, 0x8B, 0x11, 0x2F, <br />0xEF, 0x0F, 0x98, 0x05, 0xDE, 0xFF, 0x97, 0xF8, 0x1F, 0x3B, 0x28, 0xF9, 0xBD, 0x8B, <br />0x5F, 0x84, 0x9C, 0xF4, 0x39, 0x53, 0x83, 0x39, 0xD6, 0x91, 0x39, 0x41, 0x7E, 0x5F, <br />0xB4, 0x26, 0x70, 0x9C, 0xE9, 0x84, 0x44, 0xBB, 0x2E, 0xF5, 0x35, 0x82, 0xE8, 0x3E, <br />0xA7, 0x29, 0xB1, 0x1C, 0xEB, 0x1D, 0xFE, 0x1C, 0x92, 0xD1, 0x09, 0xEA, 0x2E, 0x49, <br />0x06, 0xE0, 0xD2, 0x4D, 0x42, 0x3A, 0x6E, 0x24, 0xB7, 0x61, 0xC5, 0xBB, 0xDE, 0xAB, <br />0x63, 0x51, 0xFE, 0x41, 0x90, 0x43, 0x3C, 0x99, 0x95, 0x62, 0xDB, 0xC0, 0xDD, 0x34, <br />0xF5, 0xD1, 0x57, 0x27, 0xFC, 0x29, 0x15, 0x44, 0x4E, 0x6E, 0x83, 0xF9, 0xA2 };<br />.const .align 4 .b8 def___GPU_i2opi_f[24] = { 0x41, 0x90, 0x43, 0x3C, 0x99, 0x95, <br />0x62, 0xDB, 0xC0, 0xDD, 0x34, 0xF5, 0xD1, 0x57, 0x27, 0xFC, 0x29, 0x15, 0x44, 0x4E, <br />0x6E, 0x83, 0xF9, 0xA2 };<br /><br />.entry ep_KRN__uCQs6obGEeCiXMyak_whYg(<br /> .param .u32 ep_KRN__uCQs6obGEeCiXMyak_whYg_param_0,<br /> .param .u32 .ptr .global .align 4 ep_KRN__uCQs6obGEeCiXMyak_whYg_param_1,<br /> .param .u32 .ptr .global .align 4 ep_KRN__uCQs6obGEeCiXMyak_whYg_param_2,<br /> .param .u32 .ptr .global .align 4 ep_KRN__uCQs6obGEeCiXMyak_whYg_param_3<br />)<br />{<br /> .reg .f32 %f<4>;<br /> .reg .pred %p<2>;<br /> .reg .s32 %r<39>;<br /><br />_ep_KRN__uCQs6obGEeCiXMyak_whYg:<br /> mov.u32 %r5, %tid.y;<br /> mov.u32 %r6, %envreg4;<br 
/> mov.u32 %r7, %envreg6;<br /> mov.u32 %r8, %ntid.x;<br /> mov.u32 %r9, %ctaid.x;<br /> mov.u32 %r10, %envreg3;<br /> mad.lo.s32 %r11, %r9, %r8, %r10;<br /> mov.u32 %r12, %tid.x;<br /> add.s32 %r13, %r5, %r6;<br /> mul.lo.s32 %r14, %r8, %r7;<br /> mov.u32 %r15, %envreg7;<br /> mov.u32 %r16, %tid.z;<br /> mov.u32 %r17, %envreg5;<br /> mov.u32 %r18, %ctaid.y;<br /> mov.u32 %r19, %ntid.y;<br /> mul.lo.s32 %r20, %r14, %r15;<br /> add.s32 %r21, %r11, %r12;<br /> mad.lo.s32 %r22, %r18, %r19, %r13;<br /> add.s32 %r23, %r16, %r17;<br /> mov.u32 %r24, %ctaid.z;<br /> mov.u32 %r25, %ntid.z;<br /> mad.lo.s32 %r26, %r14, %r22, %r21;<br /> mul.lo.s32 %r27, %r20, %r19;<br /> mad.lo.s32 %r28, %r24, %r25, %r23;<br /> mad.lo.s32 %r4, %r27, %r28, %r26;<br /> ld.param.u32 %r29, [ep_KRN__uCQs6obGEeCiXMyak_whYg_param_0];<br /> setp.lt.u32 %p1, %r4, %r29;<br /> @%p1 bra BB1_2;<br /><br />BB1_1:<br /> ret;<br /><br />BB1_2:<br /> ld.param.u32 %r3, [ep_KRN__uCQs6obGEeCiXMyak_whYg_param_3];<br /> ld.param.u32 %r2, [ep_KRN__uCQs6obGEeCiXMyak_whYg_param_2];<br /> ld.param.u32 %r1, [ep_KRN__uCQs6obGEeCiXMyak_whYg_param_1];<br /> shr.s32 %r30, %r4, 31;<br /> shr.u32 %r31, %r30, 28;<br /> add.s32 %r32, %r4, %r31;<br /> and.b32 %r33, %r32, 1073741808;<br /> sub.s32 %r34, %r4, %r33;<br /> shl.b32 %r35, %r34, 2;<br /> add.s32 %r36, %r3, %r35;<br /> add.s32 %r37, %r1, %r35;<br /> ld.global.f32 %f1, [%r37];<br /> ld.global.f32 %f2, [%r36];<br /> add.rn.f32 %f3, %f2, %f1;<br /> add.s32 %r38, %r2, %r35;<br /> st.global.f32 [%r38], %f3;<br /> ret;<br />}<br />]]></script>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0tag:blogger.com,1999:blog-1056780317554890350.post-55166078241899995042011-06-13T18:55:00.006+02:002011-06-14T18:34:30.412+02:00OpenCL programs in Latex (listings package)<div>When we need to put some snippets of OpenCL in Latex documents usually we use C syntax definition for the listings package. 
I've added the definitions below to the lstlang1.sty file (you can find it in the shared directory of your tex installation). Put this just after your C++ language definitions and before Objective C.</div><div><br /><pre style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;;background-image:URL(http://2.bp.blogspot.com/_z5ltvMQPaa8/SjJXr_U2YBI/AAAAAAAAAAM/46OqEP32CJ8/s320/codebg.gif);padding:0px;color:#000000;text-align:left;line-height:20px;"><code style="color:#000000;word-wrap:normal;"> \lst@definelanguage[OpenCL]{C}[ANSI]{C} <br />{morekeywords={__kernel,kernel,__local,local,__global,global,% <br />__constant,constant,__private,private,% <br />char2,char3,char4,char8,char16,% <br />uchar2,uchar3,uchar4,uchar8,uchar16,% <br />short2,short3,short4,short8,short16,% <br />ushort2,ushort3,ushort4,ushort8,ushort16,% <br />int2,int3,int4,int8,int16,% <br />uint2,uint3,uint4,uint8,uint16,% <br />long2,long3,long4,long8,long16,% <br />ulong2,ulong3,ulong4,ulong8,ulong16,% <br />float2,float3,float4,float8,float16,% <br />image2d_t,image3d_t,sampler_t,event_t,% <br />bool2,bool3,bool4,bool8,bool16,% <br />half2,half3,half4,half8,half16,% <br />quad,quad2,quad3,quad4,quad8,quad16,% <br />complex,imaginary},% <br />}% <br /></code></pre><br /></div><div><br /></div><div>Then, set up your listing environment like this example:</div><div></div><br /><pre style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:100%;overflow:auto;background:#f0f0f0;;background-image:URL(http://2.bp.blogspot.com/_z5ltvMQPaa8/SjJXr_U2YBI/AAAAAAAAAAM/46OqEP32CJ8/s320/codebg.gif);padding:0px;color:#000000;text-align:left;line-height:20px;"><code style="color:#000000;word-wrap:normal;"> \lstset{language=[OpenCL]C,caption={Generated Kernel}, <br />label=case:listing2, <br />basicstyle=\tiny, <br />backgroundcolor=\color{whitegray}, <br />numbers=left, <br />numberstyle=\tiny} <br /></code></pre><br /><div><b>Hint: 
</b>Alternatively, you can put this configuration directly in your tex project. Copy and paste it after your \usepackage{listings} line, replacing "\lst@definelanguage" with "\lstdefinelanguage".</div>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0tag:blogger.com,1999:blog-1056780317554890350.post-1392358198232890382011-06-10T19:40:00.025+02:002011-06-11T18:55:42.169+02:00GPU and CPU Double Precision Performance<div style="text-align: center;"><a href="http://3.bp.blogspot.com/-CMJkP4H8_tE/TfOYpCqBe7I/AAAAAAAAA9E/sBjmlolpCDg/s1600/chart1.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img style="cursor:pointer; cursor:hand;width: 550px; height: 350px;" src="http://3.bp.blogspot.com/-CMJkP4H8_tE/TfOYpCqBe7I/AAAAAAAAA9E/sBjmlolpCDg/s400/chart1.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5617000991116327858" /></a></div><div><br /></div><div>After checking data from Wikipedia, vendors' specifications and other sources, I made the chart above. It compares GPU and CPU performance in double precision. While the Intel Core i7 980X (Extreme Edition) gives us around 110 GFLOPS (Source: <a href="http://www.tomshardware.com/fr/benchmark/charts-processeurs-2010/Raw-Performance-SiSoftware-Sandra-2010-Pro-GFLOPS,2409.html">Tom's Hardware</a>), GPUs such as AMD Radeon 6970 and NVidia C2090 offer more than 660 GFLOPS. 
Obviously these benchmarks represent peak performance under specific conditions for each platform.</div>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0tag:blogger.com,1999:blog-1056780317554890350.post-77664202372659066802011-06-07T23:59:00.014+02:002011-06-08T02:33:50.266+02:00OpenCL Books<div><b><span class="Apple-style-span">OpenCL Programming Guide</span></b></div><div><b><span class="Apple-style-span" style="font-weight: normal; "><ul class="metadatalist" style="list-style-type: none; list-style-position: initial; list-style-image: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; width: 500px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 10px; "><li><p class="p data" style="margin-top: 0px; margin-right: 0px; margin-bottom: 3px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; line-height: 1.2em; "><strong class="strong">By: </strong><a target="_blank" href="http://www.informit.com/authors/author_bio.aspx?ISBN=9780321749642" style="text-decoration: none; outline-style: none; outline-width: initial; outline-color: initial; ">Aaftab Munshi; Benedict Gaster; Timothy G. Mattson; James Fung; Dan Ginsburg</a></p></li><li><p class="p data" style="margin-top: 0px; margin-right: 0px; margin-bottom: 3px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; line-height: 1.2em; "><strong class="strong">Publisher: </strong><span>Addison-Wesley Professional</span></p></li></ul></span></b></div><span class="Apple-style-span"><a href="http://ecx.images-amazon.com/images/I/419e7nPUiJL._SS500_.jpg" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img style="float:right;cursor:pointer; cursor:hand;width: 300px; height: 300px;" src="http://ecx.images-amazon.com/images/I/419e7nPUiJL._SS500_.jpg" border="0" alt="" /></a>Chapter 1. An Introduction to OpenCL<br />Chapter 2. 
HelloWorld: An OpenCL Example</span><div><span class="Apple-style-span">Chapter 3. Platforms, Contexts, and Devices</span></div><div><span class="Apple-style-span">Chapter 4. Programming with OpenCL C</span></div><div><span class="Apple-style-span">Chapter 5. OpenCL C Built-in Functions</span></div><div><span class="Apple-style-span">Chapter 6. Programs and Kernels</span></div><div><span class="Apple-style-span">Chapter 7. Buffers and Sub-Buffers</span></div><div><span class="Apple-style-span">Chapter 8. Images and Samplers</span></div><div><span class="Apple-style-span">Chapter 9. Events</span></div><div><span class="Apple-style-span">Chapter 10. Interoperability with OpenGL</span></div><div><span class="Apple-style-span">Chapter 11. Interoperability with Direct3D</span></div><div><span class="Apple-style-span">Chapter 12. C++ Wrapper API</span></div><div><span class="Apple-style-span">Chapter 13. OpenCL Embedded Profile</span></div><div><span class="Apple-style-span">Chapter 14. Image Histogram</span></div><div><span class="Apple-style-span">Chapter 15. Sobel Edge Detection Filter</span></div><div><span class="Apple-style-span">Chapter 16. Parallelizing Dijkstra’s Single Source Shortest Path Graph Algorithm</span></div><div><span class="Apple-style-span">Chapter 17. Cloth Simulation in the Bullet Physics SDK</span></div><div><span class="Apple-style-span">Chapter 18. Simulating the Ocean with Fast Fourier Transform</span></div><div><span class="Apple-style-span">Chapter 19. Optical Flow</span></div><div><span class="Apple-style-span">Chapter 20. Using OpenCL with PyOpenCL</span></div><div><span class="Apple-style-span">Chapter 21. Matrix Multiplication with OpenCL</span></div><div><span class="Apple-style-span">Chapter 22. Sparse Matrix-Vector Multiplication</span></div><div><span class="Apple-style-span">Appendix A. 
Summary of OpenCL 1.1</span></div><div><span class="Apple-style-span"><br /></span></div><div><span class="Apple-style-span"><br /></span></div><div><span class="Apple-style-span"><br /></span></div><div><span class="Apple-style-span"><b>The OpenCL Programming Book</b></span></div><div><span class="Apple-style-span"></span><span class="Apple-style-span" style="line-height: 20px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; "><strong style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; ">Publisher:</strong> Fixstars Corporation <strong style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; ">Author:</strong> Fixstars Corporation (Ryoji Tsuchiyama, Takashi Nakamura, Takuro Iizuka, Akihiro Asahara, Satoshi Miki)</span></div><div><span id="btAsinTitle"><span class="Apple-style-span" style="line-height: 20px; -webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; "><div><b>Introduction to Parallelization</b></div><div>Why Parallel</div><a href="http://ecx.images-amazon.com/images/I/51dX9NNHmQL._SS500_.jpg" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 300px; height: 300px;" src="http://ecx.images-amazon.com/images/I/51dX9NNHmQL._SS500_.jpg" border="0" alt="" /></a><div>Parallel Computing (Hardware)</div><div>Parallel Computing (Software)</div><div>Conclusion</div><div><b>OpenCL</b></div><div>What is OpenCL?</div><div>Historical Background</div><div>An Overview of OpenCL</div><div>Why OpenCL?</div><div>Applicable Platforms</div><div><b>OpenCL Setup</b></div><div>Available OpenCL Environments</div><div>Developing Environment Setup</div><div>First OpenCL Program</div><div>Basic OpenCL</div><div>Basic Program 
Flow</div><div>Online/Offline Compilation</div><div>Calling the Kernel</div><div><b>Advanced OpenCL</b></div><div>OpenCL C</div><div>OpenCL Programming Practice</div><div><b>Case Study</b></div><div>FFT (Fast Fourier Transform)</div><div>Mersenne Twister</div><div>Notes</div><div><br /></div><div><br /></div><div><div><b><span class="Apple-style-span">Heterogeneous Computing with OpenCL</span></b></div><div>Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry, Dana Schaa</div></div><a href="http://ecx.images-amazon.com/images/I/51VJue%2B4UVL._SS500_.jpg" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 300px; height: 300px;" src="http://ecx.images-amazon.com/images/I/51VJue%2B4UVL._SS500_.jpg" border="0" alt="" /></a><div><div><ol><li>Introduction to Parallel Programming</li><li>Introduction to OpenCL</li><li>OpenCL Device Architectures</li><li>Basic OpenCL Examples</li><li>Understanding OpenCL's Concurrency and Execution Model</li><li>Dissecting a CPU/GPU OpenCL Implementation</li><li>OpenCL Case Study: Convolution</li><li>OpenCL Case Study: Video Processing</li><li>OpenCL Case Study: Histogram</li><li>OpenCL Case Study: Mixed Particle Simulation</li><li>OpenCL Extensions</li><li>OpenCL Profiling and Debugging</li><li>WebCL</li></ol></div></div></span></span></div>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0tag:blogger.com,1999:blog-1056780317554890350.post-8909392231668406112011-06-06T20:32:00.008+02:002011-06-06T21:06:49.590+02:00AMD Radeon™ E6760 Embedded GPU OpenCL Compliant<div style="text-align: center;"><a href="http://3.bp.blogspot.com/-aN_Hi5vQu7I/Te0eAQZaVII/AAAAAAAAA7k/RoIliDgd0a0/s1600/amdgpu.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img style="cursor:pointer; cursor:hand;width: 400px; height: 203px;" 
src="http://3.bp.blogspot.com/-aN_Hi5vQu7I/Te0eAQZaVII/AAAAAAAAA7k/RoIliDgd0a0/s400/amdgpu.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5615177300151260290" /></a></div><div><br /></div><div>AMD has released the first embedded GPU offering support for OpenCL.</div><div>More information:</div><div><a href="http://www.amd.com/us/press-releases/Pages/embedded-gpu-2011may02.aspx">http://www.amd.com/us/press-releases/Pages/embedded-gpu-2011may02.aspx</a></div><div><a href="http://www.amd.com/us/Documents/49807_E6760_GPU_brief.pdf">http://www.amd.com/us/Documents/49807_E6760_GPU_brief.pdf</a></div><div><br /></div><div style="text-align: left;">Why does NVidia not do the same for Tegra?</div><div><br /></div><div>Anyway, let's create high-performance OpenCL applications for embedded systems.</div>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0tag:blogger.com,1999:blog-1056780317554890350.post-29980314636449935262011-06-03T02:40:00.060+02:002011-06-18T10:42:26.492+02:00Anjuta Project Wizards for AMD, NVidia and Intel OpenCL SDK<div style="text-align: left;">To encourage OpenCL development, I created some wizards to start an OpenCL application project using the SDK from NVidia, AMD or Intel. I've used Anjuta DevStudio on Linux.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">This is just a first approach, so don't be disappointed if you need to make some changes. 
The wizards give you a simple functional code and you can work on this one.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">System requirements (they depend on your goal):</div><ul><li style="text-align: left;">NVidia: GPU Computing SDK code samples, CUDA Toolkit (<a href="http://developer.nvidia.com/">http://developer.nvidia.com</a>)<br /></li><li style="text-align: left;">AMD APP SDK (<a href="http://developer.amd.com/">http://developer.amd.com</a>)</li><li style="text-align: left;">INTEL OpenCL SDK (<a href="http://software.intel.com/en-us/articles/opencl-sdk">http://software.intel.com/en-us/articles/opencl-sdk</a>)<br /></li><li style="text-align: left;">Anjuta DevStudio 2.30+</li></ul><div><div style="text-align: left;"><a name='more'></a><br /></div><span class="Apple-style-span"><div style="text-align: left;"><b>OpenCL Syntax Highlighting</b></div></span></div><div style="text-align: left;">Usually, files with <i>.cl </i>extension<i> </i>don't have any associated language specification. Download the opencl.lang (<a href="http://www.streamcomputing.eu/downloads/?opencl.lang">http://www.streamcomputing.eu/downloads/?opencl.lang</a>, by the way, thanks to streamcomputing blog) and copy it to your <i>~/.local/share/gtksourceview-2.0/language-specs</i> directory. 
Make a test opening a <i>.cl</i> file with gedit.</div><a href="http://3.bp.blogspot.com/-hWLfLZBQeew/TelCJUjexKI/AAAAAAAAA60/Quj9HyvqhIw/s1600/anjuta-2.1.2-1.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img style="float:right;cursor:pointer; cursor:hand;width: 320px; height: 244px;" src="http://3.bp.blogspot.com/-hWLfLZBQeew/TelCJUjexKI/AAAAAAAAA60/Quj9HyvqhIw/s320/anjuta-2.1.2-1.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5614091138397815970" /></a><div style="text-align: left;"><br /></div><div style="text-align: left;"><b>Anjuta DevStudio</b></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><div style="text-align: left;"><span class="Apple-style-span">It is a versatile software development studio featuring a number of advanced programming facilities</span> including project management, application wizard, interactive debugger, source editor, version control, GUI<span class="Apple-style-span"> designer, profiler and many more tools. It focuses on providing simple and usable user </span><span class="Apple-style-span">inter</span><span class="Apple-style-span">face, yet powerful for efficient development. 
[</span><a href="http://projects.gnome.org/anjuta/">http://projects.gnome.org/anjuta/</a><span class="Apple-style-span" style="font-size: medium; ">]</span></div><div style="text-align: left;"><span class="Apple-style-span"><br /></span></div><div style="text-align: left;"><b><span class="Apple-style-span">New OpenCL Wizards</span></b></div><div style="text-align: left;"><span class="Apple-style-span">Create (if that's not there) your ~/.local/share/anjuta/project</span></div><div style="text-align: left;"><span class="Apple-style-span">~/.local/share/anjuta/project$ wget <a href="https://sites.google.com/site/wendellrodrigues/projects/OpenCLWizards.tgz">https://sites.google.com/site/wendellrodrigues/projects/OpenCLWizards.tgz</a></span></div><div style="text-align: left;"><span class="Apple-style-span">~/.local/share/anjuta/project$ tar xvzf OpenCLWizards.tgz</span></div><div style="text-align: left;"><b><br /></b></div><div style="text-align: left;">Open <i>anjuta</i> and try to create a N<b>ew Project</b>.</div><div style="text-align: left;"><b><br /><a href="http://2.bp.blogspot.com/-dxVVQjek-8g/TekpGYAlI2I/AAAAAAAAA58/_DzqcrGdfcg/s1600/anjuta2.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img style="cursor:pointer; cursor:hand;width: 400px; height: 232px;" src="http://2.bp.blogspot.com/-dxVVQjek-8g/TekpGYAlI2I/AAAAAAAAA58/_DzqcrGdfcg/s400/anjuta2.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5614063599994872674" /></a><br /></b></div><div style="text-align: left;"><b><br /></b></div><div style="text-align: left;">Now we can see new types of project wizard in the <b>C</b> tab: AMD and Intel and in the <b>C++</b> tab: AMD, Intel and NVidia.</div><div style="text-align: left;"><b><br /></b></div><div style="text-align: left;"><span class="Apple-style-span"><span class="Apple-style-span"></span><b>AMD APP SDK</b></span></div><div style="text-align: left;">Creating a C project:<br /><br /><a 
href="http://3.bp.blogspot.com/-nPyoq7heRck/TeksQJPrenI/AAAAAAAAA6E/3WJg-oGRi_Q/s1600/anjuta3.png" style="font-weight: bold; " onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img style="cursor:pointer; cursor:hand;width: 400px; height: 239px;" src="http://3.bp.blogspot.com/-nPyoq7heRck/TeksQJPrenI/AAAAAAAAA6E/3WJg-oGRi_Q/s400/anjuta3.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5614067066365246066" /></a><br /><br /><a href="http://4.bp.blogspot.com/-fVub8hnx_XQ/Teks-l3KXQI/AAAAAAAAA6M/-ZjS0AA-gjU/s1600/anjuta4.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img style="cursor:pointer; cursor:hand;width: 400px; height: 232px;" src="http://4.bp.blogspot.com/-fVub8hnx_XQ/Teks-l3KXQI/AAAAAAAAA6M/-ZjS0AA-gjU/s400/anjuta4.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5614067864321023234" /></a><br /><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">For C++ applications the process is similar. The file templates are based on <i>Template</i> and <i>TemplateC </i>from the SDK and the projects will be created (default) in "<i>samples/opencl/myprojects/app</i>". If you change the default destination, please be careful of relative references in the Makefile.</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><b>NVidia SDK</b></div><div style="text-align: left;">For NVidia applications we have only the <b>C++</b>. Even though the samples' code are standard <b>C</b>, the provided Makefile and includes are made targeting <i><b>g++</b></i>. That's not a problem and you can write your code in <b>C</b> as well as <b>C++</b>. The wizard will ask you for the "<i>NVIDIA_GPU_Computing_SDK</i>" path. By default, projects will be created in "<i>OpenCL/myprojects</i>".</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><b>Intel SDK</b></div><div style="text-align: left;">Like for AMD SDK we have <b>C</b> and <b>C++ </b>wizards. 
However, they use Makefiles generated by the <i><a href="http://sources.redhat.com/autobook/">autotools</a>.</i> The Intel SDK for Linux is not yet stable, so these wizards rely on the folder where you installed the <i><b>lib64</b></i> (yes, for Linux there is only a 64-bit version so far)<i> </i>and <i><b>include</b></i> folders provided by the SDK package. By default, projects are created at the same directory level as <b><i>include</i> </b>and <i style="font-weight: bold; ">lib64</i>.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Moreover, the Makefiles take these points into account:</div><div style="text-align: left;"><ul><li>CL includes: -I../../include</li><li>CL Dynamic Library: -lOpenCL (located at <i>lib64/libOpenCL.so</i>)</li></ul></div><a href="http://4.bp.blogspot.com/-tl3VsJa6q-8/Tek6l15x5YI/AAAAAAAAA6U/PjyNHbO_k8A/s1600/anjuta5.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img style="cursor:pointer; cursor:hand;width: 400px; height: 240px;" src="http://4.bp.blogspot.com/-tl3VsJa6q-8/Tek6l15x5YI/AAAAAAAAA6U/PjyNHbO_k8A/s400/anjuta5.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5614082832293029250" /></a><br /><a href="http://3.bp.blogspot.com/-RdVkcqw9gL0/Tek65AlTKkI/AAAAAAAAA6c/BhuXe8U2pt0/s1600/anjuta6.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img style="cursor:pointer; cursor:hand;width: 400px; height: 232px;" src="http://3.bp.blogspot.com/-RdVkcqw9gL0/Tek65AlTKkI/AAAAAAAAA6c/BhuXe8U2pt0/s400/anjuta6.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5614083161577433666" /></a><br /><a href="http://3.bp.blogspot.com/-tAy0J5DrFKs/Tek65inVdHI/AAAAAAAAA6k/Kwu9WuS7BjU/s1600/anjuta7.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img style="cursor:pointer; cursor:hand;width: 400px; height: 232px;" src="http://3.bp.blogspot.com/-tAy0J5DrFKs/Tek65inVdHI/AAAAAAAAA6k/Kwu9WuS7BjU/s400/anjuta7.png" border="0" alt="" 
id="BLOGGER_PHOTO_ID_5614083170712777842" /></a><div><br /></div><div>To run your OpenCL programs, you sometimes need to set the <i>working directory</i>, the <i>executable</i> and any <i>parameters</i>.</div><br /><a href="http://3.bp.blogspot.com/-3KZ13woUD2w/Tek7_hCtI6I/AAAAAAAAA6s/zPYjsJ8DeRY/s1600/anjuta8.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img style="cursor:pointer; cursor:hand;width: 320px; height: 191px;" src="http://3.bp.blogspot.com/-3KZ13woUD2w/Tek7_hCtI6I/AAAAAAAAA6s/zPYjsJ8DeRY/s320/anjuta8.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5614084372881548194" /></a><div><br /></div><div><b>Final Considerations</b></div><div>I hope these wizards make you curious enough to start programming in OpenCL. Feel free to ask if you have any questions, or just leave a comment.</div></div>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com2tag:blogger.com,1999:blog-1056780317554890350.post-91769267463651382852011-05-28T00:04:00.006+02:002011-05-28T01:20:16.068+02:00OpenCL Group in LinkedinI find it important to share information among people using OpenCL. Recently, I joined the OpenCL group on <a href="http://www.linkedin.com/">Linkedin</a>. Some discussions and points of view that you'll find there:<div><ol><li>the lack of domain specific libraries compared to CUDA, for instance. However, AMD has released stuff like <a href="http://developer.amd.com/libraries/appmathlibs/Pages/default.aspx">http://developer.amd.com/libraries/appmathlibs/Pages/default.aspx</a>. 
Anyway, we should always keep in mind that OpenCL is younger than CUDA.</li><li>questions about why to use OpenCL in environments without GPUs were answered by people pointing to the unknown user environment (where your app will run) and to material like <a href="http://www.khronos.org/developers/library/2010_siggraph_bof_opencl/OpenCL-BOF-Intel-SIGGRAPH-Jul10.pdf">http://www.khronos.org/developers/library/2010_siggraph_bof_opencl/OpenCL-BOF-Intel-SIGGRAPH-Jul10.pdf</a>, which shows the direction of applications implemented with Intel's SDK.</li></ol><div>The group is open, and some helpful, qualified members are creating interesting discussions.</div></div>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0tag:blogger.com,1999:blog-1056780317554890350.post-24935756525049462772011-05-27T23:49:00.010+02:002011-05-28T01:20:52.742+02:00Parallel Primitives LibraryLet's try to improve OpenCL libraries for everything...<br /><br />Libs like this:<br /><br /><blockquote>clpp is an OpenCL Data Parallel Primitives Library. It is a library of data-parallel algorithm primitives such as parallel-prefix-sum ("scan"), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables.<br /><br />If you want to join the project, please simply send a message to our mailing list: <a href="http://groups.google.com/group/cl-pp">http://groups.google.com/group/cl-pp</a><p></p></blockquote>Thanks to <a href="http://be.linkedin.com/pub/polar-01/21/5b/a56">Polar Lights</a>' initiative. ;)Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0tag:blogger.com,1999:blog-1056780317554890350.post-49181317834080311292011-05-25T15:44:00.008+02:002011-05-25T16:09:42.707+02:00Profiling your applicationI was away for a while, but I'm back. 
In my recent projects, I collected some results from profiling tools. To understand them better, I picked up some material from Marcus Bannerman's website. Here is the link:<br /><br /><a href="http://www.marcusbannerman.co.uk/images/stories/lectures/opencl/slides6.pdf">Profiling in Linux</a><br /><br />I didn't ask for permission to publish it on this blog, but I'm sure he won't be upset about that.<br /><br />Take a look at the other articles on his website. There is a lot of material about OpenCL: <a href="http://www.marcusbannerman.co.uk">http://www.marcusbannerman.co.uk</a>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0tag:blogger.com,1999:blog-1056780317554890350.post-88383211447142640972010-04-30T17:57:00.003+02:002011-05-30T17:22:58.392+02:00Laboratoire Jacques-Louis Lions - Paris VI: Gaspard2 and OpenCL Presentation<div class="prezi-player"><style type="text/css" media="screen">.prezi-player { width: 480px; } .prezi-player-links { text-align: center; }</style><object id="prezi_jzplm0k4agbn" name="prezi_jzplm0k4agbn" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" width="480" height="400"><param name="movie" value="http://prezi.com/bin/preziloader.swf"><param name="allowfullscreen" value="true"><param name="allowscriptaccess" value="always"><param name="bgcolor" value="#ffffff"><param name="flashvars" value="prezi_id=jzplm0k4agbn&lock_to_path=1&color=ffffff&autoplay=no"><embed id="preziEmbed_jzplm0k4agbn" name="preziEmbed_jzplm0k4agbn" src="http://prezi.com/bin/preziloader.swf" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="480" height="400" bgcolor="#ffffff" flashvars="prezi_id=jzplm0k4agbn&lock_to_path=1&color=ffffff&autoplay=no"></embed></object><div class="prezi-player-links"><p><a title="Presentation on Paris VI at 29 avril" href="http://prezi.com/jzplm0k4agbn/">Gaspard2 and OpenCL</a> on <a href="http://prezi.com/">Prezi</a></p></div></div>Wendell 
Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com1tag:blogger.com,1999:blog-1056780317554890350.post-67258335791568852372010-04-30T17:41:00.003+02:002010-04-30T17:53:41.109+02:00Global Synchronization??The ATI Radeon™ HD 5870 (“Cypress”) architecture (2009) lists global synchronization among its features. According to its specification:<br />2.72 Teraflops Single Precision, 544 Gigaflops Double Precision<br /><ul><li>Full Hardware Implementation of DirectCompute 11 and OpenCL™ 1.0</li><li>IEEE754-2008 Compliance Enhancements</li><li>Additional Compute Features:</li><li>32-bit Atomic Operations</li><li>Flexible 32kB Local Data Shares</li><li>64kB Global Data Share</li><li style="font-weight: bold;">Global synchronization</li><li>Append/consume buffers</li></ul>I searched for more information about it (examples in OpenCL, etc.) but didn't find any. If you have any tips, write to me.Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0tag:blogger.com,1999:blog-1056780317554890350.post-56655242141610323662010-04-22T13:47:00.001+02:002010-04-22T13:49:40.663+02:00Eyeon Software Unveils Fusion 6.1 with OpenCL SupercomputingFrom Khronos Group News:<br /><br /><blockquote>Fusion 6.1 now supports the OpenCL language, which allows tools to take advantage of the GPU in modern NVIDIA and ATI graphics cards to achieve tremendous speed increases. Benchmarking improvements of up to 1000% on some of the most processor-intensive tools in Fusion (such as Defocus and Noise generators) have been reached. 
You can also insert OpenCL code directly into Fuse tools to create in-house GPU-accelerated tools.</blockquote><a href="http://www.eyeonline.com/Web/EyeonWeb/Press/DisplayArticle.aspx?articleid=406">http://www.eyeonline.com/Web/EyeonWeb/Press/DisplayArticle.aspx?articleid=406</a>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0tag:blogger.com,1999:blog-1056780317554890350.post-74752877577923247942010-04-13T14:09:00.006+02:002010-04-13T14:19:54.558+02:00Code Generation for OpenCL (Prezi Presentation)I'm working hard on our compiler, which generates OpenCL source code. Meanwhile, here is the presentation I made on Prezi for the Valeo PhD Students Seminar.<br /><div class="prezi-player"><style type="text/css" media="screen">.prezi-player { width: 475px; } .prezi-player-links { text-align: center; }</style><object id="prezi_1unsadlk_5go" name="prezi_1unsadlk_5go" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" height="400" width="475"><param name="movie" value="http://prezi.com/bin/preziloader.swf"><param name="allowfullscreen" value="true"><param name="allowscriptaccess" value="always"><param name="bgcolor" value="#ffffff"><param name="flashvars" value="prezi_id=1unsadlk_5go&lock_to_path=1&color=ffffff&autoplay=no"><embed id="preziEmbed_1unsadlk_5go" name="preziEmbed_1unsadlk_5go" src="http://prezi.com/bin/preziloader.swf" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" bgcolor="#ffffff" flashvars="prezi_id=1unsadlk_5go&lock_to_path=1&color=ffffff&autoplay=no" height="400" width="475"></embed></object><div class="prezi-player-links"><p><a title="" href="http://prezi.com/1unsadlk_5go/">Gaspard2, GPU and Valeo</a> on <a href="http://prezi.com/">Prezi</a></p></div></div>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0tag:blogger.com,1999:blog-1056780317554890350.post-27432121847918595542009-12-08T20:03:00.011+01:002010-04-22T14:20:42.315+02:00OpenCL Kernel 
for Scalar Product with Atomic OperationsIn the dot product example, the final sum is computed on the CPU. This is a version of the scalar product (dot product) that needs no final reduction on the host side; it uses atomic operations instead.<br /><br /><pre class="prettyprint"><br />/*<br />* sDOT OpenCL Kernel Function for Level 1 BLAS Dot Product dot<-xy<br />* Author: Wendell Rodrigues <wendell.rodrigues@inria.fr><br />* INRIA-Lille :: DaRT Team<br />*<br />* The host must set DOT[0] and FLAG[0] to 0 before launching the kernel:<br />* zeroing them inside the kernel would race with the other work-groups.<br />*/<br /><br />// atom_cmpxchg on __global int is an extension in OpenCL 1.0<br />#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable<br /><br />__kernel void sDOT(<br />const unsigned int N, // scalar arguments take no address space qualifier<br />__global const float* X,<br />__global const float* Y,<br />volatile __global float* DOT,<br />__global int* FLAG,<br />__local float* sdata<br />)<br />{<br />// get index into global data array<br />unsigned int tid = get_local_id(0);<br />unsigned int i = get_global_id(0);<br /><br />// each work-item stores one product; work-items past N store 0<br />sdata[tid] = (i<N) ? X[i]*Y[i] : 0.0f;<br /><br />barrier(CLK_LOCAL_MEM_FENCE);<br /><br /><br />// do reduction in shared mem<br />for(unsigned int s=1; s < get_local_size(0); s *= 2)<br />{<br /> unsigned int index = 2 * s * tid;<br /><br /> // testing index + s keeps the read inside sdata for any work-group size<br /> if (index + s < get_local_size(0))<br /> {<br /> sdata[index] += sdata[index + s];<br /> }<br /><br /> barrier(CLK_LOCAL_MEM_FENCE);<br />}<br /><br />// write result for this block to global mem<br />if (tid == 0) {<br /> // spin until this work-group takes the lock (FLAG: 0 -> 1)<br /> while (atom_cmpxchg(FLAG,0,1)==1);<br /> DOT[0] += sdata[0];<br /> // make the update visible before releasing the lock<br /> mem_fence(CLK_GLOBAL_MEM_FENCE);<br /> atom_cmpxchg(FLAG,1,0);<br />}<br /><br />}<br /><br /></pre>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com4tag:blogger.com,1999:blog-1056780317554890350.post-11723880575904280262009-11-10T12:52:00.002+01:002009-11-10T13:18:18.545+01:00Modeling Challenges on OpenCL Code GenerationThis week, I presented an overview of the integration of OpenCL and Gaspard2. In order to overcome the many challenges of model conception and transformations, we need to study a new MoC (other than Array-OL), coalesced memory allocation and task distribution. 
The slides have two good examples of OpenCL applications and some questions about model conception and code generation.<br /><br /><a href="http://dl.dropbox.com/u/591431/Wendell_OpenCL_Modeling_2009.pdf">PDF</a>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0tag:blogger.com,1999:blog-1056780317554890350.post-56826958126534699422009-11-10T12:43:00.003+01:002009-11-10T12:50:06.000+01:00OpenWF - The Standard for building composited windowing systemsFrom Khronos Group:<br /><br /><blockquote><br />Embedded devices are increasingly expected to offer sophisticated user interfaces that combine rich graphics with multimedia content. Graphics and display hardware technologies have evolved to achieve these visuals with significantly higher efficiency than traditional CPUs, delivering greater performance, decreasing memory bandwidth usage and increasing battery life. Making use of this variety of hardware introduces fragmentation as software needs to be adapted to each hardware configuration.<br /><br />A platform’s Hardware Abstraction Layer (HAL) for display and graphics technology allows the applications and middleware layers above to be deployed across a range of hardware without costly porting activities. OpenGL is an example of a graphics HAL that allows portable software to take advantage of a wide range of 3D hardware accelerators.<br /><br />Windowing systems allow screens to be shared by multiple applications, ensuring that the graphics provided for each application’s window is sensibly merged onto the screen. This requires the graphics and display drivers to respect the intentions of the windowing system, which commonly means considerable OS-specific porting work on the part of the device manufacturer when moving to new hardware.<br /><br />The OpenWF APIs provide an OS-independent and hardware-neutral foundation for building compositing systems, particularly suited to implementing windowing systems. 
OpenWF acts as a HAL to achieve composition of content and configuration of display devices. The interfaces are designed for use by a single user which could be a central windowing system or, in an application-specific system, may be the application itself.</blockquote><br /><br /><a href="http://www.khronos.org/openwf/">http://www.khronos.org/openwf/</a>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0tag:blogger.com,1999:blog-1056780317554890350.post-83918938080897811162009-10-23T17:18:00.003+02:002009-10-23T17:32:06.180+02:00Conjugate Gradient and OpenCLI've just finished a conjugate gradient implementation for OpenCL. It doesn't perform well yet, but I'm working on fixing the bugs and optimizing the code.<br /><br />Here is a <a href="http://sites.google.com/site/wendellrodrigues/projects/present.pdf?attredirects=0&d=1">PDF</a> that gives an overview of the subject.Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com2tag:blogger.com,1999:blog-1056780317554890350.post-60748711840053964252009-10-16T23:57:00.007+02:002009-10-17T00:13:53.898+02:00Nvidia's Next Generation: Fermi - key architectural highlights<span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Third Generation Streaming Multiprocessor (SM)</span><br /><ul><li>32 CUDA cores per SM, 4x over GT200</li><li>8x the peak double precision floating point performance over GT200</li><li style="color: rgb(51, 204, 0);">Dual Warp Scheduler simultaneously schedules and dispatches instructions from two independent warps</li><li>64 KB of RAM with a configurable partitioning of shared memory and L1 cache</li></ul><br /><span style="font-weight: bold;">Second Generation Parallel Thread Execution ISA</span><br /><ul><li style="color: rgb(51, 204, 0);">Unified Address Space with Full C++ Support</li><li style="color: rgb(51, 204, 0);">Optimized for OpenCL and DirectCompute</li><li>Full IEEE 754-2008 
32-bit and 64-bit precision</li><li>Full 32-bit integer path with 64-bit extensions</li><li>Memory access instructions to support transition to 64-bit addressing</li><li>Improved Performance through Predication</li></ul><br /><span style="font-weight: bold;">Improved Memory Subsystem</span><br /><ul><li>NVIDIA Parallel DataCache™ hierarchy with Configurable L1 and Unified L2 Caches</li><li>First GPU with ECC memory support</li><li>Greatly improved atomic memory operation performance</li></ul><br /><span style="font-weight: bold;">NVIDIA GigaThread™ Engine</span><br /><ul><li>10x faster application context switching</li><li>Concurrent kernel execution</li><li style="color: rgb(51, 204, 0);">Out of Order thread block execution</li><li style="color: rgb(51, 204, 0);">Dual overlapped memory transfer engines</li></ul>more information:<a href="http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf"> http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf</a><br /></blockquote>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com3tag:blogger.com,1999:blog-1056780317554890350.post-81793021285438786222009-10-15T11:13:00.001+02:002009-10-15T11:15:26.813+02:00ATI Stream Software Development Kit (SDK) v2.0 Beta Program<p><strong></strong></p><blockquote><p><strong>What’s New in v2.0-beta4</strong></p> <ul style="list-style-type: disc;"><li><span>First beta release of ATI Stream SDK with OpenCL™ GPU support.</span> </li><li><span>ATI Stream SDK v2.0 OpenCL™ is <a href="http://www.khronos.org/adopters/conformant-products/" title="Link opens in new browser window." target="_blank">certified OpenCL™ 1.0 conformant by Khronos</a>. 
</span></li><li><span>Added Microsoft® Windows® 7 support.</span> </li><li><span>Added native Microsoft® Windows® 64-bit support.</span> </li><li><span>Float comparisons in kernels no longer produce a runtime error.</span> </li><li><span>Various other issues from previous v2.0 beta releases have been resolved.<span style="font-weight: bold;"></span></span></li></ul><span style="font-size:78%;">More information: <a href="http://developer.amd.com/GPU/ATISTREAMSDKBETAPROGRAM/Pages/default.aspx">http://developer.amd.com/GPU/ATISTREAMSDKBETAPROGRAM/Pages/default.aspx</a></span><br /></blockquote>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0tag:blogger.com,1999:blog-1056780317554890350.post-3465282582421026652009-10-08T12:12:00.002+02:002009-10-08T12:33:56.852+02:00OpenCL BLAS - Makefile for MACThanks to Mario Rometsch for a version of the OpenCL BLAS Makefile for MacOS. You can download it from SourceForge.<br /><br /><a href="https://sourceforge.net/projects/openclblas/files/Makefile/download">OpenCL BLAS Makefile for MacOS</a>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0tag:blogger.com,1999:blog-1056780317554890350.post-66902426431413238992009-09-21T15:26:00.012+02:002011-05-28T01:18:09.346+02:00BLAS Library for OpenCLI use the conjugate gradient solver without preconditioners to solve a linear system <span style="font-style: italic;">Ax=b</span>, where <span style="font-style: italic;">A</span> is a sparse matrix. 
This method is iterative and uses some BLAS functions like <span style="font-style: italic;">Dot Product, Scalar Product, xAXPY and xGEMV (SpMV for sparse matrix)</span>. I've started to develop these functions for the OpenCL language and I've decided to share them.<br /><br />Right now, the following BLAS level 1 functions are available:<br /><span style="font-weight: bold;">sDOT</span> :: single precision dot product or scalar product (dot<-xy) <span style="font-weight: bold;"><br />sNRM2</span> :: single precision vector 2-norm<br /><span style="font-weight: bold;">sSCAL</span> :: single precision product of vector by scalar (x<-ax) <span style="font-weight: bold;"><br />sAXPY</span> :: single precision AXPY (y<-ax + y) <a href="http://sites.google.com/site/wendellrodrigues/projects">You can download the OpenCL code which was tested on NVIDIA Tesla C870 and GPU Computing SDK 2.3</a><br /><br /><a href="http://sourceforge.net/projects/openclblas/">SourceForge Project</a><br /><br />Please join in with your contributions!<div><br /></div><div>Update: OpenCL BLAS is now a discontinued project.</div>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com12tag:blogger.com,1999:blog-1056780317554890350.post-80517251565474728622009-09-14T16:29:00.004+02:002009-09-14T16:42:48.205+02:00ATI Stream Software Development Kit (SDK) v2.0 Beta Program With OpenCL™ 1.0 SupportWith ATI Stream SDK, AMD/ATI provides a way to program OpenCL on its cards. 
I haven't downloaded it yet, but you can get more information at:<br /><br /><a href="http://developer.amd.com/GPU/ATISTREAMSDKBETAPROGRAM/Pages/default.aspx">http://developer.amd.com/GPU/ATISTREAMSDKBETAPROGRAM/Pages/default.aspx</a><br /><br />and on the OpenCL/ATI forum:<br /><a href="http://forums.amd.com/devforum/categories.cfm?catid=390&entercat=y">http://forums.amd.com/devforum/categories.cfm?catid=390&entercat=y</a><br /><br />I'm going to test it and will post an overview here.Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0tag:blogger.com,1999:blog-1056780317554890350.post-58268326279852346472009-09-11T17:47:00.004+02:002011-06-08T13:07:08.916+02:00GPU and MatlabIf you like programming in a Matlab-like environment, I suggest the freeware GPUMat from the GP-you Group. You can explore the power of GPUs, BLAS and FFT libraries on NVIDIA cards. You can get more information at:<br /><a href="http://gp-you.org/">GP-you Group</a>Wendell Rodrigueshttp://www.blogger.com/profile/07091231311994227526noreply@blogger.com0