GPU/OpenCL Modeling

2 PhD studentships in Programming and Design of Low-power Heterogeneous Multi-Core Systems

2011-07-06T15:55:00.006+02:00

The University of Edinburgh and ARM invite applications for two PhD studentships in the general area of heterogeneous multi-cores. These positions are supported by ARM as part of the ARM Research Centre of Excellence at U.Edinburgh.

http://www.ed.ac.uk/schools-departments/informatics/news-events/recentnews/fastercomputers

These fully funded PhD positions are open to students of **all** nationalities.

The centre covers a wide range of topics, including (but not limited to):

Mapping and scheduling for low power on Heterogeneous Multi-Processing
Programming for next generation CPU and GPGPU systems
Software and hardware to make multi-processing accessible to programmers
Parallelism discovery and automatic parallelisation of sequential programs
Compilation for low power
Compilers and runtime for next generation ARM architectures
Data-centre scale parallelism
Hardware assistance in detection of non-data race free concurrent programs
Security between cloud and terminal
High performance low power micro-architecture

Suitable candidates will have a strong first degree in Computer Science and a strong interest in parallel programming, optimizing compilers, runtime systems, computer architecture, security or machine learning.  The exact topic of the PhD depends on the candidate's interests.  We are looking for the brightest minds to pursue research in a cutting-edge arena.

The start date is flexible.

Candidates are encouraged to contact Michael O'Boyle (mob@inf.ed.ac.uk) to discuss the project informally. More information on our work can be found at: http://www.icsa.informatics.ed.ac.uk/compilers

Formal application will then be through the School's normal PhD application process: http://www.ed.ac.uk/schools-departments/informatics/postgraduate/degrees/phd/

PhD Proposal: University of Perpignan

2011-06-21T11:36:00.007+02:00

Title: Real-time implementations of power flow algorithms in the context of multiprocessor.

Keywords: Power-Flow analysis, electrical network, multiprocessor, High Performance Computing under constraint, OpenCL

Supervisor/contact: David Defour (david.defour'AT'univ-perp.fr)

Location: Perpignan/France

Scientific context: Electrical network simulation monopolises considerable effort in today's international community. Currently, networks are statically sized for the expected and projected demand, and are said to be completely accessible. The electrical network of the future, especially if one takes into account the energy needs of electric vehicles, can no longer follow this model of full accessibility: vehicles' consumption will be highly correlated (peak hours, commuting between home and office in the mornings and evenings), and networks would have to be resized drastically to take these problems into account.

"Load flow" refers to the calculation of the distribution system according to the loads. Although there are many projects and software packages for load-flow simulation, none of them offers the performance needed for decent responsiveness. For example, in a given area, if 10 000 vehicles connect within 30 minutes, each vehicle would require a response within 20ms. This is not possible under budget constraints, as it would require the use of a supercomputer or a grid computer in every area that has to be considered (car park, city, region, ...).

Goal: The purpose of this thesis is to propose mechanisms that lead to several implementations of the same algorithm, each targeting different objectives, in the context of multicore architectures. As far as multicore architectures are concerned, improving performance is the main aim when developing applications such as the one presented above.

However, new challenges need to be addressed in order to achieve performance for a reasonable investment. These challenges can be represented by the trade-off between two conflicting goals. On one side, opacity is required to hide unnecessary details about the hardware, the memory hierarchy and the communication network from the programmer. On the other side, visibility of the fundamental elements is necessary to let the programmer harness the power of today's multicore architectures.

Additional constraints also have to be taken into account. Among them, power consumption is nearly as important as performance, and each usually affects the other. In this context, several implementations of a kernel have to be considered, depending on the targeted hardware or the workload of the architecture. This is kernel versioning.

The objective of this thesis is to define a framework in which the usual criteria that lead to multi-versioning can be extended with, for example:

Hardware constraints: We are proposing to consider the usual constraints based on hardware characteristics, like memory, network interconnect and communication requirements of the application.

Context dependent constraints: For a given task several algorithms can be considered depending on the size of the data or the objectives to achieve (power, latency or bandwidth).

Regularity constraints: Regularity is essential when considering parallel applications. Depending on the architecture, we have to consider regularity on data structures, data values, execution or control.

Reliability constraints: Depending on the hardware, we may have a memory hierarchy with or without ECC. We may want to handle hardware errors that happen during computation by duplicating the computation either in time or in space, according to the architecture.

Accuracy constraints: Multicore architectures make use of many floating-point units. These units implement the IEEE 754 standard, whose 2008 revision introduces various representation formats ranging from 16 to 128 bits. This encourages the adoption of mixed precision in floating-point algorithms.
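
As a small illustration of the accuracy constraint, the sketch below sums single-precision data twice: once with a single-precision accumulator and once with a double-precision one. It is a minimal CPU example in plain C and not part of the proposal; the function names sum_single and sum_mixed are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n single-precision values with a single-precision accumulator. */
float sum_single(const float *v, size_t n)
{
    assert(v != NULL || n == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += v[i];              /* each addition is rounded to 32 bits */
    return acc;
}

/* Mixed precision: the data stays in 32 bits, the accumulator uses 64 bits. */
double sum_mixed(const float *v, size_t n)
{
    assert(v != NULL || n == 0);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += (double)v[i];      /* widen each element before adding */
    return acc;
}
```

On about a million copies of 0.1f, the single-precision sum drifts visibly away from the exact value, while the mixed-precision version stays close to it at the cost of a wider accumulator.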

Proposed implementations will be tested and validated by an industrial partner of this project.

Printing OpenCL binary...

2011-06-20T17:29:00.021+02:00

I found this snippet (originally in C++) somewhere I can't remember. With a few changes to make it standard C, voilà: this is how to write out your OpenCL program binary:

Just replace the cpProgram variable with yours. Below is an example of the result (NVidia PTX code):

OpenCL programs in LaTeX (listings package)

2011-06-13T18:55:00.006+02:00

When we need to put OpenCL snippets in LaTeX documents, we usually use the C syntax definition of the listings package. I've added the definitions below to the lstlang1.sty file (you can find it in the shared directory of your TeX installation). Put them just after the C++ language definitions and before Objective C.

 \lst@definelanguage[OpenCL]{C}[ANSI]{C} 
{morekeywords={__kernel,kernel,__local,local,__global,global,% 
__constant,constant,__private,private,% 
char2,char3,char4,char8,char16,% 
uchar2,uchar3,uchar4,uchar8,uchar16,% 
short2,short3,short4,short8,short16,% 
ushort2,ushort3,ushort4,ushort8,ushort16,% 
int2,int3,int4,int8,int16,% 
uint2,uint3,uint4,uint8,uint16,% 
long2,long3,long4,long8,long16,% 
ulong2,ulong3,ulong4,ulong8,ulong16,% 
float2,float3,float4,float8,float16,% 
image2d_t,image3d_t,sampler_t,event_t,% 
bool2,bool3,bool4,bool8,bool16,% 
half2,half3,half4,half8,half16,% 
quad,quad2,quad3,quad4,quad8,quad16,% 
complex,imaginary},% 
}%

Then, set up your listing environment as in this example:

 \lstset{language=[OpenCL]C,caption={Generated Kernel}, 
label=case:listing2, 
basicstyle=\tiny, 
backgroundcolor=\color{whitegray}, 
numbers=left, 
numberstyle=\tiny}

Hint: Alternatively, you can put this configuration directly in your TeX project. Copy and paste it after your \usepackage{listings}, replacing "\lst@definelanguage" with "\lstdefinelanguage".

GPU and CPU Double Precision Performance

2011-06-10T19:40:00.025+02:00

After checking some data from Wikipedia, vendors' specifications and so on, I made the chart above. It compares GPU and CPU performance in double precision. While the Intel Core i7 980X (Extreme Edition) gives us around 110GFLOPS (source: Tom's Hardware), GPUs such as the AMD Radeon HD 6970 and the NVidia Tesla C2090 offer more than 660GFLOPS. Obviously, these figures represent peak performance under specific conditions for each platform.

OpenCL Books

2011-06-07T23:59:00.014+02:00

OpenCL Programming Guide

By: Aaftab Munshi; Benedict Gaster; Timothy G. Mattson; James Fung; Dan Ginsburg
Publisher: Addison-Wesley Professional

Chapter 1. An Introduction to OpenCL
Chapter 2. HelloWorld: An OpenCL Example

Chapter 3. Platforms, Contexts, and Devices

Chapter 4. Programming with OpenCL C

Chapter 5. OpenCL C Built-in Functions

Chapter 6. Programs and Kernels

Chapter 7. Buffers and Sub-Buffers

Chapter 8. Images and Samplers

Chapter 9. Events

Chapter 10. Interoperability with OpenGL

Chapter 11. Interoperability with Direct3D

Chapter 12. C++ Wrapper API

Chapter 13. OpenCL Embedded Profile

Chapter 14. Image Histogram

Chapter 15. Sobel Edge Detection Filter

Chapter 16. Parallelizing Dijkstra’s Single Source Shortest Path Graph Algorithm

Chapter 17. Cloth Simulation in the Bullet Physics SDK

Chapter 18. Simulating the Ocean with Fast Fourier Transform

Chapter 19. Optical Flow

Chapter 20. Using OpenCL with PyOpenCL

Chapter 21. Matrix Multiplication with OpenCL

Chapter 22. Sparse Matrix-Vector Multiplication

Appendix A. Summary of OpenCL 1.1

The OpenCL Programming Book

Publisher: Fixstars Corporation
Author: Fixstars Corporation (Ryoji Tsuchiyama, Takashi Nakamura, Takuro Iizuka, Akihiro Asahara, Satoshi Miki)

Introduction to Parallelization

Why Parallel

Parallel Computing (Hardware)

Parallel Computing (Software)

Conclusion

OpenCL

What is OpenCL?

Historical Background

An Overview of OpenCL

Why OpenCL?

Applicable Platforms

OpenCL Setup

Available OpenCL Environments

Developing Environment Setup

First OpenCL Program

Basic OpenCL

Basic Program Flow

Online/Offline Compilation

Calling the Kernel

Advanced OpenCL

OpenCL C

OpenCL Programming Practice

Case Study

FFT (Fast Fourier Transform)

Mersenne Twister

Notes

Heterogeneous Computing with OpenCL

Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry, Dana Schaa

Introduction to Parallel Programming
Introduction to OpenCL
OpenCL Device Architectures
Basic OpenCL Examples
Understanding OpenCL's Concurrency and Execution Model
Dissecting a CPU/GPU OpenCL Implementation
OpenCL Case Study: Convolution
OpenCL Case Study: Video Processing
OpenCL Case Study: Histogram
OpenCL Case Study: Mixed Particle Simulation
OpenCL Extensions
OpenCL Profiling and Debugging
WebCL

AMD Radeon™ E6760 Embedded GPU OpenCL Compliant

2011-06-06T20:32:00.008+02:00

AMD has released the first embedded GPU offering support for OpenCL.

More information:

http://www.amd.com/us/press-releases/Pages/embedded-gpu-2011may02.aspx

http://www.amd.com/us/Documents/49807_E6760_GPU_brief.pdf

Why does NVidia not do the same for Tegra?

Anyway, let's create high-performance OpenCL applications for embedded systems.

Anjuta Project Wizards for AMD, NVidia and Intel OpenCL SDK

2011-06-03T02:40:00.060+02:00

To encourage OpenCL development, I created some wizards that set up an OpenCL application project using the SDK from NVidia, AMD or Intel. I used Anjuta DevStudio on Linux.

This is just a first attempt, so don't be disappointed if you need to make some changes. The wizards give you simple, working code that you can build on.

System requirements (they depend on your goal):

NVidia: GPU Computing SDK code samples, CUDA Toolkit (http://developer.nvidia.com)
AMD APP SDK (http://developer.amd.com)
INTEL OpenCL SDK (http://software.intel.com/en-us/articles/opencl-sdk)
Anjuta DevStudio 2.30+

OpenCL Syntax Highlighting

Usually, files with the .cl extension don't have any associated language specification. Download opencl.lang (http://www.streamcomputing.eu/downloads/?opencl.lang; thanks to the streamcomputing blog, by the way) and copy it to your ~/.local/share/gtksourceview-2.0/language-specs directory. Test it by opening a .cl file with gedit.

Anjuta DevStudio

It is a versatile software development studio featuring a number of advanced programming facilities including project management, application wizard, interactive debugger, source editor, version control, GUI designer, profiler and many more tools. It focuses on providing simple and usable user interface, yet powerful for efficient development. [http://projects.gnome.org/anjuta/]

New OpenCL Wizards

Create your ~/.local/share/anjuta/project directory (if it's not there yet), then download and unpack the wizards into it:

~/.local/share/anjuta/project$ wget https://sites.google.com/site/wendellrodrigues/projects/OpenCLWizards.tgz

~/.local/share/anjuta/project$ tar xvzf OpenCLWizards.tgz

Open Anjuta and try to create a New Project.

Now we can see new types of project wizard: AMD and Intel in the C tab, and AMD, Intel and NVidia in the C++ tab.

AMD APP SDK

Creating a C project:

For C++ applications the process is similar. The file templates are based on Template and TemplateC from the SDK and the projects will be created (default) in "samples/opencl/myprojects/app". If you change the default destination, please be careful of relative references in the Makefile.

NVidia SDK

For NVidia applications we have only the C++ wizard. Even though the samples' code is standard C, the provided Makefile and includes target g++. That's not a problem: you can write your code in C as well as C++. The wizard will ask you for the "NVIDIA_GPU_Computing_SDK" path. By default, projects will be created in "OpenCL/myprojects".

Intel SDK

As with the AMD SDK, we have C and C++ wizards, but they use Makefiles generated by the autotools. The Intel SDK for Linux is not yet stable, so these wizards rely on the folder where you have installed the lib64 (yes, for Linux we have, until now, just the 64-bit version) and include folders provided by the SDK package. By default, projects will be created at the same directory level as include and lib64.

Moreover, the Makefiles take into account these points:

CL includes: -I../../include
CL Dynamic Library: -lOpenCL (located at lib64/libOpenCL.so)

To run your OpenCL programs, you sometimes need to define the working directory, the executable and any parameters.

Final Considerations

I hope these wizards make you curious enough to start programming in OpenCL. Feel free to ask me if you have any questions, or just to comment.

OpenCL Group on LinkedIn

2011-05-28T00:04:00.006+02:00

I find it important to share information among people using OpenCL. Recently, I joined the OpenCL group on LinkedIn. Some discussions and points of view that you'll find there:

the lack of domain-specific libraries compared to CUDA, for instance. However, AMD has released libraries such as http://developer.amd.com/libraries/appmathlibs/Pages/default.aspx. Anyway, we should always keep in mind that OpenCL is younger than CUDA.
questions about why to use OpenCL in environments without GPUs, answered by people pointing to the unknown user environment (where your app will run) and to material like http://www.khronos.org/developers/library/2010_siggraph_bof_opencl/OpenCL-BOF-Intel-SIGGRAPH-Jul10.pdf, which shows the direction of applications implemented with Intel's SDK.

The group is open, and some helpful and qualified members are starting interesting discussions.

Parallel Primitives Library

2011-05-27T23:49:00.010+02:00

Let's try to improve OpenCL libraries for everything...

Libs like this:

clpp is an OpenCL Data Parallel Primitives Library. It is a library of data-parallel algorithm primitives such as parallel-prefix-sum ("scan"), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables.
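
To make the role of these primitives concrete, here is a sequential C sketch of an exclusive scan and of stream compaction built on top of it. This is not clpp's API: the names exclusive_scan and compact are hypothetical, and a data-parallel library would run both steps in parallel on the device.

```c
#include <assert.h>
#include <stddef.h>

/* Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
 * Returns the total of all inputs. */
unsigned exclusive_scan(const unsigned *in, unsigned *out, size_t n)
{
    unsigned running = 0;
    for (size_t i = 0; i < n; ++i) {
        out[i] = running;
        running += in[i];
    }
    return running;
}

/* Stream compaction: copy the elements whose flag is 1 to the front of out,
 * preserving their order. offsets is a scratch array of length n.
 * Returns the number of elements kept. */
size_t compact(const int *in, const unsigned *keep, unsigned *offsets,
               int *out, size_t n)
{
    /* The scan of the 0/1 flags gives each kept element its output slot. */
    size_t kept = exclusive_scan(keep, offsets, n);

    /* Every iteration writes a distinct slot, so this loop has no
     * dependencies between elements: that is the data-parallel part. */
    for (size_t i = 0; i < n; ++i) {
        assert(keep[i] <= 1);   /* flags must be 0 or 1 */
        if (keep[i])
            out[offsets[i]] = in[i];
    }
    return kept;
}
```

With the input {5, -1, 7, 0, 3} and the flags {1, 0, 1, 0, 1}, the scan yields the offsets {0, 1, 1, 2, 2} and the compacted output starts with 5, 7, 3.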

If you want to join the project, please simply send a message to our mailing list: http://groups.google.com/group/cl-pp

Thanks to Polar Lights?? initiative. ;)

Profiling your application

2011-05-25T15:44:00.008+02:00

I was away for a while, but I'm back. In my recent projects, I have collected some results from profiling tools. For a better understanding, I grabbed some resources from Marcus Bannerman's website. Here is the link:

Profiling in Linux

I didn't ask for permission to publish it on this blog, but I'm sure he won't be upset about that.

Take a look at the other articles on his website. There is a lot of material about OpenCL: http://www.marcusbannerman.co.uk

Laboratoire Jacques-Louis Lions - Paris VI: Gaspard2 and OpenCL Presentation

2010-04-30T17:57:00.003+02:00

Gaspard2 and OpenCL on Prezi

Global Synchronization??

2010-04-30T17:41:00.003+02:00

The ATI Radeon™ HD 5870 (“Cypress”) architecture (2009) has global synchronization. According to its specification:
2.72 Teraflops Single Precision, 544 Gigaflops Double Precision

Full Hardware Implementation of DirectCompute 11 and OpenCL™ 1.0
IEEE754-2008 Compliance Enhancements
Additional Compute Features:
32-bit Atomic Operations
Flexible 32kB Local Data Shares
64kB Global Data Share
Global synchronization
Append/consume buffers

I searched for more information about it, examples in OpenCL, etc., but I didn't find anything. If you have any tips, write to me.

Eyeon Software Unveils Fusion 6.1 with OpenCL Supercomputing

2010-04-22T13:47:00.001+02:00

From Khronos Group News:

Fusion 6.1 now supports the OpenCL language, which allows tools to take advantage of the GPU in modern NVIDIA and ATI graphics cards to achieve tremendous speed increases. Benchmarking improvements of up to 1000% on some of the most processor-intensive tools in Fusion (such as Defocus and Noise generators) have been reached. You can also insert OpenCL code directly into Fuse tools to create in-house GPU-accelerated tools.

http://www.eyeonline.com/Web/EyeonWeb/Press/DisplayArticle.aspx?articleid=406

Code Generation for OpenCL (Prezi Presentation)

2010-04-13T14:09:00.006+02:00

I'm working hard on our compiler, which generates OpenCL source code. Meanwhile, I have posted this presentation, made on Prezi for the Valeo PhD Students Seminar.

Gaspard2, GPU and Valeo on Prezi

OpenCL Kernel for Scalar Product with Atomic Operations

2009-12-08T20:03:00.011+01:00

In the usual dot-product example, the final sum is computed on the CPU. This is an implementation of the scalar product (dot product) that needs no final reduction on the host side: it uses atomic operations instead.


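/*
 * sDOT OpenCL Kernel Function for Level 1 BLAS Dot Product dot<-xy
 * Author: Wendell Rodrigues
 * INRIA-Lille :: DaRT Team
 *
 * Assumes a power-of-two work-group size, and that the host sets
 * DOT[0] and FLAG[0] to zero before launching the kernel
 * (zeroing them inside the kernel would race with other work-groups).
 */
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void sDOT(
    const unsigned int N,
    __global const float* X,
    __global const float* Y,
    __global float* DOT,
    __global int* FLAG,
    __local float* sdata
)
{
    // local index in the work-group and index into the global arrays
    unsigned int tid = get_local_id(0);
    unsigned int i = get_global_id(0);

    // each work-item loads one product into local memory (0 past the end)
    sdata[tid] = (i < N) ? X[i] * Y[i] : 0.0f;

    barrier(CLK_LOCAL_MEM_FENCE);

    // do reduction in local (shared) memory
    for (unsigned int s = 1; s < get_local_size(0); s *= 2)
    {
        unsigned int index = 2 * s * tid;

        if (index < get_local_size(0))
        {
            sdata[index] += sdata[index + s];
        }

        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write result for this work-group to global memory;
    // FLAG is a spin lock: only one work-group updates DOT at a time
    if (tid == 0) {
        while (atom_cmpxchg(FLAG, 0, 1) == 1);
        DOT[0] += sdata[0];
        mem_fence(CLK_GLOBAL_MEM_FENCE);
        atom_cmpxchg(FLAG, 1, 0);
    }
}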
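
The kernel can be checked against a CPU emulation of the same scheme. The plain C sketch below mimics one work-group's tree reduction and then adds the partial sums one group at a time, as the lock-protected update does on the device. The function names are hypothetical, and the work-group size is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Emulate one work-group: load the products into sdata (the stand-in for
 * local memory), then apply the same interleaved tree reduction as the
 * kernel. sdata must hold local_size floats. Returns the group's sum. */
float group_partial_dot(const float *x, const float *y, size_t n,
                        size_t group_id, size_t local_size, float *sdata)
{
    /* index + s stays in range only for power-of-two group sizes */
    assert(local_size > 0 && (local_size & (local_size - 1)) == 0);

    for (size_t tid = 0; tid < local_size; ++tid) {
        size_t i = group_id * local_size + tid;
        sdata[tid] = (i < n) ? x[i] * y[i] : 0.0f;
    }

    /* Within one stage no work-item reads a slot that another one writes,
     * so running the work-items one after another gives the same result. */
    for (size_t s = 1; s < local_size; s *= 2) {
        for (size_t tid = 0; tid < local_size; ++tid) {
            size_t index = 2 * s * tid;
            if (index < local_size)
                sdata[index] += sdata[index + s];
        }
    }
    return sdata[0];
}

/* Emulate the whole launch: one partial sum per work-group, accumulated
 * serially into the final result. */
float dot_emulated(const float *x, const float *y, size_t n,
                   size_t local_size, float *sdata)
{
    size_t groups = (n + local_size - 1) / local_size;
    float dot = 0.0f;
    for (size_t g = 0; g < groups; ++g)
        dot += group_partial_dot(x, y, n, g, local_size, sdata);
    return dot;
}
```

For x = 1..10 and y = 2 everywhere, with a work-group size of 4, the three groups contribute 20, 52 and 38, which add up to the expected 110. On the device the groups may take the lock in any order, so with general floating-point data the last bits of the result can differ from this serial emulation.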

Modeling Challenges on OpenCL Code Generation

2009-11-10T12:52:00.002+01:00

This week, I presented an overview of the integration of OpenCL and Gaspard2. In order to overcome the many challenges of model design and transformation, we need to study a new MoC (model of computation) other than Array-OL, coalesced memory allocation and task distribution. The slides contain two good examples of OpenCL applications and some questions about model design and code generation.

PDF

OpenWF - The Standard for building composited windowing systems

2009-11-10T12:43:00.003+01:00

From Khronos Group:

Embedded devices are increasingly expected to offer sophisticated user interfaces that combine rich graphics with multimedia content. Graphics and display hardware technologies have evolved to achieve these visuals with significantly higher efficiency than traditional CPUs, delivering greater performance, decreasing memory bandwidth usage and increasing battery life. Making use of this variety of hardware introduces fragmentation as software needs to be adapted to each hardware configuration.

A platform’s Hardware Abstraction Layer (HAL) for display and graphics technology allows the applications and middleware layers above to be deployed across a range of hardware without costly porting activities. OpenGL is an example of a graphics HAL that allows portable software to take advantage of a wide range of 3D hardware accelerators.

Windowing systems allow screens to be shared by multiple applications, ensuring that the graphics provided for each application’s window is sensibly merged onto the screen. This requires the graphics and display drivers to respect the intentions of the windowing system, which commonly means considerable OS-specific porting work on the part of the device manufacturer when moving to new hardware.

The OpenWF APIs provide an OS-independent and hardware-neutral foundation for building compositing systems, particularly suited to implementing windowing systems. OpenWF acts as a HAL to achieve composition of content and configuration of display devices. The interfaces are designed for use by a single user which could be a central windowing system or, in an application-specific system, may be the application itself.

http://www.khronos.org/openwf/

Conjugate Gradient and OpenCL

2009-10-23T17:18:00.003+02:00

I've just finished a conjugate gradient implementation for OpenCL. It does not perform well yet, but I'm working on fixing the bugs and optimizing the code.

Here is a PDF that gives an overview of the subject.

Nvidia's Next Generation: Fermi - key architectural highlights

2009-10-16T23:57:00.007+02:00

Third Generation Streaming Multiprocessor (SM)
32 CUDA cores per SM, 4x over GT200
8x the peak double precision floating point performance over GT200
Dual Warp Scheduler simultaneously schedules and dispatches instructions from two independent warps
64 KB of RAM with a configurable partitioning of shared memory and L1 cache

Second Generation Parallel Thread Execution ISA
Unified Address Space with Full C++ Support
Optimized for OpenCL and DirectCompute
Full IEEE 754-2008 32-bit and 64-bit precision
Full 32-bit integer path with 64-bit extensions
Memory access instructions to support transition to 64-bit addressing
Improved Performance through Predication

Improved Memory Subsystem
NVIDIA Parallel DataCache™ hierarchy with Configurable L1 and Unified L2 Caches
First GPU with ECC memory support
Greatly improved atomic memory operation performance

NVIDIA GigaThread™ Engine
10x faster application context switching
Concurrent kernel execution
Out of Order thread block execution
Dual overlapped memory transfer engines
More information: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

ATI Stream Software Development Kit (SDK) v2.0 Beta Program

2009-10-15T11:13:00.001+02:00

What’s New in v2.0-beta4

First beta release of ATI Stream SDK with OpenCL™ GPU support.
ATI Stream SDK v2.0 OpenCL™ is certified OpenCL™ 1.0 conformant by Khronos.
Added Microsoft® Windows® 7 support.
Added native Microsoft® Windows® 64-bit support.
Float comparisons in kernels no longer produce a runtime error.
Various other issues from previous v2.0 beta releases have been resolved.
More information: http://developer.amd.com/GPU/ATISTREAMSDKBETAPROGRAM/Pages/default.aspx

OpenCL BLAS - Makefile for MAC

2009-10-08T12:12:00.002+02:00

Thanks to Mario Rometsch for a version of the OpenCL BLAS Makefile for MacOS. You can download it from SourceForge.

OpenCL BLAS Makefile for MacOS

BLAS Library for OpenCL

2009-09-21T15:26:00.012+02:00

I use the conjugate gradient solver without preconditioners to solve a linear system Ax=b, where A is a sparse matrix. This method is iterative and uses some BLAS functions like Dot Product, Scalar Product, xAXPY and xGEMV (SpMV for sparse matrices). I've started to develop these functions for the OpenCL language and I've decided to share them.
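
For readers who want to see how these routines fit together, the following is a minimal CPU sketch of unpreconditioned conjugate gradient in plain C, with the matrix stored in CSR format. It is not the OpenCL BLAS code: the helper names (sdot, saxpy, spmv_csr, cg_solve) and the CSR layout are assumptions made for illustration, and A is assumed to be symmetric positive definite.

```c
#include <assert.h>
#include <stddef.h>

/* dot <- x.y (the role of sDOT) */
float sdot(size_t n, const float *x, const float *y)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i] * y[i];
    return acc;
}

/* y <- a*x + y (the role of sAXPY) */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* y <- A*x for a sparse matrix A in CSR format (the SpMV step) */
void spmv_csr(size_t n, const size_t *row_ptr, const size_t *col,
              const float *val, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            acc += val[k] * x[col[k]];
        y[i] = acc;
    }
}

/* Solve A x = b starting from x = 0. r, p and q are caller-provided
 * scratch vectors of length n. Stops when the squared residual norm
 * drops below tol*tol or after max_iter iterations.
 * Returns the number of iterations performed. */
int cg_solve(size_t n, const size_t *row_ptr, const size_t *col,
             const float *val, const float *b, float *x,
             float *r, float *p, float *q, int max_iter, float tol)
{
    assert(n > 0 && max_iter > 0);

    for (size_t i = 0; i < n; ++i) {
        x[i] = 0.0f;
        r[i] = b[i];              /* r0 = b - A*0 = b */
        p[i] = b[i];              /* p0 = r0 */
    }

    float rr = sdot(n, r, r);
    int k = 0;
    while (k < max_iter && rr > tol * tol) {
        spmv_csr(n, row_ptr, col, val, p, q);   /* q = A p */
        float alpha = rr / sdot(n, p, q);
        saxpy(n, alpha, p, x);                  /* x = x + alpha p */
        saxpy(n, -alpha, q, r);                 /* r = r - alpha q */
        float rr_new = sdot(n, r, r);
        float beta = rr_new / rr;
        for (size_t i = 0; i < n; ++i)          /* p = r + beta p */
            p[i] = r[i] + beta * p[i];
        rr = rr_new;
        ++k;
    }
    return k;
}
```

For the 2x2 system with A = [[4, 1], [1, 3]] and b = [1, 2], the sketch reaches x ≈ [0.0909, 0.6364] (that is, [1/11, 7/11]) in two iterations. Each iteration costs one SpMV, two dot products and a handful of AXPY-like updates, which is why those routines are the ones worth moving to the GPU.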

Right now, the following BLAS level 1 functions are available:
sDOT :: single precision dot product or scalar product (dot<-xy)
sNRM2 :: single precision vector 2-norm
sSCAL :: single precision product of vector by scalar (x<-ax)
sAXPY :: single precision AXPY (y<-ax + y)

You can download the OpenCL code, which was tested on an NVIDIA Tesla C870 with GPU Computing SDK 2.3.

SourceForge Project

Please join in and contribute!

Update: OpenCL BLAS is now a discontinued project.

ATI Stream Software Development Kit (SDK) v2.0 Beta Program With OpenCL™ 1.0 Support

2009-09-14T16:29:00.004+02:00

With the ATI Stream SDK, AMD/ATI provides a way to program OpenCL on its cards. I haven't downloaded it yet, but you can get more information at:

http://developer.amd.com/GPU/ATISTREAMSDKBETAPROGRAM/Pages/default.aspx

and on the OpenCL/ATI forum:
http://forums.amd.com/devforum/categories.cfm?catid=390&entercat=y

I'm going to test it and I will post an overview here.

GPU and Matlab

2009-09-11T17:47:00.004+02:00

If you like programming in a Matlab-like environment, I suggest the freeware GPUMat from the GP-you Group. You can explore the power of GPUs, BLAS and FFT libraries on NVIDIA cards. You can get more information at:
GP-you Group