GPU/OpenCL Modeling: June 2011

Tuesday, June 21, 2011

PhD Proposal: University of Perpignan

Title: Real-time implementations of power flow algorithms in the context of multiprocessor.

Keywords: Power-Flow analysis, electrical network, multiprocessor, High Performance Computing under constraint, OpenCL

Supervisor/contact: David Defour (david.defour'AT'univ-perp.fr)

Location: Perpignan/France

Scientific context: Electrical network simulation absorbs considerable effort in today's international community. Networks are currently sized statically to meet expected and projected demand, a model that is said to offer full accessibility. The electrical network of the future, especially if one takes into account the energy needs of electric vehicles, can no longer follow this model of full accessibility: vehicles' consumption will be highly correlated (peak hours, commuting between home and office in the mornings and evenings), and networks would have to be resized drastically to take these problems into account.

"Load flow" refers to computing the state of the distribution system from its loads. Although many projects and software packages exist for load flow simulation, none of them offers the performance needed for acceptable responsiveness. For example, in a given area, if 10 000 vehicles connect within 30 minutes, each vehicle would require a response within 20 ms. This is not possible under budget constraints, as it would require a supercomputer or a computing grid in every area that has to be considered (car park, city, region, ...).

Goal: The purpose of this thesis is to propose mechanisms that lead to several implementations of the same algorithm, each targeting different objectives, in the context of multicore architectures. On such architectures, improving performance is the main aim when developing applications such as the one presented above.

However, new challenges need to be addressed in order to achieve performance for a reasonable investment. These challenges can be represented by the trade-off between two conflicting goals. On one side, opacity is required to hide from the programmer unnecessary details about the hardware, the memory hierarchy and the communication network. On the other side, visibility of the fundamental elements is necessary to let the programmer harness the power of today's multicore architectures.

Additional constraints also have to be taken into account. Among them, power consumption is nearly as important as performance, and the two usually affect each other. In this context, several implementations of the kernel have to be considered, depending on the targeted hardware or the workload of the architecture. This is kernel versioning.
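As an illustration of kernel versioning, here is a minimal sketch in C of how a runtime might pick one implementation among several. Everything in it is an assumption made for the example: the type and function names (objective_t, kernel_version_t, select_version), the idea that latency and energy figures were measured offline, and the use of a maximum problem size to stand in for a hardware constraint. It is not the framework the proposal intends to build.

```c
#include <assert.h>
#include <stddef.h>

/* Objective the caller wants to optimise (hypothetical names). */
typedef enum { MIN_LATENCY, MIN_ENERGY } objective_t;

/* One implementation of the same kernel, with assumed offline measurements. */
typedef struct {
    const char *name;        /* label of the variant */
    double latency_ms;       /* assumed measured run time */
    double energy_j;         /* assumed measured energy per run */
    size_t max_problem_size; /* largest input this variant can handle */
} kernel_version_t;

/* Cost of a version under the requested objective. */
static double version_cost(const kernel_version_t *v, objective_t objective)
{
    return objective == MIN_LATENCY ? v->latency_ms : v->energy_j;
}

/* Return the cheapest version able to process problem_size, or NULL if none. */
const kernel_version_t *select_version(const kernel_version_t *versions,
                                       size_t count, size_t problem_size,
                                       objective_t objective)
{
    const kernel_version_t *best = NULL;
    assert(versions != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        const kernel_version_t *v = &versions[i];
        if (problem_size > v->max_problem_size)
            continue; /* hardware constraint: the input does not fit */
        if (best == NULL ||
            version_cost(v, objective) < version_cost(best, objective))
            best = v;
    }
    return best;
}
```

With a table holding a fast but power-hungry variant and a slower, frugal one, the same call returns a different version depending on whether MIN_LATENCY or MIN_ENERGY is requested, and falls back to the remaining variant when the input exceeds the first one's size limit.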

The objective of this thesis is to define a framework that extends the usual criteria leading to multi-versioning with, for example:

Hardware constraints: We are proposing to consider the usual constraints based on hardware characteristics, like memory, network interconnect and communication requirements of the application.

Context dependent constraints: For a given task several algorithms can be considered depending on the size of the data or the objectives to achieve (power, latency or bandwidth).

Regularity constraints: Regularity is essential when considering parallel applications. Depending on the architecture, we have to consider regularity on data structures, data values, execution or control.

Reliability constraints: Depending on the hardware, we may have a memory hierarchy with or without ECC. We may want to handle hardware errors that happen during computation by duplicating the computation either in time or in space, according to the architecture.

Accuracy constraints: Multicore architectures make use of many floating-point units. These units implement the IEEE 754 standard, whose 2008 revision defines several representation formats, ranging from 16 to 128 bits. This encourages the adoption of mixed precision in floating-point based algorithms.
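To make the mixed-precision idea concrete, here is a small sketch in C that sums the same single-precision data twice: once with a single-precision accumulator and once with a double-precision one. The function names sum_single and sum_mixed are made up for this example, and the choice of a summation is only an assumption about what a mixed-precision kernel might look like.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulate in single precision (IEEE 754 binary32). */
float sum_single(const float *x, size_t n)
{
    float acc = 0.0f;
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Mixed precision: binary32 inputs, binary64 accumulator. */
double sum_mixed(const float *x, size_t n)
{
    double acc = 0.0;
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        acc += (double)x[i];
    return acc;
}
```

For the input {16777216.0f, 1.0f, 1.0f, 1.0f, 1.0f}, sum_mixed returns 16777220, while on hardware that evaluates float arithmetic in binary32 sum_single stays at 16777216, because 2^24 + 1 is not representable in single precision. The storage format is the same in both cases; only the accumulator differs.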

Proposed implementations will be tested and validated by an industrial partner of this project.

Monday, June 20, 2011

Printing OpenCL binary...

I found this snippet (originally in C++) somewhere I can't remember. After a few changes to make it standard C, here is how to write out your OpenCL program binary:
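A minimal sketch of such a routine in standard C, not necessarily identical to the snippet referred to here. It queries a built program with clGetProgramInfo (CL_PROGRAM_NUM_DEVICES, CL_PROGRAM_BINARY_SIZES, CL_PROGRAM_BINARIES) and writes one file per device. The names write_binary and dump_program_binaries, the ".bin" file naming and the USE_OPENCL macro are assumptions made for this example; the macro only lets the file compile on a machine without an OpenCL SDK.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Write `size` bytes to `path`; returns 0 on success, -1 on failure. */
int write_binary(const char *path, const unsigned char *data, size_t size)
{
    FILE *f;
    size_t written;
    assert(path != NULL && data != NULL);
    f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    written = fwrite(data, 1, size, f);
    if (fclose(f) != 0)
        return -1;
    return written == size ? 0 : -1;
}

#ifdef USE_OPENCL /* hypothetical macro: define it when building against an OpenCL SDK */
#include <CL/cl.h>

/* Dump one file per device: <prefix>_0.bin, <prefix>_1.bin, ... */
int dump_program_binaries(cl_program cpProgram, const char *prefix)
{
    cl_uint n = 0, i;
    size_t *sizes = NULL;
    unsigned char **bins = NULL;
    int rc = -1;

    if (clGetProgramInfo(cpProgram, CL_PROGRAM_NUM_DEVICES,
                         sizeof n, &n, NULL) != CL_SUCCESS || n == 0)
        return -1;
    sizes = calloc(n, sizeof *sizes);
    bins = calloc(n, sizeof *bins);
    if (sizes == NULL || bins == NULL)
        goto done;
    if (clGetProgramInfo(cpProgram, CL_PROGRAM_BINARY_SIZES,
                         n * sizeof *sizes, sizes, NULL) != CL_SUCCESS)
        goto done;
    for (i = 0; i < n; ++i) {
        bins[i] = malloc(sizes[i] + 1); /* +1 avoids a zero-byte allocation */
        if (bins[i] == NULL)
            goto done;
    }
    if (clGetProgramInfo(cpProgram, CL_PROGRAM_BINARIES,
                         n * sizeof *bins, bins, NULL) != CL_SUCCESS)
        goto done;
    rc = 0;
    for (i = 0; i < n; ++i) {
        char path[512];
        if (sizes[i] == 0)
            continue; /* no binary available for this device */
        snprintf(path, sizeof path, "%s_%u.bin", prefix, (unsigned)i);
        if (write_binary(path, bins[i], sizes[i]) != 0)
            rc = -1;
    }
done:
    if (bins != NULL)
        for (i = 0; i < n; ++i)
            free(bins[i]);
    free(bins);
    free(sizes);
    return rc;
}
#endif
```

Assuming an SDK is installed, this would typically be built with something like -DUSE_OPENCL -lOpenCL and called as dump_program_binaries(cpProgram, "kernel") once clBuildProgram has succeeded. What ends up in the files depends on the vendor.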

Just replace the cpProgram variable by yours. See below an example of result (NVidia PTX code):

Monday, June 13, 2011

OpenCL programs in LaTeX (listings package)

When we need to put OpenCL snippets in LaTeX documents, we usually fall back on the C syntax definition of the listings package. I've added the definitions below to the lstlang1.sty file (you can find it in the shared directory of your TeX installation). Put them just after the C++ language definitions and before Objective C.

 \lst@definelanguage[OpenCL]{C}[ANSI]{C} 
{morekeywords={__kernel,kernel,__local,local,__global,global,% 
__constant,constant,__private,private,% 
char2,char3,char4,char8,char16,% 
uchar2,uchar3,uchar4,uchar8,uchar16,% 
short2,short3,short4,short8,short16,% 
ushort2,ushort3,ushort4,ushort8,ushort16,% 
int2,int3,int4,int8,int16,% 
uint2,uint3,uint4,uint8,uint16,% 
long2,long3,long4,long8,long16,% 
ulong2,ulong3,ulong4,ulong8,ulong16,% 
float2,float3,float4,float8,float16,% 
image2d_t,image3d_t,sampler_t,event_t,% 
bool2,bool3,bool4,bool8,bool16,% 
half2,half3,half4,half8,half16,% 
quad,quad2,quad3,quad4,quad8,quad16,% 
complex,imaginary},% 
}%

Then, set up your listing environment as in this example:

 \lstset{language=[OpenCL]C,caption={Generated Kernel}, 
label=case:listing2, 
basicstyle=\tiny, 
backgroundcolor=\color{whitegray}, 
numbers=left, 
numberstyle=\tiny}

Hint: Alternatively, you can put this definition directly in your tex project. Copy and paste it after your \usepackage{listings} line, replacing "\lst@definelanguage" with "\lstdefinelanguage".

Friday, June 10, 2011

GPU and CPU Double Precision Performance

After checking data from Wikipedia, vendor specifications and other sources, I made the chart above, which compares GPU and CPU performance in double precision. While the Intel Core i7 980X (Extreme Edition) gives us around 110 GFLOPS (source: Tom's Hardware), GPUs such as the AMD Radeon HD 6970 and the NVidia C2090 offer more than 660 GFLOPS. These figures are peak performance under specific conditions for each platform, so sustained performance in a real application will be lower.

Tuesday, June 7, 2011

OpenCL Books

OpenCL Programming Guide

By: Aaftab Munshi; Benedict Gaster; Timothy G. Mattson; James Fung; Dan Ginsburg
Publisher: Addison-Wesley Professional

Chapter 1. An Introduction to OpenCL
Chapter 2. HelloWorld: An OpenCL Example

Chapter 3. Platforms, Contexts, and Devices

Chapter 4. Programming with OpenCL C

Chapter 5. OpenCL C Built-in Functions

Chapter 6. Programs and Kernels

Chapter 7. Buffers and Sub-Buffers

Chapter 8. Images and Samplers

Chapter 9. Events

Chapter 10. Interoperability with OpenGL

Chapter 11. Interoperability with Direct3D

Chapter 12. C++ Wrapper API

Chapter 13. OpenCL Embedded Profile

Chapter 14. Image Histogram

Chapter 15. Sobel Edge Detection Filter

Chapter 16. Parallelizing Dijkstra’s Single Source Shortest Path Graph Algorithm

Chapter 17. Cloth Simulation in the Bullet Physics SDK

Chapter 18. Simulating the Ocean with Fast Fourier Transform

Chapter 19. Optical Flow

Chapter 20. Using OpenCL with PyOpenCL

Chapter 21. Matrix Multiplication with OpenCL

Chapter 22. Sparse Matrix-Vector Multiplication

Appendix A. Summary of OpenCL 1.1

The OpenCL Programming Book

By: Fixstars Corporation (Ryoji Tsuchiyama, Takashi Nakamura, Takuro Iizuka, Akihiro Asahara, Satoshi Miki)
Publisher: Fixstars Corporation

Introduction to Parallelization

Why Parallel

Parallel Computing (Hardware)

Parallel Computing (Software)

Conclusion

OpenCL

What is OpenCL?

Historical Background

An Overview of OpenCL

Why OpenCL?

Applicable Platforms

OpenCL Setup

Available OpenCL Environments

Developing Environment Setup

First OpenCL Program

Basic OpenCL

Basic Program Flow

Online/Offline Compilation

Calling the Kernel

Advanced OpenCL

OpenCL C

OpenCL Programming Practice

Case Study

FFT (Fast Fourier Transform)

Mersenne Twister

Notes

Heterogeneous Computing with OpenCL

Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry, Dana Schaa

Introduction to Parallel Programming
Introduction to OpenCL
OpenCL Device Architectures
Basic OpenCL Examples
Understanding OpenCL's Concurrency and Execution Model
Dissecting a CPU/GPU OpenCL Implementation
OpenCL Case Study: Convolution
OpenCL Case Study: Video Processing
OpenCL Case Study: Histogram
OpenCL Case Study: Mixed Particle Simulation
OpenCL Extensions
OpenCL Profiling and Debugging
WebCL

Monday, June 6, 2011

AMD Radeon™ E6760 Embedded GPU OpenCL Compliant

AMD has released the first embedded GPU offering support for OpenCL.

More information:

http://www.amd.com/us/press-releases/Pages/embedded-gpu-2011may02.aspx

http://www.amd.com/us/Documents/49807_E6760_GPU_brief.pdf

Why does NVidia not do the same for Tegra?

Anyway, let's create high-performance OpenCL applications for embedded systems.

Friday, June 3, 2011

Anjuta Project Wizards for AMD, NVidia and Intel OpenCL SDK

To encourage OpenCL development, I created some wizards that start up an OpenCL application project using the SDK from NVidia, AMD or Intel. I used Anjuta DevStudio on Linux.

This is just a first attempt, so don't be disappointed if you need to make some changes. The wizards give you simple, functional code that you can build on.

System requirements (they depend on your goal):

NVidia: GPU Computing SDK code samples, CUDA Toolkit (http://developer.nvidia.com)
AMD: AMD APP SDK (http://developer.amd.com)
Intel: Intel OpenCL SDK (http://software.intel.com/en-us/articles/opencl-sdk)
Anjuta DevStudio 2.30+