Wednesday, July 6, 2011

2 PhD studentships in Programming and Design of Low-power Heterogeneous Multi-Core Systems

The University of Edinburgh and ARM invite applications for two PhD studentships in the general area of heterogeneous multi-cores. These positions are supported by ARM as part of the ARM Research Centre of Excellence at U.Edinburgh.  
These fully funded PhD positions are open to students of **all** nationalities.  
The centre covers a wide range of topics, including (but not limited to):
  • Mapping and scheduling for low power on Heterogeneous Multi-Processing
  • Programming for next generation CPU and GPGPU systems
  • Software and hardware to make multi-processing accessible to programmers
  • Parallelism discovery and automatic parallelisation of sequential programs
  • Compilation for low power
  • Compilers and runtime for next generation ARM architectures
  • Data-centre scale parallelism
  • Hardware assistance in detection of non-data race free concurrent programs
  • Security between cloud and terminal
  • High performance low power micro-architecture
Suitable candidates will have a strong first degree in Computer Science and a strong interest in parallel programming, optimizing compilers, runtime systems, computer architecture, security or machine learning.  The exact topic of the PhD depends on the candidate's interests.  We are looking for the brightest minds to pursue research in a cutting-edge arena.
The start date is flexible.
Candidates are encouraged to contact Michael O'Boyle to informally discuss the project further. More information on our work can be found at: 
Formal application will then be through the School's normal PhD application process:

Tuesday, June 21, 2011

PhD Proposal: University of Perpignan

Title: Real-time implementations of power flow algorithms in the context of multiprocessors.

Keywords: Power-Flow analysis, electrical network, multiprocessor, High Performance Computing under constraint, OpenCL

Supervisor/contact: David Defour (david.defour'AT'

Location: Perpignan/France

Scientific context: Electrical network simulation currently absorbs considerable effort in the international community. Today's networks are statically sized to meet expected and projected demand, and are assumed to be fully accessible at all times. The electrical network of the future, especially once the energy needs of electric vehicles are taken into account, can no longer follow this model of full accessibility: vehicle consumption will be highly correlated (peak hours, with commuting between home and office in the mornings and evenings), and networks would have to be drastically resized to cope with these problems.

"Load flow" refers to the calculation of how power is distributed through the system according to the loads. Although many projects and software packages exist for load flow simulation, none of them offers the performance needed for adequate responsiveness. For example, if 10 000 vehicles connect within 30 minutes in a given area, each vehicle would require a response within 20 ms. Meeting this is not feasible within budget constraints, as it would require a supercomputer or a computing grid in every area under consideration (car park, city, region, ...).
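
To make the numbers in this example concrete, here is a small back-of-the-envelope sketch in C. It assumes the connections are spread evenly over the 30-minute window, which is optimistic given the correlated peaks described above, and both function names are made up for this illustration.

```c
#include <assert.h>

/* Average time between two vehicle connections, in milliseconds.
 * Assumption: arrivals are spread evenly over the window. */
double mean_interarrival_ms(unsigned vehicles, double window_minutes)
{
    assert(vehicles > 0);
    return window_minutes * 60.0 * 1000.0 / vehicles;
}

/* Fraction of one solver's time that is busy if every request
 * is answered in exactly response_ms (hypothetical helper). */
double solver_utilisation(unsigned vehicles, double window_minutes,
                          double response_ms)
{
    return response_ms / mean_interarrival_ms(vehicles, window_minutes);
}
```

With 10 000 vehicles in 30 minutes, mean_interarrival_ms gives 180 ms between connections on average. The arithmetic only fixes the time budget: whether a load flow computation for a whole network fits in 20 ms, especially when many vehicles connect at the same moment, is the hard part the proposal is about.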

Goal: The purpose of this thesis is to propose mechanisms that lead to several implementations of the same algorithm, each targeting different objectives, in the context of multicore architectures. On such architectures, improving performance is the main aim when developing applications like the one presented above.

However, new challenges must be addressed in order to achieve performance for a reasonable investment. These challenges can be summarised as a trade-off between two conflicting goals. On one side, opacity is required to hide unnecessary details about the hardware, the memory hierarchy and the communication network from the programmer. On the other side, visibility of the fundamental elements is necessary to let the programmer harness the power of today's multicore architectures.

Additional constraints also have to be taken into account. Among them, power consumption is nearly as important as performance, and each usually affects the other. In this context, several implementations of a kernel have to be considered, depending on the targeted hardware or the workload of the architecture. This is kernel versioning.
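
As a rough illustration of kernel versioning, the C sketch below keeps two implementations of the same trivial kernel in a table and picks one at run time from an objective and a problem size. Everything in it (the type names, the two variants, the 1024-element threshold and the mapping of variants to objectives) is invented for the example and does not come from the proposal.

```c
#include <assert.h>
#include <stddef.h>

/* What the caller wants to optimise for (hypothetical). */
typedef enum { OBJ_LOW_POWER, OBJ_LOW_LATENCY } objective_t;

typedef void (*kernel_fn)(const float *in, float *out, size_t n);

/* One entry per available version of the kernel. */
typedef struct {
    kernel_fn   run;
    objective_t objective;  /* objective this version is meant for */
    size_t      min_n;      /* smallest problem size it is suited to */
} kernel_version_t;

/* Version 1: straightforward loop. */
static void scale_simple(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = 2.0f * in[i];
}

/* Version 2: same result, unrolled four times. */
static void scale_unrolled(const float *in, float *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     = 2.0f * in[i];
        out[i + 1] = 2.0f * in[i + 1];
        out[i + 2] = 2.0f * in[i + 2];
        out[i + 3] = 2.0f * in[i + 3];
    }
    for (; i < n; ++i)  /* remaining elements */
        out[i] = 2.0f * in[i];
}

/* Illustrative table; real entries would come from measurements. */
static const kernel_version_t versions[] = {
    { scale_simple,   OBJ_LOW_POWER,   0    },
    { scale_simple,   OBJ_LOW_LATENCY, 0    },
    { scale_unrolled, OBJ_LOW_LATENCY, 1024 },
};

/* Pick the matching version with the largest min_n not exceeding n. */
kernel_fn select_version(objective_t objective, size_t n)
{
    const kernel_version_t *best = NULL;
    for (size_t i = 0; i < sizeof versions / sizeof versions[0]; ++i) {
        const kernel_version_t *v = &versions[i];
        if (v->objective != objective || v->min_n > n)
            continue;
        if (best == NULL || v->min_n > best->min_n)
            best = v;
    }
    assert(best != NULL);  /* the table has a min_n == 0 entry per objective */
    return best->run;
}
```

A call such as select_version(OBJ_LOW_LATENCY, n)(in, out, n) then runs whichever version the table prefers for that size. The criteria listed next would, in this picture, become additional columns of the table.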

The objective of this thesis is to define a framework in which the usual criteria that lead to multi-versioning can be extended with, for example:
Hardware constraints: We propose to consider the usual constraints based on hardware characteristics, such as memory, network interconnect and the communication requirements of the application.
Context dependent constraints: For a given task several algorithms can be considered depending on the size of the data or the objectives to achieve (power, latency or bandwidth).
Regularity constraints: Regularity is essential when considering parallel applications. Depending on the architecture, we have to consider regularity on data structures, data values, execution or control.
Reliability constraints: Depending on the hardware, we may have a memory hierarchy with or without ECC. We may want to handle hardware errors that occur during computation by duplicating the computation either in time or in space, according to the architecture.
Accuracy constraints: Multicore architectures make use of many floating-point units. These units implement the IEEE 754 standard, whose 2008 revision introduces various representation formats, ranging from 16 to 128 bits. This encourages the adoption of mixed precision in floating-point algorithms.
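
To give an idea of what mixed precision means in practice, here is a minimal C sketch that sums the same 32-bit data twice: once with a 32-bit accumulator and once with a 64-bit one. The function names are hypothetical, and the choice of a plain sum is only for illustration; it is not an algorithm taken from the proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Everything in single precision: data and accumulator are 32-bit. */
float sum_f32(const float *x, size_t n)
{
    assert(x != NULL);
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Mixed precision: data stays in 32-bit storage, but the running
 * sum is kept in 64 bits to limit the accumulated rounding error. */
double sum_mixed(const float *x, size_t n)
{
    assert(x != NULL);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += (double)x[i];
    return acc;
}
```

For an array of one million elements all equal to 0.1f, on a build where float arithmetic is evaluated in IEEE 754 single precision, sum_f32 drifts away from 100000 by several hundred, while sum_mixed stays within about 0.01 of it. The data still costs only 4 bytes per element, which is the kind of trade-off between accuracy, memory and power that this constraint is concerned with.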

Proposed implementations will be tested and validated by an industrial partner of this project.

Monday, June 20, 2011

Printing OpenCL binary...

I found this snippet (originally in C++) somewhere I can't remember. With a few changes to make it standard C, here is how to write out your OpenCL program binary:

Just replace the cpProgram variable with yours. Below is an example of the result (NVidia PTX code):

Monday, June 13, 2011

OpenCL programs in LaTeX (listings package)

When we need to put OpenCL snippets in LaTeX documents, we usually use the C syntax definition of the listings package. I've added the definitions below to the lstlang1.sty file (you can find it in the shared directory of your TeX installation). Put them just after the C++ language definitions and before Objective C.


Then, set up your listing environment as in this example:

 \lstset{language=[OpenCL]C,caption={Generated Kernel}, 

Hint: Alternatively, you can put this configuration directly in your TeX project. Copy and paste it after your \usepackage{listings}, replacing "\lst@definelanguage" with "\lstdefinelanguage".

Friday, June 10, 2011

GPU and CPU Double Precision Performance

After checking data from Wikipedia, vendors' specifications and other sources, I made the chart above. It compares GPU and CPU performance in double precision. While the Intel Core i7 980X (Extreme Edition) delivers around 110 GFLOPS (source: Tom's Hardware), GPUs such as the AMD Radeon 6970 and NVidia C2090 offer more than 660 GFLOPS. Note that these figures represent peak performance under specific conditions for each platform.

Tuesday, June 7, 2011

OpenCL Books

OpenCL Programming Guide
Chapter 1. An Introduction to OpenCL
Chapter 2. HelloWorld: An OpenCL Example
Chapter 3. Platforms, Contexts, and Devices
Chapter 4. Programming with OpenCL C
Chapter 5. OpenCL C Built-in Functions
Chapter 6. Programs and Kernels
Chapter 7. Buffers and Sub-Buffers
Chapter 8. Images and Samplers
Chapter 9. Events
Chapter 10. Interoperability with OpenGL
Chapter 11. Interoperability with Direct3D
Chapter 12. C++ Wrapper API
Chapter 13. OpenCL Embedded Profile
Chapter 14. Image Histogram
Chapter 15. Sobel Edge Detection Filter
Chapter 16. Parallelizing Dijkstra’s Single Source Shortest Path Graph Algorithm
Chapter 17. Cloth Simulation in the Bullet Physics SDK
Chapter 18. Simulating the Ocean with Fast Fourier Transform
Chapter 19. Optical Flow
Chapter 20. Using OpenCL with PyOpenCL
Chapter 21. Matrix Multiplication with OpenCL
Chapter 22. Sparse Matrix-Vector Multiplication
Appendix A. Summary of OpenCL 1.1

The OpenCL Programming Book
Publisher: Fixstars Corporation Author: Fixstars Corporation (Ryoji Tsuchiyama, Takashi Nakamura, Takuro Iizuka, Akihiro Asahara, Satoshi Miki)
Introduction to Parallelization
Why Parallel
Parallel Computing (Hardware)
Parallel Computing (Software)
What is OpenCL?
Historical Background
An Overview of OpenCL
Why OpenCL?
Applicable Platforms
OpenCL Setup
Available OpenCL Environments
Developing Environment Setup
First OpenCL Program
Basic OpenCL
Basic Program Flow
Online/Offline Compilation
Calling the Kernel
Advanced OpenCL
OpenCL C
OpenCL Programming Practice
Case Study
FFT (Fast Fourier Transform)
Mersenne Twister

Heterogeneous Computing with OpenCL
Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry, Dana Schaa
  1. Introduction to Parallel Programming
  2. Introduction to OpenCL
  3. OpenCL Device Architectures
  4. Basic OpenCL Examples
  5. Understanding OpenCL's Concurrency and Execution Model
  6. Dissecting a CPU/GPU OpenCL Implementation
  7. OpenCL Case Study: Convolution
  8. OpenCL Case Study: Video Processing
  9. OpenCL Case Study: Histogram
  10. OpenCL Case Study: Mixed Particle Simulation
  11. OpenCL Extensions
  12. OpenCL Profiling and Debugging
  13. WebCL

Monday, June 6, 2011

AMD Radeon™ E6760 Embedded GPU OpenCL Compliant

AMD has released the first embedded GPU offering support for OpenCL.
More information:

Why does NVidia not do the same for Tegra?

Anyway, let's create OpenCL HP applications for embedded systems.