Laboration 6: Introduction to OpenCL

Important! Work in progress! The lab series will be revised for the 2014 course! The material in the following links may be heavily altered before the labs start. The changes will not appear immediately at the course start but some time before the official first Lab 4 session.

This is our lab on OpenCL. It partially builds on an older lab, designed by Jens Ogniewski, but is quite a bit scaled down and simplified.

Some vital notes on OpenCL

CUDA's threadsynchronize() is now called barrier(), with some related calls in the mem_fence() family. You also have to specify what memory access you synchronize, CLK_LOCAL_MEM_FENCE, CLK_GLOBAL_MEM_FENCE, or both (with the binary or operator "|").

CUDA's shared memory is now called local, declared __local.

Your source code is not part of the main program but must be loaded as a text buffer.

1. Getting started: Hello, World!

Here is our own Hello World! for OpenCL:


Like all my parallel Hello World! examples, it uses the string "Hello " plus an array of offsets to produce "World!", thus making a tiny parallel computation and still does what it should, produces the string "Hello World!". This is, however, quite a bit longer and harder to read than the CUDA example.

Compile with

(IDA:) gcc hello_world_cl.c CLutilities.c -lOpenCL -I/sw/cuda/4.2/include -o hello_world_cl

(ISY:) gcc hello_world_cl.c CLutilities.c -lOpenCL -I/usr/local/cuda/include -o hello_world_cl

(For MacOSX, the library should be replaced by -framework OpenCL, as usual.)

Run with


You are not supposed to edit the code (although you are welcome to experiment, of course). But knowing what the code does, you should be able to figure out where some interesting things take place. First of all, have a look at the input data and the kernel, and then at the main program.

Question: How is the communication between the host and the graphic card handled?

What function executes your kernel?

Question: How does the kernel know what element to work on?

2. Rank sorting on the GPU

Rank sorting is inefficient in sequential code but easy to parallelize. It is a very simple algorithm where you, for all elements, simply count the number of elements that are smaller, thereby creating a rank, an index, and then you move the element to that index.

2.1 Bad approach, do it better

The following code implements a rank sorting algorithm on the CPU as well as the GPU. An image is created by the program, which just is a gradient since it holds sorted numbers. The left image is the data from the CPU, and the right from the GPU. The two should be indentical. However, the GPU output data is incorrect. Why? How can you do it right?

Note that the GPU output sometimes turns out correct, by pure luck, so you may need more than one test run. (This is particularly true with an older version of, but should be no problem now.)

Hint: It isn't any minor bug, but rather a flaw in the general approach.

The main program and the kernel are in separate files, as they should be. There is also a small utility file.


(Old versions: CLutilities.h CLutilities.c sort.c - will work but give some warnings. Kept here just in case I make some mistake with the updated files.)

Compile with

IDA: gcc sort.c CLutilities.c -lOpenCL -lGL -lglut -I/sw/cuda/4.2/include -o sort
ISY: gcc sort.c CLutilities.c -lOpenCL -lGL -lglut -I/usr/local/cuda/include -o sort

and run with


Question: What was the problem, and the fix?

2.2 Improve performance with local memory

Your new kernel should work perfectly, but it isn't very fast. The next step is to accelerate it using local memory (what we called shared in CUDA). Local memory is declared in the kernel as:

__local float myBuffer[512];

However, the speedup may not be significant on small data sets since we measure with data transfer time. Try increasing the data size by changing the dataWidth and dataHeight constants at the top of the program. A good solution should handle data size above 64x64 gracefully.

Question: How did you take advantage of local memory?

Question: What speedup did you get, and why?

Note: Constant memory fits this problem well, and trying it is highly recommended.

3. Wavelet transform

The wavelet transform is a famous operation that is used in image compression algorithms. In its simplest form, from an image it calculates four images of 1/4 the size, each pixel being produced by a combination of four source pixels, different for each of the four resulting images. These operations are:

out1 = (in1 + in2 + in3 + in4)/4
out2 = (in1 + in2 - in3 - in4)/4
out3 = (in1 - in2 + in3 - in4)/4
out4 = (in1 - in2 - in3 + in4)/4

This is pseudocode, refers to entire pixels in a 2x2 neighborhood, and four output pixels in suitable positions. Translate to real code. Note that each pixel consists of three color channels (red, green, blue). You also need to adjust the numerical range somewhat while computing.

Notice that the original image can be reconstructed perfectly from this information.

For the famous Lenna image, the result is like this:

Wavelet transformed Lenna

Essentially, what you see is one low-res image, two images describing edges and one describing extreme points. As you can see, there is less information in the latter images, which means that they are easy to compress.

Implementing this transform in OpenCL is perfectly doable. Would you use shared/local memory in such an implementation?

cl3.c and (below) form a lab shell for this task, similar to the one in part 2. cl3.c loads the Lenna image and passes it to two processing routines that, for now, just inverts the image. Both resulting images are displayed.

(Old version of cl3.c: cl3.c)

Here is the Lenna image:


a) Implement the transform on the CPU first. Add timing and measure.

b) Port your resulting code to OpenCL code and put in the OpenCL kernel.

Question: What speedup did you get compared to your CPU version?

Question: Why do you think a wavelet coded image can be easier to compress than the original?

3 extra (if you have time):

Make a wavelet inverse and see if you can get the original image back. (If you are interested in image coding it is also interesting to study the data reduction.)


That's all for this lab.