Laboration 6: Introduction to OpenCL
This is our lab on OpenCL. It partially builds on an
older lab, designed by Jens, but is quite a bit scaled down and
Some vital notes on OpenCL
CUDA's threadsynchronize() is now called barrier(), with some related
calls in the mem_fence() family. You also have to specify what memory
access you synchronize, CLK_LOCAL_MEM_FENCE, CLK_GLOBAL_MEM_FENCE, or
both (with the binary or operator "|").
CUDA's shared memory is now called local, declared __local.
Your source code is not part of the main program but must be loaded as
a text buffer.
1. Getting started: Hello, World!
Here is our own Hello World! for OpenCL:
Like all my parallel Hello World! examples, it uses the string "Hello "
plus an array of offsets to produce "World!", thus making a tiny
parallel computation and still does what it should, produces the string
"Hello World!". This is, however, quite a bit longer and harder to read
than the CUDA example.
(IDA:) gcc hello_world_cl.c CLutilities.c -lOpenCL
-I/sw/cuda/4.2/include -o hello_world_cl
(ISY:) gcc hello_world_cl.c CLutilities.c -lOpenCL
-I/usr/local/cuda/include -o hello_world_cl
(For MacOSX, the library should be replaced by -framework OpenCL, as
You are not supposed to edit the code (although you are welcome to
experiment, of course). But knowing what the code does, you should be
able to figure out where some interesting things take place. First of
all, have a look at the input data and the kernel, and then at the main
Question: How is the
communication between the host and the graphic card handled?
Question: What function executes your kernel?
Question: How does the kernel
know what element to work on?
2. Rank sorting on the GPU
Rank sorting is inefficient in
sequential code but easy to parallelize. It is a very simple algorithm
where you, for all elements, simply count the number of elements that
are smaller, thereby creating a rank, an index, and then you move the
element to that index.
2.1 Bad approach, do it better
The following code implements a rank sorting algorithm on the
CPU as well as the GPU. An image is created by the program, which just
is a gradient
since it holds sorted numbers. The left image is the data from the CPU,
and the right from the GPU. The two should be indentical. However, the
GPU output data is incorrect. Why? How can you do it right?
Note that the GPU output sometimes turns out correct, by pure luck, so
you may need more than one test run. (This is particularly true with an
older version of sort.cl, but should be no problem now.)
Hint: It isn't any minor bug, but rather a flaw in the general approach.
The main program and the kernel are in separate files, as they should
be. There is also a small utility file.
(Old versions: CLutilities.h CLutilities.c sort.c - will work but give some
warnings. Kept here just in case I make some mistake with the updated
IDA: gcc sort.c CLutilities.c -lOpenCL -lGL -lglut
-I/sw/cuda/4.2/include -o sort
ISY: gcc sort.c CLutilities.c -lOpenCL -lGL -lglut
-I/usr/local/cuda/include -o sort
and run with
Question: What was the problem,
and the fix?
2.2 Improve performance with local memory
Your new kernel should work perfectly, but it isn't very fast. The next
step is to accelerate it using local memory (what we called shared in
CUDA). Local memory is declared in the kernel as:
__local float myBuffer;
However, the speedup may not be significant on small data sets since we
measure with data transfer time. Try increasing the data size by
changing the dataWidth and dataHeight constants at the top of the
program. A good solution should handle data size above 64x64 gracefully.
Question: How did you
take advantage of local memory?
Question: What speedup did you get, and why?
Note: Constant memory fits this problem well, and trying it is highly
3. Wavelet transform
The wavelet transform is a famous
operation that is used in image
compression algorithms. In its simplest form, from an image it
calculates four images of 1/4 the size, each pixel being produced by a
combination of four source pixels, different for each of the four
resulting images. These operations are:
out1 = (in1 + in2 + in3 + in4)/4
out2 = (in1 + in2 - in3 - in4)/4
out3 = (in1 - in2 + in3 - in4)/4
out4 = (in1 - in2 - in3 + in4)/4
This is pseudocode, refers to entire pixels in a 2x2 neighborhood, and
four output pixels in suitable positions. Translate to real code. Note
that each pixel consists of three color channels (red, green, blue).
You also need to adjust the numerical range somewhat while computing.
Notice that the original image can be reconstructed perfectly from this
For the famous Lenna image, the result is like this:
Wavelet transformed Lenna
Essentially, what you see is one low-res image, two images describing
edges and one describing extreme points. As you can see, there is less
information in the latter images, which means that they are easy to
Implementing this transform in OpenCL is perfectly doable. Would you
use shared/local memory in such an implementation?
cl3.c and 3.cl (below) form a lab shell for this task, similar to the
one in part 2. cl3.c loads the Lenna image and passes it to two
processing routines that, for now, just inverts the image. Both
resulting images are displayed.
(Old version of cl3.c: cl3.c)
Here is the Lenna image:
a) Implement the transform on the CPU first. Add timing and measure.
b) Port your resulting code to OpenCL code and put in the OpenCL kernel.
What speedup did you get compared to your CPU version?
Question: Why do you think a
wavelet coded image can be easier to compress than the original?
3 extra (if you have time):
Make a wavelet inverse and see if you can get the original image back.
(If you are interested in image coding it is also interesting to study
the data reduction.)
That's all for this lab. Just a few more questions:
Part 3 is new this year, in order to expand the OpenCL lab a little
bit. (It was clearly too small before.) However, we wanted to expand
carefully, to stay on the managable side. Was it interesting enough,
worthwhile, or still too easy? Would it have been better to get less
ready-made code and adapt part 2 instead?
Evaluation question: Do you think we have a reasonable mix of
CUDA and OpenCL?