Laboration 6: Introduction to OpenCL
NEW LAB. THERE MAY BE BUGS IN THE LAB MATERIAL. PLEASE LET US KNOW IF YOU FIND BUGS.
This is our brand new lab on OpenCL. It partially builds on an
older lab, designed by Jens, but is quite a bit scaled down and
simplified.
Originally we intended to include computing with GLSL here, too, but
for those of you who never used GLSL and OpenGL before, that would be
too time consuming, would endanger your deadline. We choose to go
somewhat deeper in OpenCL instead.
Some vital notes on OpenCL
CUDA's threadsynchronize() is now called barrier(), with some related
calls in the mem_fence() family. You also have to specify what memory
access you synchronize, CLK_LOCAL_MEM_FENCE, CLK_GLOBAL_MEM_FENCE, or
both (with the binary or operator "|").
CUDA's shared memory is now called local, declared __local.
Your source code is not part of the main program but must be loaded as a text buffer.
1. Getting started: Hello, World!
Here is our own Hello World! for OpenCL:
hello_world_cl.c
Like all my parallel Hello World! examples, it uses the string "Hello "
plus an array of offsets to produce "World!", thus making a tiny
parallel computation and still does what it should, produces the string
"Hello World!". This is, however, quite a bit longer and harder to read
than the CUDA example.
Compile with
gcc hello_world_cl.c -lOpenCL -o hello_world_cl
(For MacOSX, the library should be replaced by -framework OpenCL, as usual.)
Run with
./hello_world_cl
You are not supposed to edit the code (although you are welcome to
experiment, of course). But knowing what the code does, you should be
able to figure out where some interesting things take place. First of
all, have a look at the input data and the kernel, and then at the main
program.
Question: How is the communication between the host and the graphic card handled?
Question: What function executes your kernel?
Question: How does the kernel know what element to work on?
2. Rank sorting on the GPU
Rank sorting is inefficient in
sequential code but easy to parallelize. It is a very simple algorithm
where you, for all elements, simply count the number of elements that
are smaller, thereby creating a rank, an index, and then you move the
element to that index.
2.1 Bad approach, do it better
The following code implements a rank sorting algorithm on the
CPU as well as the GPU. An image is created by the program, which just
is a gradient
since it holds sorted numbers. The left image is the data from the CPU,
and the right from the GPU. The two should be indentical. However, the
GPU output data is incorrect. Why? How can you do it right?
Hint: It isn't any minor bug, but rather a flaw in the general approach.
The main program and the kernel are in separate files, as they should be. There is also a small utility file.
sort.c
sort.cl
CLutilities.c
CLutilities.h
Compile with
gcc sort.c CLutilities.c -lOpenCL -lGL -lglut -o sort
and run with
./sort
Question: What was the problem, and the fix?
2.2 Improve performance with local memory
Your new kernel should work perfectly, but it isn't very fast. The next
step is to accelerate it using local memory (what we called shared in
CUDA). Local memory is declared in the kernel as:
__local float myBuffer[512];
However, the speedup may not be significant on small data sets since we
measure with data transfer time. Try increasing the data size by
changing the dataWidth and dataHeight constants at the top of the
program. A good solution should handle data size above 64x64 gracefully.
Question: How did you take advantage of local memory?
Question: What speedup did you get, and why?
Note: Constant memory fits this problem well, and trying it is highly recommended.
3. Extras
After much consideration, I decided that the parts above are enough for
a good introduction to OpenCL, so this last part is non-mandatory for
those of you who find part 1 and 2 easy and want to try some more. (At
this time the following is a draft and not properly tested.)
If you don't have time or energy for these suggestions, just skip to the final questions at the end.
Proposed exercise a: Wavelet transform
Again: Not mandatory! But lab groups have tried it and liked it.
The wavelet transform is a famous operation that is used in image
compression algorithms. In its simplest form, from an image it
calculates four images of 1/4 the size, each pixel being produced by a
combination of four source pixels, different for each of the four
resulting images. These operations are:
out = (in1 + in2 + in3 + in4)/4
out = (in1 + in2 - in3 - in4)/4
out = (in1 - in2 + in3 - in4)/4
out = (in1 - in2 - in3 + in4)/4
Notice that the original image can be reconstructed perfectly from this information.
For the famous Lenna image, the result is like this:
Wavelet transformed Lenna
Essentially, what you see is one low-res image, two images describing
edges and one describing extreme points. As you can see, there is less
information in the latter images, which means that they are easy to
compress.
Implementing this transform in OpenCL is perfectly doable. Would you use shared/local memory in such an implementation?
Proposed exercise b: Bitonic Merge Sort
One more time: Not mandatory!
The sorting algorithm in part 1, rank sorting, is obviously not the
best one possible. One algorithm that is particularly nice for parallel
implementation is Bitonic Merge Sort (which I briefly mentioned on a
lecture).
Here is a CPU implementation in C:
void cpu_Sort_BMS(unsigned int *data, unsigned int length)
{
unsigned int i,j,k,ixj,tmp;
/* bitonic merge sort */
for (k=2;k<=length;k=2*k)
{
for (j=k>>1;j>0;j=j>>1)
{
for (i=0;i<length;i++)
{
ixj=i^j;
if ((ixj)>i)
{
if ((i&k)==0 && data[i]>data[ixj])
{
tmp=data[i];
data[i]=data[ixj];
data[ixj]=tmp;
}
if ((i&k)!=0 && data[i]<data[ixj])
{
tmp=data[i];
data[i]=data[ixj];
data[ixj]=tmp;
}
}
}
}
}
}
Implement this in parallel on the GPU! (Pretty
challenging, multi-pass algorithm.) Compare performance to the rank
sorting above.
Proposed exercise c: (No more yet)
(This is a placeholder for future expansion.)
That's all for this lab. Just a few more questions:
Evaluation question: Was this lab too easy, about right or even too hard? If it was hard, why?
Evaluation question: Do you think we have a reasonable mix of CUDA and OpenCL?