Homework 2: More image operations with OpenCL

Examination:

You should write down answers to all questions, make an archive of the answers together with your resulting source files, and mail it to me.

Deadline:

No strict deadline. My recommendation is to finish before the next homework, that is, after Easter.

In this lab the intention is that you use OpenCL. It is mostly similar to CUDA. We want to widen your scope a bit, while getting further into the basic image processing tools.

Please have some tolerance for mistakes. I had to whip this together rather quickly. All steps are tested, but only on my Mac so you may need to make some adjustments for other systems.

Some vital notes on OpenCL

CUDA's __syncthreads() is now called barrier(), with some related calls in the mem_fence() family. You also have to specify which memory accesses you synchronize: CLK_LOCAL_MEM_FENCE, CLK_GLOBAL_MEM_FENCE, or both (combined with the bitwise OR operator "|").

CUDA's shared memory is now called local, declared __local.
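
The two notes above can be illustrated with a small kernel sketch. This is a hypothetical example, not part of the lab files, and it assumes a one-dimensional launch with a work-group size of at most 256:

```c
// Hypothetical kernel: each work group stages its part of the input in
// local memory, synchronizes, and then reads what neighboring work
// items wrote (here: reversing the order within the group).
__kernel void reverse_block(__global const float *in, __global float *out)
{
    __local float buffer[256];       // CUDA "shared" = OpenCL "local"
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int n   = get_local_size(0);

    buffer[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);    // like CUDA's __syncthreads()

    out[gid] = buffer[n - 1 - lid];  // safe: all local writes are now visible
}
```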

Your source code is not part of the main program but must be loaded as a text buffer.

1. Getting started: Hello, World!

Here is our own Hello World! for OpenCL:

hello_world_cl.c

which uses

CLutilities.c
CLutilities.h

Like all my parallel Hello World! examples, it uses the string "Hello " plus an array of offsets to produce "World!", thus making a tiny parallel computation while still doing what it should: producing the string "Hello World!". It is, however, quite a bit longer and harder to read than the CUDA example.

Compile with

gcc hello_world_cl.c CLutilities.c -lOpenCL -I/usr/local/cuda/include -o hello_world_cl

(For Mac OS X, the library flag should be replaced by -framework OpenCL, as usual.)

Run with

./hello_world_cl

You are not supposed to edit the code (although you are welcome to experiment, of course). But knowing what the code does, you should be able to figure out where some interesting things take place. First of all, have a look at the input data and the kernel, and then at the main program.

Question: How is the communication between the host and the graphics card handled?

Question: What function executes your kernel?

Question: How does the kernel know what element to work on?

2. Histogram calculation on the GPU

Histograms are common tools in image processing, used for gathering basic statistics of an image. On the CPU, it is a straightforward sequential task. Your task is to figure out a suitable way to perform it on the GPU. I am not saying that any approach is right or wrong. This is not a heavy operation, so don't expect a big speedup.

loadimage.c and loadimage.cl (below) form a lab shell for this task. loadimage.c loads the Lenna image and passes it to two processing routines (CPU and GPU) that, for now, calculate the luminance of the first 256 pixels. Results are printed to stdout. Timing is included, and you should expect the CPU to outperform the GPU since the problem is too small. With some more work for each thread (work item), the balance will be better.

loadimage.cl
loadimage.c
readppm.c
readppm.h
CLutilities.c
CLutilities.h

Here is the Lenna image:

lenna512.ppm

Compile with

gcc loadimage.c CLutilities.c -lOpenCL -I/usr/local/cuda/include -o loadimage

and run with

./loadimage

You may wish to use local memory (what we called shared in CUDA). Local memory is declared in the kernel like this:

__local float myBuffer[512];
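
One possible GPU strategy is a per-group local histogram merged into a global one with atomics. This is a hypothetical sketch, not part of the lab files; it assumes a one-dimensional launch with a work-group size of at least 256, the plain-average luminance, and atomic_inc/atomic_add (core since OpenCL 1.1):

```c
// Hypothetical kernel: each work item handles one pixel and atomically
// bumps a bin in a local histogram; the group then adds its local
// histogram into the global one.
__kernel void histogram(__global const uchar *image,
                        __global unsigned int *hist)
{
    __local unsigned int localHist[256];
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    if (lid < 256) localHist[lid] = 0;   // guard in case the group is larger
    barrier(CLK_LOCAL_MEM_FENCE);

    int lum = (image[3*gid] + image[3*gid + 1] + image[3*gid + 2]) / 3;
    atomic_inc(&localHist[lum]);
    barrier(CLK_LOCAL_MEM_FENCE);

    if (lid < 256) atomic_add(&hist[lid], localHist[lid]);
}
```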

Question: How did you parallelize the problem?

Question: What performance did you get, and why?

3. Wavelet transform

The wavelet transform is a famous operation that is used in image compression algorithms. In its simplest form, it calculates four images of 1/4 the size from the input image; each output pixel is produced by a combination of four source pixels, with a different combination for each of the four resulting images. These operations are:

out1 = (in1 + in2 + in3 + in4)/4
out2 = (in1 + in2 - in3 - in4)/4
out3 = (in1 - in2 + in3 - in4)/4
out4 = (in1 - in2 - in3 + in4)/4

This is pseudocode: in1 to in4 refer to entire pixels in a 2x2 neighborhood, and out1 to out4 to four output pixels in suitable positions. Translate it to real code. Note that each pixel consists of three color channels (red, green, blue). You also need to adjust the numerical range somewhat while computing, since the formulas with minus signs can produce negative values.

Notice that the original image can be reconstructed perfectly from this information.

For the famous Lenna image, the result is like this:



Wavelet transformed Lenna

Essentially, what you see is one low-res image, two images describing edges and one describing extreme points. As you can see, there is less information in the latter images, which means that they are easy to compress.

Implementing this transform in OpenCL is perfectly doable. Would you use shared/local memory in such an implementation?

Lab files: invertimage.c and invertimage.cl (below) form a lab shell for this task. invertimage.c loads the Lenna image and passes it to two processing routines (CPU and GPU) that, for now, just invert the image. Both resulting images are displayed side by side.

invertimage.cl
invertimage.c
readppm.c
readppm.h

Here is the Lenna image:

lenna512.ppm

a) Implement the transform on the CPU first. Add timing and measure.

b) Port your resulting code to OpenCL and put it in the OpenCL kernel.

Question: What speedup did you get compared to your CPU version?

Question: Why do you think a wavelet coded image can be easier to compress than the original?

Extra (if you have time):

Make a wavelet inverse and see if you can get the original image back. (If you are interested in image coding it is also interesting to study the data reduction.)
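
As a hint for checking your inverse: the transform is its own inverse up to scaling, so applying the same sign patterns to the four output pixels, without the division by 4, recovers the inputs:

in1 = out1 + out2 + out3 + out4
in2 = out1 + out2 - out3 - out4
in3 = out1 - out2 + out3 - out4
in4 = out1 - out2 - out3 + out4

(Exact reconstruction also requires that no precision was lost in the forward division and range adjustment.)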

---

That's all for this homework. Submit your results to me.