Homework 2: More image operations with
OpenCL
Examination:
You should write down answers to all questions, put them and your resulting source files in an archive, and mail it to me.
Deadline:
No strict deadline. My recommendation is to finish before HW 2, that is, after Easter.
In this lab, the intention is that you use OpenCL, which is mostly similar
to CUDA. We want to widen your scope a bit, while getting further into
the basic image processing tools.
Please have some tolerance for mistakes. I had to whip this together
rather quickly. All steps are tested, but only on my Mac so you may
need to make some adjustments for other systems.
Some vital notes on OpenCL
CUDA's __syncthreads() is now called barrier(), with some related
calls in the mem_fence() family. You also have to specify which memory
accesses you synchronize: CLK_LOCAL_MEM_FENCE, CLK_GLOBAL_MEM_FENCE, or
both (combined with the bitwise or operator "|").
CUDA's shared memory is now called local, declared __local.
Your source code is not part of the main program but must be loaded as
a text buffer.
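Loading the source as a text buffer can be done with ordinary C file I/O. A minimal sketch; the function name readKernelSource is my own, and the lab's CLutilities.c provides its own loader:

```c
/* Hedged sketch: read an OpenCL kernel source file into a
 * NUL-terminated text buffer, suitable for passing to
 * clCreateProgramWithSource(). Caller frees the buffer. */
#include <stdio.h>
#include <stdlib.h>

char *readKernelSource(const char *filename)
{
    FILE *f = fopen(filename, "rb");
    if (f == NULL) return NULL;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);              /* file length in bytes */
    rewind(f);
    char *buf = malloc(size + 1);      /* +1 for the terminating NUL */
    if (buf != NULL && fread(buf, 1, size, f) != (size_t)size) {
        free(buf);
        buf = NULL;
    }
    if (buf != NULL) buf[size] = '\0';
    fclose(f);
    return buf;
}
```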
1. Getting started: Hello, World!
Here is our own Hello World! for OpenCL:
hello_world_cl.c
which uses
CLutilities.c
CLutilities.h
Like all my parallel Hello World! examples, it uses the string "Hello "
plus an array of offsets to produce "World!", thus making a tiny
parallel computation while still doing what it should: producing the
string "Hello World!". This is, however, quite a bit longer and harder
to read than the CUDA example.
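The offset idea is easy to illustrate in plain sequential C. This is only a sketch of the principle; the actual data and kernel are in hello_world_cl.c, and the values below are my own reconstruction, not copied from the lab file:

```c
/* Hedged sketch of the Hello World! offset trick, sequentially in C:
 * each output character is an input character plus a per-element
 * offset. In the OpenCL kernel, each loop turn below would instead be
 * one work item. */
#define N 6

/* 'H'+15='W', 'e'+10='o', 'l'+6='r', 'l'+0='l', 'o'-11='d', ' '+1='!' */
static const int offsets[N] = {15, 10, 6, 0, -11, 1};

void apply_offsets(const char *in, const int *off, char *out)
{
    for (int i = 0; i < N; i++)   /* one "work item" per character */
        out[i] = in[i] + off[i];
    out[N] = '\0';
}
```

Applied to "Hello ", this produces "World!".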
Compile with
gcc hello_world_cl.c CLutilities.c -lOpenCL
-I/usr/local/cuda/include -o hello_world_cl
(For MacOSX, the library should be replaced by -framework OpenCL, as
usual.)
Run with
./hello_world_cl
You are not supposed to edit the code (although you are welcome to
experiment, of course). But knowing what the code does, you should be
able to figure out where some interesting things take place. First of
all, have a look at the input data and the kernel, and then at the main
program.
Question: How is the communication between the host and the graphics card handled?
Question: What function executes your kernel?
Question: How does the kernel know what element to work on?
2. Histogram calculation on the GPU
Histograms are common tools in image processing, for gathering some
basic statistics of an image. On the CPU, it is a straightforward
sequential task. Your task is to figure out a suitable way to perform
this on the GPU. I am not saying that any approach is right or wrong.
This is not a tough operation, so don't expect a big speedup.
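For reference, the straightforward sequential CPU version might look like this. A minimal sketch; it assumes the image has already been reduced to one 8-bit value per pixel (e.g. luminance), and the function name is my own:

```c
/* Hedged sketch: sequential 256-bin histogram on the CPU, for an
 * image given as one 8-bit value per pixel. */
#include <string.h>

void histogram256(const unsigned char *gray, int n, unsigned int *bins)
{
    memset(bins, 0, 256 * sizeof(unsigned int));  /* clear all bins */
    for (int i = 0; i < n; i++)
        bins[gray[i]]++;                          /* one increment per pixel */
}
```

The GPU version is your task: many work items incrementing the same bins is exactly where it gets interesting.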
loadimage.c and loadimage.cl (below) form a lab shell for this
task. loadimage.c loads the Lenna image and passes it to two
processing routines (CPU and GPU) that, for now, calculate the
luminance of the first 256 pixels. Results are printed to stdout.
Timing is included, and you should expect the CPU to outperform the GPU
since the problem is too small. With some more work for each thread
(work item), the balance will be better.
loadimage.cl
loadimage.c
readppm.c
readppm.h
CLutilities.c
CLutilities.h
Here is the Lenna image:
lenna512.ppm
Compile with
gcc loadimage.c CLutilities.c -lOpenCL
-I/usr/local/cuda/include -o loadimage
and run with
./loadimage
You may wish to use local memory (what we called shared in CUDA).
Local memory is declared in the kernel like this:
__local float myBuffer[512];
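To show the declaration in context, here is a minimal device-side sketch of a work group cooperating through local memory. This is untested OpenCL C, the kernel name and sizes are made up, and it does no useful work beyond illustrating the pattern:

```c
/* Hedged sketch (OpenCL C, untested): each work item loads one element
 * into local memory, the group synchronizes, and then every item can
 * read what the others loaded. */
__kernel void localExample(__global const float *in, __global float *out)
{
    __local float myBuffer[512];      /* shared within one work group */
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    myBuffer[lid] = in[gid];          /* cooperative load */
    barrier(CLK_LOCAL_MEM_FENCE);     /* wait for the whole group */

    out[gid] = myBuffer[lid];         /* safe to read neighbors now */
}
```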
Question: How did you parallelize the problem?
Question: What performance did you get, and why?
3. Wavelet transform
The wavelet transform is a famous operation that is used in image
compression algorithms. In its simplest form, it calculates four images
of 1/4 the size from an image, each pixel being produced by a
combination of four source pixels, different for each of the four
resulting images. These operations are:
out1 = (in1 + in2 + in3 + in4)/4
out2 = (in1 + in2 - in3 - in4)/4
out3 = (in1 - in2 + in3 - in4)/4
out4 = (in1 - in2 - in3 + in4)/4
This is pseudocode; it refers to entire pixels in a 2x2 neighborhood and
four output pixels in suitable positions. Translate it to real code. Note
that each pixel consists of three color channels (red, green, blue).
You also need to adjust the numerical range somewhat while computing.
Notice that the original image can be reconstructed perfectly from this
information.
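The four formulas, translated directly for one 2x2 block of a single channel, could look like this. A sketch only; looping over the image, handling all three RGB channels, placing the outputs in the four quadrants, and adjusting the numerical range are deliberately left to you:

```c
/* Hedged sketch: the four wavelet outputs for ONE 2x2 block of a
 * single-channel image, straight from the pseudocode above. */
void wavelet2x2(float in1, float in2, float in3, float in4,
                float *out1, float *out2, float *out3, float *out4)
{
    *out1 = (in1 + in2 + in3 + in4) / 4;  /* average: the low-res image */
    *out2 = (in1 + in2 - in3 - in4) / 4;  /* detail in one direction */
    *out3 = (in1 - in2 + in3 - in4) / 4;  /* detail in the other */
    *out4 = (in1 - in2 - in3 + in4) / 4;  /* diagonal detail */
}
```

Note that a constant 2x2 block produces zeros in all three detail outputs, which is why smooth image regions compress so well after the transform.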
For the famous Lenna image, the result is like this:
Wavelet transformed Lenna
Essentially, what you see is one low-res image, two images describing
edges and one describing extreme points. As you can see, there is less
information in the latter images, which means that they are easy to
compress.
Implementing this transform in OpenCL is perfectly doable. Would you
use shared/local memory in such an implementation?
Lab files: invertimage.c and invertimage.cl (below) form a lab shell
for this task. invertimage.c loads the Lenna image and passes it to two
processing routines (CPU and GPU) that, for now, just invert the image.
Both resulting images are displayed side by side.
invertimage.cl
invertimage.c
readppm.c
readppm.h
Here is the Lenna image:
lenna512.ppm
a) Implement the transform on the CPU first. Add timing and measure.
b) Port your resulting code to OpenCL code and put in the OpenCL kernel.
Question: What speedup did you get compared to your CPU version?
Question: Why do you think a wavelet-coded image can be easier to compress than the original?
Extra (if you have time):
Make a wavelet inverse and see if you can get the original image back.
(If you are interested in image coding it is also interesting to study
the data reduction.)
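For the inverse, solving the four equations above for in1..in4 gives a hint of how to start. Again a sketch for one 2x2 block of one channel, with everything else left to you:

```c
/* Hedged sketch: invert the four wavelet formulas for ONE 2x2 block.
 * Derived by solving the forward equations; note the /4 from the
 * forward step cancels, so these are plain sums and differences. */
void wavelet2x2_inverse(float out1, float out2, float out3, float out4,
                        float *in1, float *in2, float *in3, float *in4)
{
    *in1 = out1 + out2 + out3 + out4;
    *in2 = out1 + out2 - out3 - out4;
    *in3 = out1 - out2 + out3 - out4;
    *in4 = out1 - out2 - out3 + out4;
}
```

Feeding the forward transform's outputs through this should reproduce the original pixels exactly (up to any range adjustment you applied).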
---
That's all for this homework. Submit your results to me.