Lab 6: Image filtering with OpenCL


Important! Work in progress! The lab series will be revised for the 2014 course! The material in the following links may be heavily altered before the labs start.

News 2014-12-11: First beta available! The material works on CentOS, but the instructions below may be unclear and incomplete. If you are doing the lab early, please let me know if you find errors.

This is our lab on OpenCL, new for 2014.

Some vital notes on OpenCL

CUDA's __syncthreads() is now called barrier(), with some related calls in the mem_fence() family. You also have to specify which memory accesses you synchronize: CLK_LOCAL_MEM_FENCE, CLK_GLOBAL_MEM_FENCE, or both (combined with the bitwise OR operator "|").

CUDA's shared memory is now called local memory, declared __local.
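To make the mapping concrete, here is a minimal kernel sketch combining the two notes above (the work-group size of 64 is an assumption for the example, not something the lab fixes):

    __kernel void reverse(__global const char *in, __global char *out)
    {
        __local char buf[64];                  // CUDA: __shared__
        int i = get_local_id(0);

        buf[i] = in[get_global_id(0)];         // stage data in local memory
        barrier(CLK_LOCAL_MEM_FENCE);          // CUDA: __syncthreads()
        out[get_global_id(0)] = buf[63 - i];   // safe: all writes to buf are done
    }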

Your kernel source code is not compiled into the main program; it must be loaded as a text buffer at run time.
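On the host side this typically looks like the fragment below (error handling omitted; context and device are assumed to come from the usual setup calls, and readSource is a hypothetical helper that returns the contents of a .cl file as a string, similar to what CLutilities provides):

    // Build a program from source at run time, then pick out a kernel.
    const char *source = readSource("hello_world.cl");
    cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, NULL);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "hello", NULL);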

1. Getting started: Hello, World!

Here is our own Hello World! for OpenCL:

hello_world_cl.c

Like all my parallel Hello World! examples, it uses the string "Hello " plus an array of offsets to produce "World!", thus performing a tiny parallel computation while still doing what it should: producing the string "Hello World!". It is, however, quite a bit longer and harder to read than the CUDA example.
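The kernel itself is tiny. A sketch of the idea (the actual kernel in hello_world_cl.c may differ in names and details):

    __kernel void hello(__global char *s, __global const int *offsets)
    {
        int i = get_global_id(0);   // one work item per character
        s[i] += offsets[i];         // "Hello " + offsets -> "World!"
    }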

Compile with

(IDA:) gcc hello_world_cl.c CLutilities.c -lOpenCL -I/sw/cuda/4.2/include -o hello_world_cl

(ISY:) gcc hello_world_cl.c CLutilities.c -lOpenCL -I/usr/local/cuda/include -o hello_world_cl

(For Mac OS X, the library flag should be replaced by -framework OpenCL, as usual.)

Run with

./hello_world_cl

You are not supposed to edit the code (although you are welcome to experiment, of course). But knowing what the code does, you should be able to figure out where some interesting things take place. First of all, have a look at the input data and the kernel, and then at the main program.

Question: How is the communication between the host and the graphic card handled?

Question: What function executes your kernel?

Question: How does the kernel know what element to work on?

2. Image filtering

Note: I had trouble deciding whether to describe the problem in CUDA terms or OpenCL terms, so the mix below may be a bit confusing. Anyway, "blocks" = "work groups", "threads" = "work items".

It can hardly be stressed enough how important it is to utilize local (shared) memory in GPU computing. Local (shared) memory allows us to re-arrange data before writing, to take advantage of coalescing, and to re-use data to reduce the total number of global memory accesses.

In this exercise, you will start from a working OpenCL program which applies a linear filter to an image. The original image is shown to the left, and the filtered image to the right.

Your task is to accelerate this operation by preloading image data into local (shared) memory. You will have to split the operation into a number of work groups (blocks) and read only the part of the image that is relevant for your computation.

It can be noted that there are several ways to do image filtering that we do not take into account here. First of all, we could utilize separable filters. We could also use texture memory in order to take advantage of its cache. Neither is required here. The focus is on memory addressing and local (shared) memory.

You need to use __local for declaring shared (local) memory, e.g. "__local unsigned char buf[64];" for allocating a 64-byte array.

After loading data into local memory, and before processing it, you should synchronize with barrier(CLK_LOCAL_MEM_FENCE) or barrier(CLK_GLOBAL_MEM_FENCE). Beware that a global fence can be slow when you only need a local one!
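One common pattern is sketched below: each work item stages its own pixel into a local tile that has room for a halo, everyone synchronizes, and the filter then reads from the tile instead of global memory. The 16x16 work-group size is an assumption for the sketch, and the halo (overlap) loading and the actual filter loop are deliberately left for you to fill in:

    #define KERNELSIZE 2
    #define BLOCK 16   // assumed work-group size in each dimension

    __kernel void filter(__global const unsigned char *image,
                         __global unsigned char *out,
                         const int width)
    {
        __local unsigned char tile[BLOCK + 2*KERNELSIZE][BLOCK + 2*KERNELSIZE];

        int lx = get_local_id(0), ly = get_local_id(1);
        int gx = get_global_id(0), gy = get_global_id(1);

        // Each work item loads its own pixel; the halo still has to be
        // loaded by someone -- that strategy is up to you.
        tile[ly + KERNELSIZE][lx + KERNELSIZE] = image[gy*width + gx];

        barrier(CLK_LOCAL_MEM_FENCE);   // a local fence is enough, and cheaper

        // Replace this with the real filter loop over the 5x5 neighborhood,
        // reading from tile[][] instead of image[].
        out[gy*width + gx] = tile[ly + KERNELSIZE][lx + KERNELSIZE];
    }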

Lab files:

lab6beta.tar.gz

filter.c/filter.cl is a naive OpenCL implementation which does not use local memory. ppmread.c and ppmread.h read and write PPM files (a very simple image format). CLutilities helps you load kernel files and display errors.

NOTE: We don't expect you to make a perfectly optimized solution, but your solution should, to a reasonable extent, follow the guidelines for a good OpenCL kernel. You only need to support one filter size; 5x5 (that is, KERNELSIZE = 2, the preset) should be enough to show a significant difference. The first target (above) is to reduce global memory accesses, but have you thought about coalescing, control divergence and occupancy? Be prepared for additional questions on these subjects.

NOTE: The first run of a modified kernel tends to take more time than the following ones. Run more than once to get an accurate number.

QUESTION: How much data did you put in local (shared) memory?

QUESTION: How much data does each thread copy to local memory?

QUESTION: How did you handle the necessary overlap between the work groups?

QUESTION: If we would like to increase the block size, roughly how large work groups would be safe to use in this case? Why?

QUESTION: How much speedup did you get over the naive version?

Extra (if you have time):

A "real" filter should have different weights for each element. For a 5x5 filter it may look like this (example from my lectures):

[Figure: 5x5 kernel weights, divided by 256]

One possible way to implement these weights is as an array. Since all threads (work items) will access this array in the same order, it is suitable to store it in constant memory (as described in the second CUDA lecture; in OpenCL, the __constant address space). Create this array and make it available to your kernel.
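For illustration only, here is such an array using the 5x5 binomial kernel, whose weights happen to sum to 256; the weights from the lecture may differ, so substitute your own values. In OpenCL, a program-scope array in the __constant address space can be declared directly in the .cl file:

    // Illustrative weights (binomial 5x5, sum = 256); replace with the lecture's.
    __constant int weights[5][5] = {
        { 1,  4,  6,  4, 1 },
        { 4, 16, 24, 16, 4 },
        { 6, 24, 36, 24, 6 },
        { 4, 16, 24, 16, 4 },
        { 1,  4,  6,  4, 1 }
    };

Inside the kernel, each neighborhood pixel is then multiplied by its weight and the accumulated sum divided by 256, e.g. sum += weights[dy + KERNELSIZE][dx + KERNELSIZE] * pixel;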

QUESTION: Were there any particular problems in adding this feature?


That is all for lab 6. Write down answers to all questions and then show your results to us.