Image filtering with OpenCL
This is our lab on OpenCL, new for 2014, no changes planned for 2015. Some bug fixes for 2016.
Some vital notes on OpenCL
CUDA's __syncthreads() is called barrier() in OpenCL, with some related calls in the mem_fence() family. You also have to specify which memory accesses you synchronize: CLK_LOCAL_MEM_FENCE, CLK_GLOBAL_MEM_FENCE, or both (combined with the bitwise OR operator "|").
CUDA's shared memory is called local memory in OpenCL and is declared with __local.
Your kernel source code is not compiled into the main program but must be loaded as a text buffer at run time.
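To make the last note concrete, here is a minimal sketch of reading a kernel source file into a text buffer in plain C. The function name load_text_file is hypothetical and is not the CLutilities API; the lab's own helper may work differently.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not the lab's CLutilities API): read a whole text
   file into a NUL-terminated heap buffer. Returns NULL on failure.
   The caller must free() the result. */
char *load_text_file(const char *path)
{
    assert(path != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    /* Find the file size by seeking to the end. */
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return NULL; }
    rewind(f);

    char *text = malloc((size_t)size + 1);
    if (text == NULL) { fclose(f); return NULL; }

    size_t got = fread(text, 1, (size_t)size, f);
    fclose(f);
    text[got] = '\0';   /* terminate so it can be used as a C string */
    return text;
}
```

A buffer like this is the kind of string that the host program hands to clCreateProgramWithSource before the kernel is built.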
1. Getting started: Hello, World!
Here is our own Hello World! for OpenCL:
Like all my parallel Hello World! examples, it takes the string "Hello " and adds an array of offsets to produce "World!". This makes for a tiny parallel computation that still does what it should: produce the string "Hello World!". It is, however, quite a bit longer and harder to read than the CUDA example.
gcc hello_world_cl.c CLutilities.c -lOpenCL -I/usr/local/cuda/include -o hello_world_cl
NOTE: In earlier years, the following command line applied in the IDA lab. I have not been able to verify it, and you should probably use the one above now. In case that one does not work, the old one is given here:
(Old IDA command-line:) gcc hello_world_cl.c CLutilities.c -lOpenCL -I/sw/cuda/4.2/include -o hello_world_cl
(If you use Mac OS X, replace -lOpenCL with -framework OpenCL, as usual.)
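The offset idea behind the Hello World! example can be sketched in plain C. This is a serial stand-in, not the lab's code: the loop plays the role of the work items, each of which handles one index. The function names are hypothetical; the offsets are simply the character differences between "Hello " and "World!".

```c
#include <assert.h>
#include <stdio.h>

/* Differences between the characters of "World!" and "Hello ". */
static const int HELLO_OFFSETS[6] = {15, 10, 6, 0, -11, 1};

/* What one work item does: add its offset to its character. */
static void hello_work_item(int i, const char *in, const int *offsets, char *out)
{
    out[i] = (char)(in[i] + offsets[i]);
}

/* Serial stand-in for launching n work items; out must hold n + 1 chars. */
void hello_offsets(const char *in, const int *offsets, char *out, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        hello_work_item(i, in, offsets, out);
    out[n] = '\0';
}
```

Calling hello_offsets("Hello ", HELLO_OFFSETS, out, 6) fills out with "World!", so printing the input followed by the output gives "Hello World!".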
You are not expected to edit the code (although you are welcome to experiment, of course). But knowing what the code does, you should be able to figure out where some interesting things take place. First of all, have a look at the input data and the kernel, and then at the main program.
Question: How is the communication between the host and the graphics card handled?
Question: What function executes your kernel?
Question: How does the kernel know what element to work on?
2. Image filtering
Note: I have problems deciding on whether I should describe the problem in CUDA terms or OpenCL terms. The mix below can be a bit confusing, I guess. Anyway, "blocks" = "work groups", "thread" = "work item".
It can hardly be stressed enough how important it is to utilize local (shared) memory when doing GPU computing. Local (shared) memory allows us to re-arrange memory before writing, to take advantage of coalescing, and to re-use data to reduce the total number of global memory accesses.
In this exercise, you will start from a working OpenCL program which applies a linear filter to an image. The original image is shown to the left, and the filtered image to the right.
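As a rough picture of what such a filter computes, here is a serial CPU sketch in plain C. It assumes a mean (box) filter, a single colour channel, and border pixels that are copied unchanged; the lab's filter.cl may differ on all three points, and the function name is hypothetical.

```c
#include <assert.h>

#define KERNELSIZE 2   /* filter radius; width is 2*KERNELSIZE+1 = 5 */

/* Naive mean filter over a single-channel w x h image. Each output pixel
   reads its whole neighbourhood directly from the input, which on a GPU
   means (2*KERNELSIZE+1)^2 global memory reads per pixel. */
void naive_filter(const unsigned char *in, unsigned char *out, int w, int h)
{
    assert(w > 0 && h > 0);
    const int divby = (2 * KERNELSIZE + 1) * (2 * KERNELSIZE + 1);

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            /* Border pixels lack a full neighbourhood: copy them unchanged. */
            if (x < KERNELSIZE || x >= w - KERNELSIZE ||
                y < KERNELSIZE || y >= h - KERNELSIZE) {
                out[y * w + x] = in[y * w + x];
                continue;
            }
            unsigned int sum = 0;
            for (int dy = -KERNELSIZE; dy <= KERNELSIZE; dy++)
                for (int dx = -KERNELSIZE; dx <= KERNELSIZE; dx++)
                    sum += in[(y + dy) * w + (x + dx)];
            out[y * w + x] = (unsigned char)(sum / divby);
        }
    }
}
```

Neighbouring output pixels read almost the same input pixels, and that redundancy is what local memory can remove.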
Your task is to accelerate this operation by preloading image data into local (shared) memory. You will have to split the operation into a number of work groups (blocks) and only read the part of the image that is relevant for your computation.
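The part of the image that is relevant for one work group is its own pixels plus a border as wide as the filter radius on every side. The arithmetic can be sketched as below; the function name and the example numbers (16x16 work group, 3 bytes per pixel) are assumptions for illustration, not values prescribed by the lab.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of local memory needed by one work group: its own group_w x group_h
   pixels plus a border of 'radius' pixels on each side, times the number of
   bytes (channels) per pixel. */
size_t tile_bytes(int group_w, int group_h, int radius, int channels)
{
    assert(group_w > 0 && group_h > 0 && radius >= 0 && channels > 0);
    size_t tile_w = (size_t)group_w + 2 * (size_t)radius;
    size_t tile_h = (size_t)group_h + 2 * (size_t)radius;
    return tile_w * tile_h * (size_t)channels;
}
```

For example, tile_bytes(16, 16, 2, 3) is 20 * 20 * 3 = 1200 bytes, against 768 bytes for the group's own pixels, so the border is far from negligible for small work groups.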
It can be noted that there are several ways to do image filtering that we do not take into account here. First of all, we can utilize separable filters. We may also use texture memory in order to take advantage of cache. Neither is demanded here. The focus here is memory addressing and local (shared) memory.
You need to use __local for declaring shared (local) memory, e.g. "__local unsigned char buffer[64];" for allocating a 64-byte array (the name buffer is arbitrary).
After loading data to shared memory, before processing, you should use barrier(CLK_LOCAL_MEM_FENCE) or barrier(CLK_GLOBAL_MEM_FENCE) to synchronize. Beware that a global fence can be slow when you only need a local one!
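The load-then-synchronize pattern can be emulated on the CPU to check the index arithmetic. In the sketch below the outer loop stands in for the work items of one group, which on the GPU run concurrently, and the tile array stands in for a __local array. The 4x4 group size, the radius of 1, the clamping at the image edges, and all names are assumptions made for the example.

```c
#include <assert.h>

#define GROUP_W 4          /* hypothetical work group width  */
#define GROUP_H 4          /* hypothetical work group height */
#define RADIUS  1          /* hypothetical filter radius     */
#define TILE_W (GROUP_W + 2 * RADIUS)
#define TILE_H (GROUP_H + 2 * RADIUS)

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Emulates one work group filling its tile. (gx0, gy0) is the image
   coordinate of the group's first work item. Coordinates outside the
   image are clamped to the nearest edge pixel. */
void load_tile(const unsigned char *img, int w, int h, int gx0, int gy0,
               unsigned char tile[TILE_H][TILE_W])
{
    assert(w > 0 && h > 0);
    const int group_size = GROUP_W * GROUP_H;

    /* Phase 1: every work item copies a strided share of the tile. */
    for (int lid = 0; lid < group_size; lid++) {
        for (int t = lid; t < TILE_W * TILE_H; t += group_size) {
            int tx = t % TILE_W, ty = t / TILE_W;
            int ix = clampi(gx0 - RADIUS + tx, 0, w - 1);
            int iy = clampi(gy0 - RADIUS + ty, 0, h - 1);
            tile[ty][tx] = img[iy * w + ix];
        }
    }
    /* On the GPU, barrier(CLK_LOCAL_MEM_FENCE) goes here: no work item may
       start filtering until all of them have finished phase 1. */
}
```

Because the tile is only written to local memory, a local fence is sufficient at that point; the serial emulation needs no barrier since the loop finishes before anything reads the tile.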
filter.c/filter.cl is a naive OpenCL implementation that does not use local memory. ppmread.c and ppmread.h read and write PPM files (a very simple image format). CLutilities helps you load files and display errors.
NOTE: We don't expect you to make a perfectly optimized solution, but your solution should to a reasonable extent follow the guidelines for a good OpenCL kernel. You only need to support one filter size, but I recommend that you support a variable size. The preset, 5x5 (that is, KERNELSIZE = 2), might not be enough for a significant difference due to the startup time of OpenCL. I got more significant differences with larger sizes in my solution. With 7x7, I got a fairly consistent 3x speedup (in Southfork) over the naive kernel.
The first target (above) is to reduce global memory access, but have you thought about coalescing, control divergence and occupancy? Be prepared for additional questions on these subjects.
NOTE: The first run of a modified kernel tends to take more time than the following ones. Run more than once to get an accurate number.
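One way to follow this advice is to discard a warm-up run and average the rest. The sketch below does that with wall-clock time from the C11 function timespec_get. The names time_after_warmup and dummy_work are hypothetical; in a real measurement the work function would launch the kernel and wait for it to finish (for example with clFinish) before returning.

```c
#include <assert.h>
#include <time.h>

static int calls = 0;

/* Stand-in workload that only counts how often it is invoked. */
static void dummy_work(void) { calls++; }

/* Runs work() once as an untimed warm-up, then 'runs' more times, and
   returns the mean wall-clock time per timed run in seconds. */
double time_after_warmup(void (*work)(void), int runs)
{
    assert(work != NULL && runs > 0);
    struct timespec start, end;

    work();                              /* warm-up: not timed */
    timespec_get(&start, TIME_UTC);
    for (int i = 0; i < runs; i++)
        work();
    timespec_get(&end, TIME_UTC);

    double elapsed = (double)(end.tv_sec - start.tv_sec)
                   + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    return elapsed / runs;
}
```

Calling time_after_warmup(dummy_work, 5) invokes the workload six times and reports the average of the last five.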
QUESTION: How much data did you put in local (shared) memory?
QUESTION: How much data does each thread copy to local memory?
QUESTION: How did you handle the necessary overlap between the work groups?
QUESTION: If we wanted to increase the block size, about how large a work group would be safe to use in this case? Why?
QUESTION: How much speedup did you get over the naive version?
Extra (if you have time):
A "real" filter should have different weights for each element. For a 5x5 filter it may look like this (example from my lectures):
One possible way to implement these weights is as an array. Since all threads will access this array in the same order, it is suitable to store this array as constant memory (as described in the second CUDA lecture). Create this array and make it available to your kernel.
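A weight table of this kind can be sketched in plain C as follows. The binomial weights are a hypothetical example and not necessarily the ones from the lecture, and the function name is made up. In OpenCL C the same table could be declared at program scope with __constant instead of static const.

```c
#include <assert.h>

#define KERNELSIZE 2
#define FILTER_W (2 * KERNELSIZE + 1)

/* Hypothetical 5x5 binomial weights (not necessarily the lecture's).
   Every work item reads the table in the same order. */
static const int weights[FILTER_W][FILTER_W] = {
    {1,  4,  6,  4, 1},
    {4, 16, 24, 16, 4},
    {6, 24, 36, 24, 6},
    {4, 16, 24, 16, 4},
    {1,  4,  6,  4, 1},
};
#define WEIGHT_SUM 256   /* sum of all entries above */

/* Weighted filter result for one interior pixel of a single-channel image. */
unsigned char weighted_pixel(const unsigned char *in, int w, int h, int x, int y)
{
    assert(x >= KERNELSIZE && x < w - KERNELSIZE);
    assert(y >= KERNELSIZE && y < h - KERNELSIZE);
    unsigned int sum = 0;
    for (int dy = -KERNELSIZE; dy <= KERNELSIZE; dy++)
        for (int dx = -KERNELSIZE; dx <= KERNELSIZE; dx++)
            sum += (unsigned int)weights[dy + KERNELSIZE][dx + KERNELSIZE]
                 * in[(y + dy) * w + (x + dx)];
    return (unsigned char)(sum / WEIGHT_SUM);
}
```

Dividing by the sum of the weights keeps the overall brightness unchanged: a uniformly coloured area comes out with the same value it had before filtering.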
QUESTION: Were there any particular problems in adding this feature?
That is all for lab 6. Write down answers to all questions and then show your results to us.