Work in progress! The lab series will be revised for the 2014 course!
The material in the following links may be heavily altered before the
labs start. The changes will not appear immediately at the course start
but some time before the official first Lab 4 session.
It can hardly be stressed enough how important it is to utilize shared memory when doing GPU computing. Shared memory allows us to re-arrange memory before writing, to take advantage of coalescing, and to re-use data to reduce the total number of global memory accesses.
In this exercise, you will start from a working CUDA program which
applies a linear filter to an image. The
original image is shown to the left, and the filtered image to the
right (using code that is almost the same as in "interactivejulia" from
Your task is to accelerate this operation by preloading image data into shared memory. You will have to split the operation to a number of blocks and only read the part of the image that is relevant for your computation.
It can be noted that there are several ways to do image filtering that we do not take into account here. First of all, we can utilize separable filters. We may also use texture memory in order to take advantage of cache. Neither is demanded here. The focus here is memory addressing and shared memory.
You need to use __shared__ for declaring shared memory, e.g. "__shared__ unsigned char;" for allocating a 64-byte array.
After loading data to shared memory, before processing, you should __syncthreads(); to synchronize.
Consider compiling with -arch = sm30. That way you demand compute
capability 3, which means fairly modern hardware (which you have in
Southfork, but not in Olympen). Then you can printf() from the kernel,
which is useful for debugging. (With older GPUs you can perform this
with device_emulation, but then the code will run very slow.)
ppmfilter.cu is a naive CUDA implementation which is not using shared memory. ppmfilter.c is a C implementation, included only as reference. ppmread.c and ppmread.h read and write PPM files (a very simple image format).
NOTE: We don't expect you to make a perfectly optimized solution,
but your solution should to a reasonable extent follow the guidelines
for a good CUDA kernel.
The first target (above) is to reduce global memory access, but have
you thought about coalescing, control divergence and occupancy? Be
prepared for additional questions on these subjects.
QUESTION: How much data did you put in shared memory?
QUESTION: How much data does each thread copy to shared memory?
QUESTION: How did you handle the necessary overlap between the blocks?
QUESTION: If we would like to increase the block size, about how big blocks would be safe to use in this case? Why?QUESTION: How much speedup did you get over the naive version?
QUESTION: Were there any particular problems in adding this
That is all for lab 5. Write down answers to all questions and then
show your results to us.