Lab 5: Memory addressing, synchronization and shared memory in CUDA

Important! Work in progress! The lab series will be revised for the 2014 course! The material in the following links may be heavily altered before the labs start. The changes will not appear immediately at the course start, but some time before the official first Lab 5 session.

It can hardly be stressed enough how important it is to utilize shared memory when doing GPU computing. Shared memory allows us to re-arrange memory before writing, to take advantage of coalescing, and to re-use data to reduce the total number of global memory accesses.

In this exercise, you will start from a working CUDA program which applies a linear filter to an image. The original image is shown to the left, and the filtered image to the right (using code that is almost the same as in "interactivejulia" from lab 4).

Your task is to accelerate this operation by preloading image data into shared memory. You will have to split the operation into a number of blocks and only read the part of the image that is relevant for each block's computation.

Note that there are several approaches to image filtering that we do not take into account here. First, we could use separable filters. We could also use texture memory to take advantage of its cache. Neither is required here; the focus is on memory addressing and shared memory.

You need to use __shared__ for declaring shared memory, e.g. "__shared__ unsigned char buf[64];" for allocating a 64-byte array (note that the declaration needs a variable name).

After loading data into shared memory, but before processing, you should call __syncthreads() to make sure all threads in the block have finished loading.
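Putting these steps together, a tile-loading kernel could look like the sketch below. The block size, tile layout and names are assumptions for illustration, not taken from the lab files, and the sketch works on a single channel (the lab's PPM images are RGB, so you will need to handle the color components):

```cuda
#define BLOCK  16                      // assumed block size (16x16 threads)
#define RADIUS 2                       // 5x5 filter -> radius 2
#define TILE   (BLOCK + 2 * RADIUS)    // block plus an apron on every side

__global__ void boxfilter(const unsigned char *in, unsigned char *out,
                          int width, int height)
{
    // Shared tile: this block's pixels plus a RADIUS-wide border (the overlap
    // between neighboring blocks).
    __shared__ unsigned char tile[TILE][TILE];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Each thread loads one or more pixels so that the whole tile,
    // including the border, gets filled. Coordinates are clamped at
    // the image edges.
    for (int ty = threadIdx.y; ty < TILE; ty += blockDim.y)
        for (int tx = threadIdx.x; tx < TILE; tx += blockDim.x)
        {
            int gx = min(max(blockIdx.x * blockDim.x + tx - RADIUS, 0), width  - 1);
            int gy = min(max(blockIdx.y * blockDim.y + ty - RADIUS, 0), height - 1);
            tile[ty][tx] = in[gy * width + gx];
        }

    __syncthreads();  // all loads must finish before anyone reads the tile

    if (x < width && y < height)
    {
        int sum = 0;
        for (int dy = -RADIUS; dy <= RADIUS; dy++)
            for (int dx = -RADIUS; dx <= RADIUS; dx++)
                sum += tile[threadIdx.y + RADIUS + dy][threadIdx.x + RADIUS + dx];
        out[y * width + x] = sum / ((2 * RADIUS + 1) * (2 * RADIUS + 1));
    }
}
```

Note that every global memory access in the loop above reads consecutive addresses for consecutive threads, which is what makes the loads coalesced.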

Consider compiling with -arch=sm_30. That way you demand compute capability 3.0, which means fairly modern hardware (which you have in Southfork, but not in Olympen). Then you can printf() from the kernel, which is useful for debugging. (With older GPUs you can achieve this with device emulation, but then the code will run very slowly.)
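For example, assuming your source file is called filter.cu (substitute the actual file name from the lab files):

```shell
# Build for compute capability 3.0 so that printf() works in the kernel.
nvcc -arch=sm_30 filter.cu -o filter
```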

Lab files: a naive CUDA implementation which does not use shared memory. ppmfilter.c is a C implementation, included only as a reference. ppmread.c and ppmread.h read and write PPM files (a very simple image format).

NOTE: We don't expect you to make a perfectly optimized solution, but your solution should to a reasonable extent follow the guidelines for a good CUDA kernel. The first target (above) is to reduce global memory access, but have you thought about coalescing, control divergence and occupancy? Be prepared for additional questions on these subjects.

QUESTION: How much data did you put in shared memory?

QUESTION: How much data does each thread copy to shared memory?

QUESTION: How did you handle the necessary overlap between the blocks?

QUESTION: If we would like to increase the block size, about how big blocks would be safe to use in this case? Why?

QUESTION: How much speedup did you get over the naive version?

Extra (if you have time):

A "real" filter should have different weights for each element. For a 5x5 filter it may look like this (example from my lectures):

(Figure: 5x5 kernel weights, divided by 256.)

One possible way to implement these weights is as an array. Since all threads will access this array in the same order, it is suitable to store it in constant memory (as described in the second CUDA lecture). Create this array and make it available to your kernel.
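A sketch of the constant-memory setup is shown below. The weight values are an assumed example (a 5x5 binomial table summing to 256), not necessarily the exact table from the lecture slide, and the names are made up for illustration:

```cuda
#include <cuda_runtime.h>

// Constant memory on the device: read-only for kernels, cached, and
// broadcast efficiently when all threads read the same element.
__constant__ int d_weights[5][5];

// Assumed example weights; sum = 256, so divide the weighted sum by 256.
static const int h_weights[5][5] = {
    { 1,  4,  6,  4,  1},
    { 4, 16, 24, 16,  4},
    { 6, 24, 36, 24,  6},
    { 4, 16, 24, 16,  4},
    { 1,  4,  6,  4,  1},
};

// Call once on the host, before launching the kernel.
void uploadWeights(void)
{
    cudaMemcpyToSymbol(d_weights, h_weights, sizeof(h_weights));
}
```

Inside the kernel, the filter loop then multiplies each shared-memory pixel by d_weights[dy + 2][dx + 2] and divides the final sum by 256 instead of by the number of pixels.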

QUESTION: Were there any particular problems in adding this feature?

That is all for lab 5. Write down answers to all questions and then show your results to us.