Homework 1: CUDA, getting started and image filtering

Examination:

You should write down answers to all questions, make an archive of those answers together with your resulting source files, and mail it to me.

Deadline:

No strict deadline. My recommendation is to finish before HW 2, that is within a week.

0. Prerequisites

If you run in the lab, no installation is needed as it has been done for you! If you run on your own computer, this can be more or less complicated. Follow the instructions on NVidia's CUDA page.

However, in the lab you need to do the following:

Download this simple script:

cudaisy.sh

Switch to the bash shell by typing "bash" and run the script.

/bin/bash
source cudaida.sh

or

/bin/bash
source cudaisy.sh

Now you are ready to use CUDA!

1. Trying out CUDA

To get started, we will use the "simple.cu" example that I introduced in the lecture, arguably the simplest CUDA example. All it does is assign each element of an array its own index. Here is the program:

simple.cu
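For reference, the program is roughly along these lines (a sketch of what it contains; the file linked above is the authoritative version):

  // Sketch of a minimal CUDA program in the style of simple.cu:
  // every thread writes its own index into an array on the GPU.
  #include <stdio.h>

  const int N = 16;
  const int blocksize = 16;

  __global__ void simple(float *c)
  {
      c[threadIdx.x] = threadIdx.x;   // each thread fills one element
  }

  int main()
  {
      float *c = new float[N];
      float *cd;
      const int size = N * sizeof(float);

      cudaMalloc((void **)&cd, size);                   // allocate on the GPU
      dim3 dimBlock(blocksize, 1);
      dim3 dimGrid(1, 1);                               // a single block
      simple<<<dimGrid, dimBlock>>>(cd);                // launch the kernel
      cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);  // copy the result back
      cudaFree(cd);

      for (int i = 0; i < N; i++)
          printf("%f ", c[i]);
      printf("\n");
      delete[] c;
      return 0;
  }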

You can compile it with this simple command-line:

nvcc simple.cu -o simple

This works nicely in the lab, but for completeness, let me show what a slightly more elaborate command line can look like:

nvcc simple.cu -L /usr/local/cuda/lib -lcudart -o simple

Run with:

./simple

(Feel free to build a makefile if you prefer that.)

If it works properly, it should output the numbers 0 to 15 as floating-point numbers.

As it stands, the amount of computation is ridiculously small. We will soon address problems where performance is meaningful to measure. But first, let's modify the code to get used to the format.

QUESTION: How many cores will simple.cu use, max, as written? How many SMs?

2. Performance and block size

Now let us work on a larger dataset and make more computations. We will make fairly trivial item-by-item computations but experiment with the number of blocks and threads.

a) Array computation from C to CUDA

Here is a simple C program that takes two arrays (matrices) and adds them component-wise.

matrix_cpu.c

It can be compiled with
gcc matrix_cpu.c -o matrix_cpu -std=c99

and run with

./matrix_cpu

Write a CUDA program that performs the same thing, in parallel! Start with a grid size of (1, 1) and a block size of (N, N). Then try a bigger grid, with more blocks.

We must calculate an index from the thread and block numbers. That can look like this (in 1D):

int idx = blockIdx.x * blockDim.x + threadIdx.x;
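As a point of reference, here is roughly how such an index can be used in a 1D kernel (a sketch only; the names are just examples, and the assignment asks for the 2D version):

  // Sketch: 1D element-wise addition, one thread per element.
  __global__ void add_arrays(float *a, float *b, float *c, int n)
  {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < n)                  // guard against extra threads in the last block
          c[idx] = a[idx] + b[idx];
  }

  // Example launch, where gridsize * blocksize >= n:
  // add_arrays<<<gridsize, blocksize>>>(ad, bd, cd, n);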

QUESTION: How do you calculate the index in the array, using 2-dimensional blocks?

(From this point, some concepts will appear that I haven't covered in the lectures yet, like coalescing and CUDA events. Either you figure them out now or you wait for lecture 3.)

b) Larger data set and timing with CUDA Events

In order to measure execution time for your kernels you should use CUDA Events.

A CUDA event variable "myEvent" is declared as
  cudaEvent_t myEvent;
It must be initialized with
  cudaEventCreate(&myEvent);
You insert it in the CUDA stream with
  cudaEventRecord(myEvent, 0);
The 0 is the stream number, where 0 is the default stream.

To make sure an event has finished (received its value), call
  cudaEventSynchronize(myEvent);

Important! You must use cudaEventSynchronize before taking time measurements, or you don't know if the computation has ended!

Finally, you get the time between two events with
  cudaEventElapsedTime(&theTime, myEvent, laterEvent);
where theTime is a float.
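Put together, a timing sequence can look like this (a sketch; the kernel launch is just a placeholder for your own, and cudaEventElapsedTime reports milliseconds):

  cudaEvent_t startEvent, stopEvent;
  float theTime;                                 // elapsed time in milliseconds

  cudaEventCreate(&startEvent);
  cudaEventCreate(&stopEvent);

  cudaEventRecord(startEvent, 0);                // before the kernel launch
  // ... your kernel launch goes here ...
  cudaEventRecord(stopEvent, 0);                 // after the kernel launch
  cudaEventSynchronize(stopEvent);               // wait until the events have finished

  cudaEventElapsedTime(&theTime, startEvent, stopEvent);
  printf("Kernel time: %f ms\n", theTime);

  cudaEventDestroy(startEvent);                  // clean up (see also part 3)
  cudaEventDestroy(stopEvent);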

For timing from the CPU side (for CPU code, or as a complement to the CUDA events), you can use the following code:

milli.c
milli.h

Note that in CUDA, large arrays are best allocated in C++ style: float *c = new float[N*N];

Vary N. You should be able to run at least 1024x1024 items.

Vary the block size (and consequently the grid size).

Non-mandatory: Do you find the array initialization time-consuming? Write a kernel for that, too!

QUESTION: What happens if you use too many threads per block?

Note: You can get misleading output if you don't clear your buffers. Old data from earlier computations will remain in memory.

Also note that timings can be made with or without data transfers. In a serious computation, the data transfers certainly are important, but for our small examples it is more relevant to take the time without data transfers, or both with and without.

Note: When you increase the data size, there are (at least) two problems that you will run into. Think about what they are.

QUESTION: At what data size is the GPU faster than the CPU?

QUESTION: What block size seems like a good choice? Compared to what?

QUESTION: Write down your data size, block size and timing data for the best GPU performance you can get.

c) Coalescing

In this part, we will make a simple experiment to inspect the impact of memory coalescing, or the lack of it.

In part 2a, you calculated a formula for accessing the array items based on thread and block indices. Probably, a change in X resulted in one step in the array. Then adjacent threads access adjacent memory locations, and you get good coalescing. Swap the X and Y calculations so that a change in Y results in one step in the array. Measure the computation time.

Note: The results have varied a lot in this exercise. On my 9400, I got a very significant difference, while this difference is not quite as big in the lab. However, lack of difference can be caused by erroneous computation, so make sure that the output is correct.

QUESTION: How much performance did you lose by making data accesses non-coalesced?

3. More basic performance - Interactive Julia

I have written an interactive Julia renderer, where you can explore the fractal by moving the mouse. It is based on an example from the book "CUDA by Example", but rewritten to a single file for simplicity:

interactiveJulia.cu

Note that you must compile with -lglut (and possibly -lGL and -lGLU for other Unixes). (We use old-style OpenGL here since it is simpler when we just want to display an image.)
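For example, a command line along these lines should work (adjust the library flags to your system):

nvcc interactiveJulia.cu -o interactiveJulia -lglut -lGL -lGLU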

Put a timer in the code. Just by inspecting the code, you should find that it isn't reasonably written. Why? Fix the problem. (Hint: It is nothing that you haven't seen before, but rather a rehearsal of earlier problems - but a lot prettier.)

Here you will need one more CUDA event call: cudaEventDestroy(), to clean up after a CUDA event and avoid memory leaks.

QUESTION: What was the problem?

QUESTION: How did you correct it?

QUESTION: What speedup did you get?

4. Memory addressing, synchronization and shared memory in CUDA

It can hardly be stressed enough how important it is to utilize shared memory when doing GPU computing. Shared memory allows us to re-arrange memory before writing, to take advantage of coalescing, and to re-use data to reduce the total number of global memory accesses.

In this exercise, you will start from a working CUDA program which applies a linear filter to an image. The original image is shown to the left, and the filtered image to the right (using display code that is almost the same as in interactiveJulia from part 3 above).

Your task is to accelerate this operation by preloading image data into shared memory. You will have to split the operation into a number of blocks and only read the part of the image that is relevant for your computation.

It can be noted that there are several ways to do image filtering that we do not take into account here. First of all, we could use separable filters. We could also use texture memory in order to take advantage of the texture cache. Neither is required here. The focus here is memory addressing and shared memory.

You need to use __shared__ for declaring shared memory, e.g. "__shared__ unsigned char buffer[64];" for allocating a 64-byte array (the variable name is up to you).

After loading data into shared memory, and before processing it, you should call __syncthreads(); to synchronize the block.
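The general pattern looks something like this (a sketch only, not the finished filter; the block size and names are just examples): each thread loads its part of the data, and then the whole block synchronizes before anyone reads it.

  #define BLOCKSIZE 16                     // example block side, pick your own

  __global__ void exampleKernel(unsigned char *image, int imageWidth)
  {
      __shared__ unsigned char buffer[BLOCKSIZE * BLOCKSIZE];

      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      int localIdx = threadIdx.y * blockDim.x + threadIdx.x;

      buffer[localIdx] = image[y * imageWidth + x];   // global -> shared memory
      __syncthreads();                                // whole block has loaded

      // ... now process using buffer[] instead of image[] ...
  }

Note that this sketch does not handle the overlap between blocks that the filter needs; that is part of your task.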

Consider compiling with -arch=sm_30. That way you demand compute capability 3, which means fairly modern hardware (which you have in Southfork, but not in Olympen). Then you can printf() from the kernel, which is useful for debugging. (With older GPUs you can get the same effect with device emulation, but then the code will run very slowly.)
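For example (the file name is just a placeholder for your own):

nvcc yourfile.cu -o yourfile -arch=sm_30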

Lab files:

ppmfilter.zip

ppmfilter.cu is a naive CUDA implementation which is not using shared memory. ppmfilter.c is a C implementation, included only as reference. ppmread.c and ppmread.h read and write PPM files (a very simple image format).

NOTE: We don't expect you to make a perfectly optimized solution, but your solution should to a reasonable extent follow the guidelines for a good CUDA kernel. The first target (above) is to reduce global memory access, but have you thought about coalescing, control divergence and occupancy? Be prepared for additional questions on these subjects.

QUESTION: How much data did you put in shared memory?

QUESTION: How much data does each thread copy to shared memory?

QUESTION: How did you handle the necessary overlap between the blocks?

QUESTION: If we would like to increase the block size, about how big blocks would be safe to use in this case? Why?

QUESTION: How much speedup did you get over the naive version?

5. Make a better filter:

A "real" filter should have different weights for each element. For a 5x5 filter it may look like this (example from my lectures):

(image: the 5x5 kernel weights, each divided by 256)

One possible way to implement these weights is as an array. Since all threads will access this array in the same order, it is suitable to store this array as constant memory (as described in the second CUDA lecture). Create this array and make it available to your kernel.
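A sketch of how that can be done (assuming a 5x5 filter; the names are just examples):

  __constant__ float filterWeights[5 * 5];           // constant memory on the device

  // On the host, before launching the kernel:
  float hostWeights[5 * 5] = { /* the 25 weights, each divided by 256 */ };
  cudaMemcpyToSymbol(filterWeights, hostWeights, sizeof(hostWeights));

  // In the kernel, filterWeights[] can then be read directly by all threads.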

QUESTION: Were there any particular problems in adding this feature?


That is all for homework #1. Write down answers to all questions and submit your results (with code) to me.