Laboration 4: Introduction to CUDA

This lab introduces you to CUDA computing with a few small exercises. We have deliberately avoided larger amounts of code for this lab, and save that for lab 5.

Remote access:

Some of you have wished to do some of the lab work remotely. This is possible, at least for the parts without graphical output. You ssh to ISY's server ixtab, and from there to one of the Southfork computers.

You cannot run on ixtab, since it has no GPU!

So you do it like this:

ssh USERNAME@ixtab.edu.isy.liu.se
ssh southfork-06.edu.isy.liu.se

(Or some other Southfork machine. You don't all have to use number 06.)

Note: You should not run remotely on a busy machine! Use the "w" command to see if someone else is logged in.

Examination:

You should write down answers to all questions and then show us your results.

Deadline:

To be announced. All three labs 4-6 have a common deadline after the last lab.

Part 0. Prerequisites

If you run in the lab, no installation is needed as it has been done for you! If you run on your own computer, this can be more or less complicated. Follow the instructions on NVidia's CUDA page.

However, in the lab you need to do the following:

Download this simple script:

cudaida.sh (for the IDA lab)

cudaisy.sh (for the ISY lab)

Switch to the bash shell and source the script:

/bin/bash
source cudaida.sh

or

/bin/bash
source cudaisy.sh

Now you are ready to use CUDA!

Part 1. Trying out CUDA

To get started, we will use the "simple.cu" example that I introduced in the lecture, arguably the simplest CUDA example. All it does is assign every element of an array its index.

a) Compile and run simple.cu

Here is the program:

simple.cu
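
If you want a quick idea of what the program does before opening the file, the core of the kernel looks roughly like this (a sketch; see the file above for the complete program):

__global__ void simple(float *c)
{
    c[threadIdx.x] = threadIdx.x;   // each thread writes its own index into the array
}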

You can compile it with this simple command-line:

nvcc simple.cu -o simple

This works nicely in the lab, but for completeness, let me show how a slightly more elaborate command line can look:

nvcc simple.cu -L /usr/local/cuda/lib -lcudart -o simple

Run with:

./simple

(Feel free to build a makefile. You can base it on those from earlier labs.)

If it works properly, it should output the numbers 0 to 15 as floating-point numbers.

As it stands, the amount of computation is ridiculously small. We will soon address problems where performance is meaningful to measure. But first, let's modify the code to get used to the format.

QUESTION: How many cores will simple.cu use, max, as written? How many SMs?

b) Modifying simple.cu

Allocate an array of data and use it as input. Calculate the square root of every element. Inspect the output to make sure it is correct.

To upload data to the GPU, you must use the cudaMemcpy() call with cudaMemcpyHostToDevice.
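
A minimal sketch of the pieces involved, assuming a host input array in_h and an output array out_h of N floats (all names here are placeholders, adapt them to your own code):

__global__ void squareRoot(float *out, float *in)
{
    int idx = threadIdx.x;          // one block of N threads
    out[idx] = sqrtf(in[idx]);      // single-precision square root on the GPU
}

// on the host side (error checking omitted):
float *in_d, *out_d;
cudaMalloc((void **)&in_d,  N * sizeof(float));
cudaMalloc((void **)&out_d, N * sizeof(float));
cudaMemcpy(in_d, in_h, N * sizeof(float), cudaMemcpyHostToDevice);    // upload input
squareRoot<<<1, N>>>(out_d, in_d);                                    // one block, N threads
cudaMemcpy(out_h, out_d, N * sizeof(float), cudaMemcpyDeviceToHost);  // download result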

Note: If you increase the data size here, please check out the comments in section 2.

QUESTION: Is the calculated square root identical to what the CPU calculates? Should we assume that this is always the case?

Part 2. Performance and block size

Now let us work on a larger dataset and make more computations. We will make fairly trivial item-by-item computations but experiment with the number of blocks and threads.

a) Array computation from C to CUDA

Here is a simple C program that takes two arrays (matrices) and adds them component-wise.

matrix_cpu.c

It can be compiled with
gcc matrix_cpu.c -o matrix_cpu -std=c99

and run with

./matrix_cpu

Write a CUDA program that performs the same thing, in parallel! Start with a grid size of (1, 1) and a block size of (N, N). Then try a bigger grid, with more blocks.
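
The launch configuration is given with dim3 variables. A sketch, where add_matrix is a placeholder for your own kernel name and blocksize is assumed to divide N evenly:

dim3 dimBlock(N, N);       // to start with: a single block covering the whole array
dim3 dimGrid(1, 1);
// later, with more blocks:
// dim3 dimBlock(blocksize, blocksize);
// dim3 dimGrid(N / blocksize, N / blocksize);
add_matrix<<<dimGrid, dimBlock>>>(a_d, b_d, c_d, N);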

We must calculate an index from the thread and block numbers. That can look like this (in 1D):

int idx = blockIdx.x * blockDim.x + threadIdx.x;

QUESTION: How do you calculate the index in the array, using 2-dimensional blocks?

b) Larger data set and timing with CUDA Events

In order to measure execution time for your kernels you should use CUDA Events.

A CUDA event variable "myEvent" is declared as
  cudaEvent_t myEvent;
It must be initialized with
  cudaEventCreate(&myEvent);
You insert it in the CUDA stream with
  cudaEventRecord(myEvent, 0);
The 0 is the stream number, where 0 is the default stream.

To make sure an event has finished (received its value), call
  cudaEventSynchronize(myEvent);

Important! You must use cudaEventSynchronize before taking time measurements, or you don't know if the computation has ended!

Finally, you get the time between two events with
  cudaEventElapsedTime(&theTime, myEvent, laterEvent);
where theTime is a float. The time is given in milliseconds.
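
Put together, a timing sketch can look like this (the kernel launch is just a placeholder for your own code):

cudaEvent_t startEvent, endEvent;
float theTime;                          // elapsed time, in milliseconds

cudaEventCreate(&startEvent);
cudaEventCreate(&endEvent);

cudaEventRecord(startEvent, 0);         // insert "start" in the default stream
// ... kernel launch goes here ...
cudaEventRecord(endEvent, 0);           // insert "end" in the default stream

cudaEventSynchronize(endEvent);         // wait until both events have their values
cudaEventElapsedTime(&theTime, startEvent, endEvent);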

For timing CPU code, you can use the following code:

milli.c
milli.h

Note that in CUDA, large arrays are best allocated in C++ style: float *c = new float[N*N];

Vary N. You should be able to run at least 1024x1024 items.

Vary the block size (and consequently the grid size).

Non-mandatory: Do you find the array initialization time consuming? Write a kernel for that, too!

QUESTION: What happens if you use too many threads per block?

Note: You can get misleading output if you don't clear your buffers. Old data from earlier computations will remain in the memory.

Also note that timings can be made with or without data transfers. In serious computations, the data transfers certainly are important, but for our small examples it is more relevant to measure the time without data transfers, or both with and without.

Note: When you increase the data size, there are (at least) two problems that you will run into: the number of threads per block is limited, so you will need more than one block, and large arrays should no longer be allocated on the stack (see the note above about allocating with new).

QUESTION: At what data size is the GPU faster than the CPU?

QUESTION: What block size seems like a good choice? Compared to what?

QUESTION: Write down your data size, block size and timing data for the best GPU performance you can get.

c) Coalescing

In this part, we will make a simple experiment to inspect the impact of memory coalescing, or the lack of it.

In part 2, you calculated a formula for accessing the array items based on thread and block indices. Probably, a change in X resulted in one step in the array. That makes neighbouring threads access neighbouring elements, and you get good coalescing. Swap the X and Y calculations so that a change in Y results in one step in the array instead, and measure the computation time.
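
As a sketch of what the swap means, assuming a row-major N by N array (your own index formula from part 2 may look slightly different):

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

int coalescedIdx    = y * N + x;   // a step in x is one step in memory: coalesced
int nonCoalescedIdx = x * N + y;   // a step in x is N steps in memory: non-coalesced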

Note: The results have varied a lot in this exercise. On my 9400, I got a very significant difference, while this difference is not quite as big in the lab. However, lack of difference can be caused by erroneous computation, so make sure that the output is correct.

QUESTION: How much performance did you lose by making data accesses non-coalesced?

Part 3. One more time on basic performance - Interactive Julia

CUDA by Example has a Julia fractal demo that wasn't optimized for speed. This is understandable, since the demo was only meant as an example of simple CUDA code, and a teaser for how easy it is to get some visual output. The demo was also kind of dull, only producing a still image. I have written an interactive version, where you can explore the fractal by moving the mouse. It is also rewritten into a single file for simplicity:

interactiveJulia.cu

Note that you must compile with -lglut (and possibly -lGL and -lGLU for other Unixes).

Put a timer in the code. Just by inspecting the code, you should find that it isn't reasonably written. Why? Fix the problem. (Hint: It is nothing that you haven't seen before, but rather a rehearsal of older problems - just a lot prettier.)

Here you will need one more CUDA event call: cudaEventDestroy(), to clean up after a CUDA event and avoid memory leaks.

QUESTION: What was the problem?

QUESTION: How did you correct it?

QUESTION: What speedup did you get?

Part 4. Beefing up the interactive fractal (non-mandatory, if you have time)

This exercise is not mandatory, but if you have the time, we suggest that you work with the interactive fractal of part 3. You don't need any OpenGL experience but it will help you to understand the lab shell.

Learning goal with this part: To work with a CUDA-based program, editing CPU as well as GPU code, exercising the combination while still using simple CUDA code. Impressing other students and maybe the lab assistant.

As a teaser, I made this slightly changed color scheme, and it only took a few minutes to code! (And of course it runs just as fast as your solution to part 3.)



Feel free to experiment with how to map the result of the recursion to the presented colors. Much of the beauty of fractals comes from the choice of colors and thresholds. You may also consider totally different changes, like changing the scale and translation to make it a fractal explorer (Julia or Mandelbrot), etc.

QUESTION: What kind of modifications did you implement?


That is all for lab 4. Write down answers to all questions and then show your results to us.