GPU Computing course - Laboration 3

Computing with CUDA

 

BRAND NEW LAB - PLEASE ASK IF SOMETHING IS VAGUE

Conclusions after the scheduled lab: The software works as intended and needs no updating. However, a few clarifications have been added to the text.

 

Here is the third and final lab, and with it the third major approach to GPU computing. Are we in for any surprises? Not really. Concerning the kernel languages, the similarities are more numerous than the differences. The big difference with CUDA is, obviously, that we get an integrated source, with host and kernel code in the same file (if we want). Run-time code translation is hidden from us.

Examination:

As before, you should write down answers to all questions and E-mail the answers and resulting code to me.

Deadline:

To be announced.

Part 0. Prerequisites

If you run in the lab, no installation is needed as it has been done for you! If you run on your own computer, this can be more or less complicated. Follow the instructions on NVidia's CUDA page.

However, in the lab you need to do the following:

Download this simple script:

cuda.sh

Switch to the bash shell by typing "bash" and run the script.

/bin/bash
source cuda.sh

Now you are ready to use CUDA!

Part 1. Trying out CUDA

To get started, we will use the "simple.cu" example that I introduced in the lecture, arguably the simplest CUDA example. All it does is assign every element of an array its own index.

a) Compile and run simple.cu

Here is the program:

simple.cu
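
For reference, here is a minimal sketch of this kind of program. It is an assumed reconstruction, not necessarily identical to the provided file, so use the downloaded simple.cu as your starting point:

// Sketch of a minimal CUDA program of the simple.cu kind (assumed, for reference only)
#include <stdio.h>

const int N = 16;
const int blocksize = 16;

// Each thread writes its own index into the array.
__global__ void simple(float *c)
{
    c[threadIdx.x] = threadIdx.x;
}

int main()
{
    float *c = new float[N];              // host array
    float *cd;                            // device array
    const int size = N * sizeof(float);

    cudaMalloc((void**)&cd, size);
    dim3 dimBlock(blocksize, 1);
    dim3 dimGrid(1, 1);
    simple<<<dimGrid, dimBlock>>>(cd);
    cudaThreadSynchronize();
    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);
    cudaFree(cd);

    for (int i = 0; i < N; i++)
        printf("%f ", c[i]);
    printf("\n");
    delete[] c;
    return 0;
}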

You can compile it with this simple command-line:

nvcc simple.cu -o simple

This seems to work nicely in the lab, but just in case, let me show what a slightly more elaborate command line can look like:

nvcc simple.cu -L /usr/local/cuda/lib -lcudart -o simple

Run with:

./simple

(Feel free to build a makefile. You can base it on those from earlier labs.)

If it works properly, it should output the numbers 0 to 15 as floating-point numbers.

As it stands, the amount of computation is ridiculously small. We will soon address problems where performance is meaningful to measure. But first, let's modify the code to get used to the format.

QUESTION: How many cores will simple.cu use, max, as written?

b) Modifying simple.cu

Allocate an array of data and use it as input. Calculate the square root of every element. Inspect the output to verify that it is correct.

To upload data to the GPU, you must use the cudaMemcpy() call with cudaMemcpyHostToDevice.
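
As a sketch of the host-side data flow (variable names are illustrative only; the kernel is yours to write):

// Host-side data flow for part b (illustrative names, kernel left as the exercise)
const int N = 16;
float *a = new float[N];                 // input on the host
float *b = new float[N];                 // result on the host
float *ad, *bd;                          // device copies
const int size = N * sizeof(float);

for (int i = 0; i < N; i++)
    a[i] = (float)i;                     // some input data

cudaMalloc((void**)&ad, size);
cudaMalloc((void**)&bd, size);
cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);   // upload input to the GPU

// launch your kernel here, e.g. mykernel<<<dimGrid, dimBlock>>>(ad, bd);

cudaMemcpy(b, bd, size, cudaMemcpyDeviceToHost);   // download the result
cudaFree(ad);
cudaFree(bd);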

Note: If you increase the data size here, please check out the comments in section 2.

QUESTION: Is the calculated square root identical to what the CPU calculates? Should we assume that this is always the case?

2. Performance and block size

Now let us work on a larger dataset and make more computations. We will make fairly trivial item-by-item computations but experiment with the number of blocks and threads.

a) Array computation from C to CUDA

Here is a simple C program that takes two arrays (matrices) and adds them component-wise.

matrix_cpu.c

Write a CUDA program that performs the same thing, in parallel! Use (for now) a grid size of (1, 1) and a block size of (N, N).
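
The grid and block sizes are passed to the kernel launch as dim3 values. A sketch (kernel name and arguments are placeholders):

dim3 dimBlock(N, N);     // N*N threads in a single block
dim3 dimGrid(1, 1);      // one block, for now
add_matrix<<<dimGrid, dimBlock>>>(ad, bd, cd);   // placeholder name and arguments
cudaThreadSynchronize();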

We must calculate an index from the thread and block numbers. That can look like this (in 1D):

int idx = blockIdx.x * blockDim.x + threadIdx.x;

QUESTION: How do you calculate the index in the array, using 2-dimensional blocks?

b) Larger data set and timing

Add a timer (milli.c) and time the same computation in both C and CUDA for comparison. The timer is easiest to add to your program with

#include "milli.c"

Download here:

milli.c

milli.h

Note that in CUDA, large arrays are best allocated in C++ style: float *c = new float[N*N];

Vary N. You should be able to run at least 1024x1024 items.

Vary the block size (and consequently the grid size).

Important! You must call cudaThreadSynchronize() (or cudaDeviceSynchronize() in newer CUDA versions) before taking time measurements; otherwise you do not know whether the computation has finished!

Also note that timings can be made with or without data transfers. In a serious computation the data transfers certainly matter, but for our small examples it is more relevant to measure the time without data transfers, or both with and without.
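
If you want a cross-check of the milli.c measurements, CUDA's own event API can be used to time the kernel alone. A sketch of that pattern (the kernel launch is a placeholder):

// Timing the kernel with CUDA events (alternative or complement to milli.c)
cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// kernel launch goes here, e.g. add_matrix<<<dimGrid, dimBlock>>>(ad, bd, cd);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              // wait until the kernel has finished
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("GPU time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);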

Note: When you increase the data size, there are (at least) two problems that you will run into: a single block can only hold a limited number of threads, so the whole array will no longer fit in one block, and large host arrays should not be allocated on the stack (see the allocation note above).

QUESTION: At what data size is the GPU faster than the CPU?

QUESTION: What block size seems like a good choice? Compared to what?

QUESTION: Write down your data size, block size and timing data for the best GPU performance you can get.

c) Coalescing

In this part, we will make a simple experiment to inspect the impact of memory coalescing, or the lack of it.

In part 2a, you calculated a formula for accessing the array items based on thread and block indices. Most likely, a change in X results in one step in the array. That makes accesses aligned, and you get good coalescing. Now swap the X and Y calculations so that a change in Y results in one step in the array instead, and measure the computation time.
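
The idea, shown in one dimension only (an assumed kernel pair, just to illustrate the two access patterns; your 2D version follows the same principle):

// Coalesced: neighbouring threads touch neighbouring memory locations.
__global__ void coalesced(float *a, float *b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // assumes the grid covers the array exactly
    b[idx] = a[idx] + 1.0f;
}

// Non-coalesced: neighbouring threads touch locations N elements apart.
__global__ void noncoalesced(float *a, float *b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int row = idx / N;
    int col = idx % N;
    b[col * N + row] = a[col * N + row] + 1.0f;        // stride-N accesses between neighbouring threads
}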

Note: The results have varied a lot in this exercise. On my 9400 I got a very significant difference, while the difference may not be as big in the lab. However, a lack of difference can also be caused by an erroneous computation, so make sure that the output is correct.

QUESTION: How much performance did you lose by making data accesses non-coalesced?

3. Memory addressing, synchronization, shared memory

It can hardly be stressed enough how important it is to utilize shared memory when doing GPU computing. Shared memory allows us to re-arrange memory before writing, to take advantage of coalescing, and to re-use data to reduce the total number of global memory accesses.

In this exercise, you will start from a working CUDA program which applies a linear filter to an image. Your task is to accelerate this operation by preloading image data into shared memory. You will have to split the operation into a number of blocks, and only read the part of the image that is relevant for your computation.

It can be noted that there are several ways to do image filtering that we do not take into account here. First of all, we could use separable filters, as in lab 1. We could also use texture memory in order to take advantage of the texture cache. Neither is required here. The focus is on memory addressing and shared memory.

You need to use __shared__ for declaring shared memory, e.g. "__shared__ unsigned char buf[64];" for allocating a 64-byte array.

After loading data into shared memory, and before processing it, you should call __syncthreads(); to synchronize the threads in the block.
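
A tiny sketch of the load-synchronize-process pattern, in 1D and with made-up sizes, just to show the mechanics (the filter kernel needs a 2D variant with overlap between blocks):

// Assumes a block of exactly 64 threads; sizes and names are illustrative only.
__global__ void example(unsigned char *image, unsigned char *out)
{
    __shared__ unsigned char buf[64];          // one tile of the image per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = image[i];               // each thread loads one element

    __syncthreads();                           // wait until the whole tile is loaded

    // ... process using buf[] instead of image[] ...
    out[i] = buf[threadIdx.x];
}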

Consider using the compilation switch --device-emulation to run in emulation while debugging; then you can printf() from the kernel. (Note that device emulation has been removed from newer CUDA versions; on recent GPUs you can instead printf() directly from a kernel running on the device.)

Lab files:

ppmfilter.zip

ppmfilter.cu is a naive CUDA implementation which does not use shared memory. ppmfilter.c is a C implementation, included only as a reference. ppmread.c and ppmread.h are new variants of the PPM loader, made CUDA-friendly.

Note: To view PPM images, just double-click them and Linux will find the viewer for you.

QUESTION: How much data did you put in shared memory?

QUESTION: How much data does each thread copy to shared memory?

QUESTION: How did you handle the necessary overlap between the blocks?

QUESTION: If we wanted to increase the block size, about how large blocks would be safe to use in this case? Why?

BONUS QUESTION: How much speedup did you get over the naive version?