Lab 6

OpenCL

Reduction and sorting

The compilation lines (below and in the source files) are tested in Southfork, Signal&Bild and the Multicore lab. In the Multicore lab you may need to change the path to CUDA slightly, since an alias was missing.

1. Getting started: Hello, World!

Here is our own Hello World! for OpenCL:

hello_world_cl.c

This uses CLutilities, a small utility unit that we will use throughout the lab to make life easier:

CLutilities.c

CLutilities.h

Updated 2021! Some old files are saved here in case of broken new uploads:

not so old files (just previous 2021 version)

old files

Like all my parallel Hello World! examples, it uses the string "Hello " plus an array of offsets to produce "World!", thus performing a tiny parallel computation while still doing what it should: producing the string "Hello World!". It is, however, quite a bit longer and harder to read than the CUDA example.
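
The principle can be sketched roughly like this (the kernel and argument names below are made up for illustration; the actual kernel in hello_world_cl.c may differ in detail):

// Sketch only: each work-item adds one offset to one character,
// turning a padded "Hello " buffer into "Hello World!".
__kernel void hello(__global char *str, __global int *offset)
{
    int i = get_global_id(0);   // which element this work-item handles
    str[i] += offset[i];
}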

Compile with

gcc hello_world_cl.c CLutilities.c -lOpenCL -I/usr/local/cuda/include -o hello_world_cl

(If you use macOS, the -lOpenCL flag should be replaced by -framework OpenCL, as usual.)
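
That is, something along these lines (assuming the OpenCL headers are found through the framework, so the CUDA include path is not needed):

gcc hello_world_cl.c CLutilities.c -framework OpenCL -o hello_world_cl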

Run with

./hello_world_cl

You are not expected to edit the code (although you are welcome to experiment, of course). But knowing what the code does, you should be able to figure out where some interesting things take place. First of all, have a look at the input data and the kernel, and then at the main program.

Question: How is the communication between the host and the graphics card handled?

Question: What function executes your kernel?

Question: How does the kernel know what element to work on?

2. Reduction

Reduction is an important operation that is in some sense related to sorting, which makes it a good fit for this lab. The task is to find the maximum of a given dataset efficiently.

You work from these files:

find_max.c

find_max.cl

In find_max.c, there is a simple, sequential CPU function for finding the maximum of a dataset.

For easy timing we provide these (again):

milli.h
milli.c

Note that this is not a GPU-friendly problem. The problem is sequential in nature; the GPU can only compete by means of reduction, which is not massively parallel in all stages, while the CPU is quite good at the task. However, the GPU will compete better and better the bigger the problem is. We do not require a perfectly optimized version, only a fairly basic reduction. You are, of course, encouraged to make it as fast as you can!
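
One straightforward approach is a step-wise reduction in global memory, where the host launches the kernel repeatedly and each pass halves the number of remaining candidates. A minimal sketch (names and argument choices are placeholders, and it assumes a power-of-two data size):

// Sketch: one reduction step. Each work-item compares two elements that are
// "stride" apart and keeps the larger one in the lower half of the array.
// The host enqueues this with global size = stride, halving stride each pass
// (from length/2 down to 1), until the maximum ends up in data[0].
__kernel void reduce_max_step(__global unsigned int *data, const unsigned int stride)
{
    unsigned int i = get_global_id(0);
    if (data[i + stride] > data[i])
        data[i] = data[i + stride];
}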

For your convenience, the execution of the kernel has been isolated into a separate function, runKernel(), so the parent function, find_max_gpu, is simplified. You may wish to edit it, depending on how your solution works.

QUESTION: What timing did you get for your GPU reduction? Compare it to the CPU version.

QUESTION: Try larger data sizes. At what size does the GPU version become faster than, or at least comparable to, the CPU?

QUESTION: How can you optimize this further? You should know at least one way.

3. Bitonic merge sort

Sorting on the GPU is a challenging task. Most sorting algorithms, including QuickSort, have been implemented on the GPU, but many of them are data dependent, which makes them complicated. All sorting algorithms are challenging for the GPU, since the problem does not map as straightforwardly onto massive parallelism as many other problems do.

In this lab, we will focus on bitonic merge sort, which is data independent and, while not quite as fast as QuickSort, still quite fast.

As a base, you get a working CPU implementation of bitonic merge sort, in C/C++:

bitonic.cl

bitonic.c

This code is not even optimal on the CPU (it is single threaded), and not in the least parallel, but it is not hard to parallelize.

What you need to do is rewrite the CPU version into an OpenCL program/kernel. Your solution should work for at least 100000 elements and for any kind of input data. For simplicity, you only need to consider power-of-two sizes! If you like, you can follow this schedule (a rough kernel sketch follows the list of steps):

Step 1: Turn it into an OpenCL program. You can do this single-threaded to begin with.

Step 2: Make it work multi-threaded in one workgroup (block). 

Step 3: Make it work in multiple workgroups.

Step 4: Measure speedup for varying size. When does it break even (with single thread CPU)?
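
The key observation for steps 1-3 is that the two outer loops of the CPU code (over k and j) can stay on the host, while the innermost loop over the elements becomes the kernel, one work-item per element. Roughly (a sketch only; names, types and how k and j are passed are placeholders):

// Sketch: one bitonic compare-exchange step for a given (k, j) pair.
// The host loops over k and j and launches one kernel per pair.
__kernel void bitonic_step(__global unsigned int *data, const unsigned int k, const unsigned int j)
{
    unsigned int i = get_global_id(0);
    unsigned int ixj = i ^ j;                 // partner element of i in this step
    if (ixj > i)                              // only one work-item of the pair does the swap
    {
        if ((i & k) == 0 && data[i] > data[ixj])
        { unsigned int tmp = data[i]; data[i] = data[ixj]; data[ixj] = tmp; }
        if ((i & k) != 0 && data[i] < data[ixj])
        { unsigned int tmp = data[i]; data[i] = data[ixj]; data[ixj] = tmp; }
    }
}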

As in part 2, the function runKernel() is provided, but in this case you are likely to edit it, and its parameters, more than you did there.

QUESTION: Should each thread produce one output or two? Why?

QUESTION: How many items can you handle in one workgroup?

QUESTION: What problem must be solved when you use more than one workgroup? How did you solve it?

QUESTION: What time do you get? What is the difference to the CPU? What is the break-even size? What can you expect from a parallel CPU version? (Your conclusions here may vary between the labs.)

STEPS 5-6 ARE NON-MANDATORY.

Step 5: Use local (shared) memory for optimization. You should work along these lines (a rough sketch follows below):

Consider a small version of the algorithm, running in a single block.

- Have each thread read one item into local memory.

- Sort in local memory.

- Write back.

How can you use this for a larger algorithm? How about a part of it?
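
For the single-workgroup case, the core might look roughly like this (a sketch under the assumption that the data fits in one workgroup, with one work-item per element; the __local buffer would be set up by the host with clSetKernelArg and a NULL argument pointer):

// Sketch: bitonic sort of one workgroup's worth of data in local memory.
__kernel void bitonic_local(__global unsigned int *data, __local unsigned int *shared, const unsigned int length)
{
    unsigned int i = get_local_id(0);

    shared[i] = data[i];                      // each work-item loads one element
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int k = 2; k <= length; k <<= 1)
        for (unsigned int j = k >> 1; j > 0; j >>= 1)
        {
            unsigned int ixj = i ^ j;
            if (ixj > i)
            {
                if ((i & k) == 0 && shared[i] > shared[ixj])
                { unsigned int tmp = shared[i]; shared[i] = shared[ixj]; shared[ixj] = tmp; }
                if ((i & k) != 0 && shared[i] < shared[ixj])
                { unsigned int tmp = shared[i]; shared[i] = shared[ixj]; shared[ixj] = tmp; }
            }
            barrier(CLK_LOCAL_MEM_FENCE);     // all work-items must finish the step
        }

    data[i] = shared[i];                      // write the result back to global memory
}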

Step 6: Measure speedup again. How does it compare to step 4?

(QUESTION: How can you use local memory to optimize? Did your multiple block solution change?)

(QUESTION: How did your execution time change when using local memory?)