Lab 5: Bitonic sort in CUDA

Important! Work in progress!

This lab is brand new as of 2014, and is uploaded as a "beta". Last minute changes during the tuesday are likely.

Recommended preparation: Study the bitonic merge sort algorithm! Can you see the access patterns?

Sorting is an important problem for the course. This lab (new for 2014) is entirely about that. Sorting on the GPU is a challenging task. Most sorting algorithms, including QuickSort, have been implemented on the GPU, but many sorting algorithm are highly data dependent, which makes them complicated. All sorting algorithms are challenging for the GPU, since the problem is not as straight-forward and fitting as others.

In this lab, we will focus on bitonic merge sort, which is data independent, not quite as fast as QuickSort but still quite fast.

As a base, you get a working CPU implementation of bitonic merge sort, in C/C++:

bitonic_kernel.cpp
bitonic_main.cpp
bitonic_kernel.h
milli.h
milli.c

This code is not even optimal on the CPU (single threaded), and not in the least parallel, but not hard to make parallel.

What you need to do is to rewrite this program (bitonic_kernel to be precise) to a CUDA program. Your solution should work for at least 100000 elements and for any kind of input. For simplicity, you only need to consider power-of-two sizes! If you like, you can work after this schedule:

Step 1: Turn it into a CUDA program. You can do this single-threaded to begin with.

Step 2: Make it work multi-threaded in one block. Change bitonic_main to larger input.

Step 3: Make it work in multiple blocks.

Step 4: Measure speedup for varying size. When does it break even (with single thread CPU)?

QUESTION: Should each thread produce one output or two? Why?

QUESTION: How many items can you handle in one block?

QUESTION: What problem must be solved when you use more than one block? How did you solve it?

QUESTION: What time do you get? Difference to the CPU? What is the break even size? What can you expect for a parallel CPU version? (Your conclusions here may vary between the labs.)

That is all for lab 5. Write down answers to all questions and then show your results to us.

The sorting competition (non-mandatory):

If you want to compete in the course challenge, this is the lab to work from for the GPU side. There are more improvements to add after what we deand for the lab.

News from Nicolas: If you want to compete only on the GPU side, that is OK, you do not have to submit a "dummy" CPU version, but then you can't win the total.

GPU submissions should be sent to Ingemar.