Bitonic sort in CUDA
Sorting is an important problem for the course. This lab is entirely about sorting and sorting-related tasks.
Recommended preparation: Study the bitonic merge sort algorithm! Can you see the access patterns?
Reduction is an important operation which in some sense can be related to sorting, which makes it fit in this lab. The task will be to efficiently find the maximum of a given dataset.
You start from this file:
In this file, there is simple, sequential CPU for finding the maximum of a dataset.
It is notable that this is not a GPU friendly problem. The problem is sequential in nature, the GPU can only compete by means of reduction, which is not massively parallel in all stages, while the CPU is pretty good at the task. However, the GPU will compete better and better the bigger the problem is. We do not require a perfectly optimized version, only a fairly basic reduction. You are, of course, encouraged to make it as fast as you can!
QUESTION: What timing did you get for your GPU reduction? Compare it to the CPU version.
QUESTION: Try larger data size. On what size does the GPU version get faster, or at least comparable, to the GPU?
QUESTION: How can you optimize this further? You should know at least one way.
2. Bitonic merge sort
Sorting on the GPU is a challenging task. Most sorting algorithms, including QuickSort, have been implemented on the GPU, but many sorting algorithms are data dependent, which makes them complicated. All sorting algorithms are challenging for the GPU, since the problem is not as straight-forward and fitting as others.
In this lab, we will focus on bitonic merge sort, which is data independent, not quite as fast as QuickSort but still quite fast.
As a base, you get a working CPU implementation of bitonic merge sort, in C/C++:
This code is not even optimal on the CPU (single threaded), and not in the least parallel, but not hard to make parallel.
What you need to do is to rewrite this program (bitonic_kernel to be precise) to a CUDA program. Your solution should work for at least 100000 elements and for any kind of input. For simplicity, you only need to consider power-of-two sizes! If you like, you can work after this schedule:
Step 1: Turn it into a CUDA program. You can do this single-threaded to begin with.
Step 2: Make it work multi-threaded in one block. Change bitonic_main to larger input.
Step 3: Make it work in multiple blocks.
Step 4: Measure speedup for varying size. When does it break even (with single thread CPU)?
STEP 5-6 ARE NOT SATISFACORY YET, THUS NON-MANDATORY. SORRY FOR THE INCONVENIENCE.
Step 5: Use shared memory for optimizing. You should work along these lines:
Consider a small version of the algorithm, running in a single block.
- Have all threads read one item to shared memory.
- Sort in shared memory.
- Write back.
How can you use this for a larger algorithm? How about a part of it?
Step 6: Measure speedup again. How does it compare to step 4?
QUESTION: Should each thread produce one output or two? Why?
QUESTION: How many items can you handle in one block?
QUESTION: What problem must be solved when you use more than one block? How did you solve it?
QUESTION: What time do you get? Difference to the CPU? What is the break even size? What can you expect for a parallel CPU version? (Your conclusions here may vary between the labs.)
(QUESTION: How can you use shared memory to optimize? Did your multiple block solution change?)
(QUESTION: How did your execution time change when using shared memory?)
That is all for lab 5. Write down answers to all questions and then show your results to us.
The sorting competition (non-mandatory):
If you want to compete in the course challenge, this is the lab to work from for the GPU side. There are more improvements and fine tuning to add after what we demand for the lab.
News from Nicolas: If you want to compete only on the GPU side, that is OK, you do not have to submit a "dummy" CPU version, but then you can't win the total.
GPU submissions should be sent to Ingemar.