Important! Work in progress!
This lab is brand new as of 2014, and is uploaded as a "beta". Last minute changes during the tuesday are likely.
In this lab, we will focus on bitonic merge sort, which is data independent, not quite as fast as QuickSort but still quite fast.
As a base, you get a working CPU implementation of bitonic merge sort, in C/C++:bitonic_kernel.cpp
This code is not even optimal on the CPU (single threaded), and not in the least parallel, but not hard to make parallel.
What you need to do is to rewrite this program (bitonic_kernel to be precise) to a CUDA program. Your solution should work for at least 100000 elements and for any kind of input. For simplicity, you only need to consider power-of-two sizes! If you like, you can work after this schedule:
Step 1: Turn it into a CUDA program. You can do this single-threaded to begin with.Step 2: Make it work multi-threaded in one block. Change bitonic_main to larger input.
Step 3: Make it work in multiple blocks.
Step 4: Measure speedup for varying size. When does it break even (with single thread CPU)?
QUESTION: Should each thread produce one output or two? Why?
QUESTION: How many items can you handle in one block?
QUESTION: What problem must be solved when you use more than one block? How did you solve it?
QUESTION: What time do you get? Difference to the CPU? What
is the break even size? What can you expect for a parallel CPU version?
(Your conclusions here may vary between the labs.)
That is all for lab 5. Write down answers to all questions and then show your results to us.