progress! This lab was brand new as of 2014. There will be further
changes for 2015. They are included below. There may be last minute changes but this is fairly complete now.
QUESTION: What timing did you get for your GPU reduction? Compare it to the CPU version.
QUESTION: Try larger data size. On what size does the GPU version get faster, or at least comparable, to the GPU?
QUESTION: How can you optimize this further? You should know at least one way.
In this lab, we will focus on bitonic merge sort, which is data independent, not quite as fast as QuickSort but still quite fast.
As a base, you get a working CPU implementation of bitonic merge sort, in C/C++:bitonic_kernel.cpp
This code is not even optimal on the CPU (single threaded), and not in the least parallel, but not hard to make parallel.
What you need to do is to rewrite this program (bitonic_kernel to be
precise) to a CUDA program. Your solution should work for at least
100000 elements and for any kind of input. For simplicity, you only need to consider power-of-two
sizes! If you like, you can work after this schedule:
Step 1: Turn it into a CUDA program. You can do this single-threaded to begin with.Step 2: Make it work multi-threaded in one block. Change bitonic_main to larger input.
Step 3: Make it work in multiple blocks.
Step 4: Measure speedup for varying size. When does it break even (with single thread CPU)?
STEP 5-6 ARE NOT SATISFACORY YET, THUS NON-MANDATORY. SORRY FOR THE INCONVENIENCE.
Step 5: Use shared memory for optimizing. You should work along these lines:Consider a small version of the algorithm, running in a single block.
- Have all threads read one item to shared memory.
- Sort in shared memory.
- Write back.
How can you use this for a larger algorithm? How about a part of it?
Step 6: Measure speedup again. How does it compare to step 4?
QUESTION: Should each thread produce one output or two? Why?
QUESTION: How many items can you handle in one block?
QUESTION: What problem must be solved when you use more than
one block? How did you solve it?
QUESTION: What time do you get? Difference to the CPU? What
is the break even size? What can you expect for a parallel CPU version?
(Your conclusions here may vary between the labs.)
(QUESTION: How did your execution time change when using shared memory?)
That is all for lab 5. Write down answers to all questions and then show your results to us.