Laboration 6: Introduction to OpenCL

NEW LAB. THERE MAY BE BUGS IN THE LAB MATERIAL. PLEASE LET US KNOW IF YOU FIND BUGS.

This is our brand new lab on OpenCL. It partially builds on an older lab, designed by Jens, but is quite a bit scaled down and simplified.

Originally we intended to include computing with GLSL here, too, but for those of you who never used GLSL and OpenGL before, that would be too time consuming, would endanger your deadline. We choose to go somewhat deeper in OpenCL instead.

Some vital notes on OpenCL

CUDA's threadsynchronize() is now called barrier(), with some related calls in the mem_fence() family. You also have to specify what memory access you synchronize, CLK_LOCAL_MEM_FENCE, CLK_GLOBAL_MEM_FENCE, or both (with the binary or operator "|").

CUDA's shared memory is now called local, declared __local.

Your source code is not part of the main program but must be loaded as a text buffer.

1. Getting started: Hello, World!

Here is our own Hello World! for OpenCL:

hello_world_cl.c

Like all my parallel Hello World! examples, it uses the string "Hello " plus an array of offsets to produce "World!", thus making a tiny parallel computation and still does what it should, produces the string "Hello World!". This is, however, quite a bit longer and harder to read than the CUDA example.

Compile with

gcc hello_world_cl.c -lOpenCL -o hello_world_cl

(For MacOSX, the library should be replaced by -framework OpenCL, as usual.)

Run with

./hello_world_cl

You are not supposed to edit the code (although you are welcome to experiment, of course). But knowing what the code does, you should be able to figure out where some interesting things take place. First of all, have a look at the input data and the kernel, and then at the main program.

Question: How is the communication between the host and the graphic card handled?

Question:
What function executes your kernel?

Question: How does the kernel know what element to work on?

2. Rank sorting on the GPU

Rank sorting is inefficient in sequential code but easy to parallelize. It is a very simple algorithm where you, for all elements, simply count the number of elements that are smaller, thereby creating a rank, an index, and then you move the element to that index.

2.1 Bad approach, do it better

The following code implements a rank sorting algorithm on the CPU as well as the GPU. An image is created by the program, which just is a gradient since it holds sorted numbers. The left image is the data from the CPU, and the right from the GPU. The two should be indentical. However, the GPU output data is incorrect. Why? How can you do it right?

Hint: It isn't any minor bug, but rather a flaw in the general approach.

The main program and the kernel are in separate files, as they should be. There is also a small utility file.

sort.c
sort.cl
CLutilities.c
CLutilities.h

Compile with

gcc sort.c CLutilities.c -lOpenCL -lGL -lglut -o sort

and run with

./sort

Question: What was the problem, and the fix?

2.2 Improve performance with local memory

Your new kernel should work perfectly, but it isn't very fast. The next step is to accelerate it using local memory (what we called shared in CUDA). Local memory is declared in the kernel as:

__local float myBuffer[512];

However, the speedup may not be significant on small data sets since we measure with data transfer time. Try increasing the data size by changing the dataWidth and dataHeight constants at the top of the program. A good solution should handle data size above 64x64 gracefully.

Question: How did you take advantage of local memory?

Question: What speedup did you get, and why?

Note: Constant memory fits this problem well, and trying it is highly recommended.

3. Extras

After much consideration, I decided that the parts above are enough for a good introduction to OpenCL, so this last part is non-mandatory for those of you who find part 1 and 2 easy and want to try some more. (At this time the following is a draft and not properly tested.)

If you don't have time or energy for these suggestions, just skip to the final questions at the end.

Proposed exercise a: Wavelet transform

Again: Not mandatory! But lab groups have tried it and liked it.

The wavelet transform is a famous operation that is used in image compression algorithms. In its simplest form, from an image it calculates four images of 1/4 the size, each pixel being produced by a combination of four source pixels, different for each of the four resulting images. These operations are:

out = (in1 + in2 + in3 + in4)/4
out = (in1 + in2 - in3 - in4)/4
out = (in1 - in2 + in3 - in4)/4
out = (in1 - in2 - in3 + in4)/4

Notice that the original image can be reconstructed perfectly from this information.

For the famous Lenna image, the result is like this:



Wavelet transformed Lenna

Essentially, what you see is one low-res image, two images describing edges and one describing extreme points. As you can see, there is less information in the latter images, which means that they are easy to compress.

Implementing this transform in OpenCL is perfectly doable. Would you use shared/local memory in such an implementation?

Proposed exercise b: Bitonic Merge Sort

One more time: Not mandatory!

The sorting algorithm in part 1, rank sorting, is obviously not the best one possible. One algorithm that is particularly nice for parallel implementation is Bitonic Merge Sort (which I briefly mentioned on a lecture).

Here is a CPU implementation in C:

void cpu_Sort_BMS(unsigned int *data, unsigned int length)
{
  unsigned int i,j,k,ixj,tmp;

  /* bitonic merge sort */
  for (k=2;k<=length;k=2*k)
  {
    for (j=k>>1;j>0;j=j>>1)
    {
      for (i=0;i<length;i++)
      {
        ixj=i^j;
        if ((ixj)>i)
        {
          if ((i&k)==0 && data[i]>data[ixj])
          {
            tmp=data[i];
            data[i]=data[ixj];
            data[ixj]=tmp;
          }
          if ((i&k)!=0 && data[i]<data[ixj])
          {
            tmp=data[i];
            data[i]=data[ixj];
            data[ixj]=tmp;
          }
        }
      }
    }
  }
}


Implement this in parallel on the GPU! (Pretty challenging, multi-pass algorithm.) Compare performance to the rank sorting above.

Proposed exercise c: (No more yet)

(This is a placeholder for future expansion.)



That's all for this lab. Just a few more questions:

Evaluation question: Was this lab too easy, about right or even too hard? If it was hard, why?

Evaluation question:
Do you think we have a reasonable mix of CUDA and OpenCL?