

# Lecture 12

### Even more CUDA memory

Histogram

Sorting on GPU

Other problems







# **Reduction - many to few**

Problems that are tricky to run in parallel.

Acceleration can be limited or nonexistant for small datasets.

Find miniumum, maximum, average...





# **Divergent branching =**

# "if" statements:

### Branches can be bad in GPU code!

Why?



# **Divergent branching in SIMD:**

All branches execute *all code*! Data masked with result of "if".

Warp-level problem!

Can not be avoided within warps if a single thread gets a different result from others. Can be avoided if all threads in warp take same branch



Information Coding / Computer Graphics, ISY, LiTH

### **Divergent warp**

if X then 10010110

and with 10010110

else

and with 01101001

endif

# **Non-divergent warp**

if X then 11111111 else endif







### Lecture questions

1) In what way does bitonic merge sort fit the GPU better than many other sorting algorithms?

2) Bitonic merge sort is Nlog<sup>2</sup>N and QuickSort is NlogN. Why they will still have similar complexity on the GPU?

3) What is the reason to use pinned memory?

4) What problem does atomics solve?



# More memory

Managed memory

**Atomics** 

**Pinned memory** 



Information Coding / Computer Graphics, ISY, LiTH

### Managed memory

Makes read/write memory as easy as constant!

New, simpler Hello World!

```
#include <stdio.h>
                                                         int main()
                                                            printf("%s", a);
const int N = 16;
                                                            dim3 dimBlock( blocksize, 1 );
const int blocksize = 16;
                                                            dim3 dimGrid(1, 1);
                                                            hello<<<dimGrid, dimBlock>>>(a, b);
__global__
void hello(char *a, int *b)
   a[threadIdx.x] += b[threadIdx.x];
                                                            printf("%s\n", a);
                                                            return EXIT_SUCCESS;
__managed__ char a[N] = "Hello 0 0 0 0";
```

cudaDeviceSynchronize(); // Synchronize



### Managed memory

Managed memory must be declared \_\_\_\_managed\_\_\_\_

Memory accessible both from CPU and GPU. Risk for racing!

Copy to GPU or copy to \_\_\_\_\_managed\_\_\_\_, same thing.

Do not expect performance penalty (but always be ready for surprises).





# **Atomic operations**

A special memory access method, for avoiding conflicts and race conditions.

Available in CUDA from Compute model 1.1 (which means everywhere).

Specify compute model with

-arch compute\_11

but you probably don't have to. (I didn't.)



Information Coding / Computer Graphics, ISY, LiTH

# **Example: Histogram**

Simple method for gathering statistics about a set of data. Much data in, little out.

Common in image processing.



for all elements i in a[] h[a[i]] += 1





# **Histogram in parallel**

Each thread reads a[threadIdx]

and perform h[a[threadIdx]] = h[a[threadIdx]] + 1

...not





# **Solution: Atomics**

Read - modify - write in one operation

Guaranteed not to be subject to racing

atomicAdd, atomicSub, atomicExch, atomicMin, atomicMax, atomicInc, atomicDec, atomicCAS, atomicAND, atomicOR, atomicXor

More types in Fermi and up

Supported for both global and shared memory.



# But it comes for a cost!

Slower than other operations

Simpler but slower than reduction solutions!





# How would I do histograms?

Atomics are fairly OK... 256 lanes of parallelism.

Split to parts for separate blocks.

Produce one histogram for each part.

Merge result, possibly in several reduction steps.





# **Example: Find maximum**

for all elements i in a[] maxValue = max(maxValue, a[i])

Easy? Yes! Parallel? No!

All threads will write to the same memory element!

Use atomics? Very slow! All write at the same time, must wait -> sequential performance!

Solution: Use reduction instead!



# **Atomic conclusions**

Simplifies some operations

Serializes conflicting operations

Can hurt performance! Don't overuse!



# More exotic optimizations and tools

**Pinned memory** 

Multiple streams

Not where you start but let's not ignore the options.



# **Pinned memory**

Can boost performance for memory transfer

Page-locked memory

So far: malloc() and cudaMalloc()

New call: cudaHostAlloc()

Allocated page-locked memory! Fixed physical location!





Information Coding / Computer Graphics, ISY, LiTH

# **Pinned memory**

Page-locked memory is a limited resource!

For non-pinned memory, CUDA copies it internally to page-locked memory, then DMA to GPU. Transfer time goes up!









### **Pinned memory, streams,** overlapping computation

Pinned memory is part of an optimization approach with overlapping computations

No longer just a slight speedup of data transfer!

cudaMemCpyAsynch() can copy locked memory asynchronously!



# **Multiple streams**

CUDA commands are placed in a queue, a stream!

These are the same queues as you can post CUDA events to.

We usually only use the default CUDA stream.

Multiple CUDA streams can be used to overlap work especially computing and data transfers!



### Single stream computation

The kernel can not run until the data is transferred.

For this example, 2/3 data transfer, 1/3 computation

Run kernel

Run kernel





### **Dual stream computation**

While one stream runs a kernel, the other stream performs data copying,

More time for computing, in this example kernels are running 1/2 of the time instead of 1/3.

| Copy data to GPU   |   |
|--------------------|---|
| Run kernel         | ( |
| Copy result to CPU | F |
| Copy data to GPU   | - |
| Run kernel         | ( |
| -                  | ( |
| Copy result to CPU | F |
|                    | - |





### Not all devices...

Asynchronous data copying as well as concurrent execution is not guaranteed...

so make a device query!

CU\_DEVICE\_ATTRIBUTE\_ASYNCH\_ENGINE\_COUNT: Can we copy memory asynch?

CU\_DEVICE\_ATTRIBUTE\_CONCURRENT\_KERNELS: Can we run multiple kernels?





# **Debugging CUDA**

Let's get a bit more efficient when your code doesn't work

- Catch error codes
- printf() from kernels
  - cudagdb





### Catch those error codes

```
// Check for errors everywhere
err = cudaMalloc( (void**)&ad, csize );
// If the GPU won't even take our data we are toasted
if (err) printf("cudaMalloc %d %s\n", err, cudaGetErrorString(err));
...
dim3 dimBlock( blocksize, 1 );
dim3 dimGrid( 1, 1 );
hello<<<dimGrid, dimBlock>>>(ad, bd);
// Most important thing to check? Did the kernel run at all?
err = cudaPeekAtLastError();
if (err) printf("cudaPeekAtLastError %d %s\n", err, cudaGetErrorString(err));
```

and pass them to cudaGetErrorString() for an explanation





# printf() from kernels

### Yes - printf() if legal in a kernel since Compute Capability 2.0

But don't try to print 100000 messages per second...



# More advanced debugger tools

There are more tools to help you out there!

cuda-gdb

Variant of the GDB debugger

Allows breakpoints and single-stepping CUDA kernels!

