The real "Hello World!" for CUDA, OpenCL and GLSL!

by Ingemar Ragnemalm

 

Background

When I learned CUDA, I found that just about every tutorial and course starts with something called "Hello World". But usually it is not a "Hello World" program at all! What they mean by "Hello World" is any kind of simple example. That is obviously ignorant and careless. Their examples are often (but not always) simple and enlightening, but mislabeled.

Let's get one thing straight: "Hello world!" should produce a string identical to or similar to "Hello World!", and that is all it should do! It should not print several strings, and it should not do other, irrelevant things!

32 times Hello

For quite some time, the only program I had found that claims to be "Hello World" for CUDA and is even slightly related to what it claims to be was a program by Karen Hains, found at this web page. Unlike most other "hello CUDA" programs, it does print the string "Hello World"... 32 times! It also reports block and thread numbers and times the computation. The "simple Hello World kernel" is 90 lines of code, comments and blank lines disregarded, and not counting the host program. Simple?

So what should it be then?

The problem with "Hello World!" for CUDA is simply this: You can't just printf("Hello World!\n"), because then you are not running any CUDA at all! It would just be a C example! "Hello World!" for CUDA must do something in parallel, with a kernel running on the GPU!

Better attempts

More recently, two much better attempts showed up at the NVIDIA forum.

Byron Galbraith mangles the "Hello World!" string, and unmangles it in CUDA.

Another example (anonymous, the poster refers to "one of my students") assigns characters in parallel from a string constant.

These are a lot better! They are simple, they do something in parallel, and they produce the string they should. Good! I only have minor objections: Galbraith's version produces the same output if you just erase the kernel and the mangling loop, since the output string exists on the host from the start. The other one is only data copying; nothing is computed (but it is wonderfully simple).

Hello World! for CUDA - the real thing!

Here is my version: I take the string "Hello " and send it, together with the array 15, 10, 6, 0, -11, 1, to a kernel. The kernel adds each array element to the corresponding character of the string, which produces the string "World!". This string is passed back to the host and printed out.

Simple, parallel, relevant, and the output is Hello World!
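The offsets are nothing magic: each one is just the difference between the character codes of "World!" and "Hello " at the same position. As a rough host-only sketch in plain C (not part of the original program; the helper names compute_offsets and apply_offsets are made up for illustration, and ASCII character codes are assumed), the arithmetic could be checked like this:

```c
#include <stddef.h>

/* Hypothetical helper: offsets[i] is the value that must be added to
   from[i] to obtain to[i]. */
void compute_offsets(const char *from, const char *to, int *offsets, size_t n)
{
    for (size_t i = 0; i < n; i++)
        offsets[i] = to[i] - from[i];
}

/* Hypothetical helper: sequential stand-in for the kernel. Loop iteration i
   does the work that thread i does on the GPU. */
void apply_offsets(char *s, const int *offsets, size_t n)
{
    for (size_t i = 0; i < n; i++)
        s[i] = (char)(s[i] + offsets[i]);
}
```

Calling compute_offsets("Hello ", "World!", off, 6) fills off with 15, 10, 6, 0, -11, 1, and apply_offsets on a buffer holding "Hello " then turns it into "World!". For instance, 'H' is 72 and 'W' is 87, so the first offset is 15. This is only a CPU check of the numbers; the point of the real program is that the additions run in parallel on the GPU.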

Here follows the CUDA code. At 31 lines of code (50 with comments and blank lines), with a single-line kernel, this is simple and relevant, and can be called a real "Hello World!".


// This is the REAL "hello world" for CUDA!
// It takes the string "Hello ", prints it, then passes it to CUDA with an array
// of offsets. Then the offsets are added in parallel to produce the string "World!"
// By Ingemar Ragnemalm 2010

#include <stdio.h>
#include <stdlib.h>   // EXIT_SUCCESS

const int N = 16;
const int blocksize = 16;

// Each thread adds one offset to one character.
__global__
void hello(char *a, int *b)
{
    a[threadIdx.x] += b[threadIdx.x];
}

int main()
{
    // "Hello " padded with zeros; the zero offsets leave the padding untouched,
    // so the result stays null-terminated.
    char a[N] = "Hello \0\0\0\0\0\0";
    int b[N] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

    char *ad;
    int *bd;
    const int csize = N*sizeof(char);
    const int isize = N*sizeof(int);

    printf("%s", a);

    // Copy the string and the offsets to the GPU.
    cudaMalloc( (void**)&ad, csize );
    cudaMalloc( (void**)&bd, isize );
    cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice );
    cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice );

    // One block of 16 threads, one thread per character.
    dim3 dimBlock( blocksize, 1 );
    dim3 dimGrid( 1, 1 );
    hello<<<dimGrid, dimBlock>>>(ad, bd);
    // Copy the result back (this also waits for the kernel to finish).
    cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost );
    cudaFree( ad );
    cudaFree( bd );

    printf("%s\n", a);
    return EXIT_SUCCESS;
}


hello-world.cu

You may use this code for any purpose, as long as any derivative cites its source and documents its changes. You may upload it to your own web page provided that you cite me as author.

If you do something interesting related to this code (including improvements), please let me know. My E-mail address is ingis at the subdomain isy, domain liu, top domain se (Sweden).

Ingemar Ragnemalm, programmer and CUDA teacher

 

PS: I have also produced Hello World! for OpenCL (updated 2013 to support some changes in newer CL versions):

hello_world_cl.c

and Hello World! for GLSL fragment shaders (updated 2013 supporting the negative offset properly):

hello-gpgpu.c

These follow the same pattern as above: they add an offset array to the string "Hello ".

PS: These examples are tested and fully working on my machine. However, incompatibilities may surface even for simple examples like these. I will update whenever I find such issues, but any help you can give me is appreciated.