Getting started with CUDA (using VS2017)

If you are interested in performance, you need to know more about CUDA.

CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs). With CUDA, developers are able to dramatically speed up computing applications by harnessing the power of GPUs.
In GPU-accelerated applications, the sequential part of the workload runs on the CPU – which is optimized for single-threaded performance – while the compute intensive portion of the application runs on thousands of GPU cores in parallel. When using CUDA, developers program in popular languages such as C, C++, Fortran, Python and MATLAB and express parallelism through extensions in the form of a few basic keywords.

In this post, I will guide you through your first steps with CUDA. Also, I will show you how to move some basic processing from the CPU to the GPU.

Let’s start getting some bytes.

Environment setup

I am using Visual Studio 2017 (version 15.6.0). To code with CUDA, you will need to download and install the CUDA Toolkit. The Toolkit includes Visual Studio project templates and the NSight IDE (which it can use from Visual Studio).

Also, you will need to install the VC++ 2017 toolset (CUDA is still not compatible with the latest version of Visual Studio).

Starting a new project using CUDA

The easiest way to start a project that uses CUDA is utilizing the CUDA template.

To be able to compile this, you will need to change the Project Properties to use the Visual Studio 2015 toolset.

I recommend you to clean the template’s boilerplate changing the content of the file kernel.cu to this:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

Basic CPU processing

Let’s add two arrays (a and b) putting the results in a third array (c).

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

void add_arrays_cpu(int* a, int* b, int* c, const int count)
{
	for (auto i = 0; i < count; i++)
	{
		c[i] = a[i] + b[i];
	}
}

int main()
{
	const auto count = 5;
	int a[] = { 1, 2, 3, 4, 5 };
	int b[] = { 10, 20, 30, 40, 60 };
	int c[count];

	add_arrays_cpu(a, b, c, count);

	for (auto i = 0; i < count; i ++)
	{
		printf("%d ", c[i]);
	}

	getchar();
	return 0;
}

This code is pretty simple. But it is tough to parallelize.

Whenever we want to use parallelism, we need to create functions that could be executed independently. The function add_arrays_cpu has a for-loop that runs the add process in a sequential fashion. But, there is no reason to work in this way. Let’s change it:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

void add_array_element_cpu(int* a, int * b, int* c, const int index)
{
	c[index] = a[index] + b[index];
}

int main()
{
	const auto count = 5;
	int a[] = { 1, 2, 3, 4, 5 };
	int b[] = { 10, 20, 30, 40, 60 };
	int c[count];

	add_arrays_cpu(a, b, c, count);

	for (auto i = 0; i < count; i++)
	{
		add_array_element_cpu(a, b, c, i);
	}

	for (auto i = 0; i < count; i ++)
	{
		printf("%d ", c[i]);
	}

	getchar();

	return 0;
}

The function add_array_element_cpu can be executed independently, and that is great. We could start a thread for each position of the array, and that would work fine (I am not saying that would be the right thing to do with CPUs, just as an example).

Bringing CUDA to the game

The next logical step is to start using the GPU to run our code. We will not move all the functions to the GPU, but only the functions we think we could be parallelized.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

__global__ void add_arrays_gpu(int* a, int *b, int* c)
{
	c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

int main()
{
	const auto count = 5;
	int host_a[] = { 1, 2, 3, 4, 5 };
	int host_b[] = { 10, 20, 30, 40, 60 };
	int host_c[count];

	int *device_a, *device_b, *device_c;
	const int size = count * sizeof(int);
	cudaMalloc(&device_a, size);
	cudaMalloc(&device_b, size);
	cudaMalloc(&device_c, size);

	cudaMemcpy(
		device_a, host_a,
		size,
		cudaMemcpyHostToDevice
	);

	cudaMemcpy(
		device_b, host_b,
		size,
		cudaMemcpyHostToDevice
	);

	add_arrays_gpu <<<1, count >>> (device_a, device_b, device_c);

	cudaMemcpy(
		host_c, device_c,
		size,
		cudaMemcpyDeviceToHost
	);


	for (auto i = 0; i < count; i ++)
	{
		printf("%d ", host_c[i]);
	}

	getchar();

	return 0;
}

We marked the add_arrays_gpu function as __global__. Using CUDA terminology, this function is a kernel, and that will be executed from the GPU. It runs on the GPU, called from the CPU (there is another qualifier location qualifier, __device__ that should be used with functions that run on GPU, called from the GPU).

__global__ functions are invoked using the special <<<…>>> syntax. The parameters we used indicates that we are running the kernel function in 1 block of count (5) threads. We will talk more about it in the future.

Note that we don’t need to inform an index anymore – we retrieve the index position from threadIdx.x. threadIdx is a special variable, provided by CUDA runtime, that informs the position of the current thread in the thread block (we are using five threads, one thread block to run our code. Because of that, we were able to use this information as the index in the array).

I would like to make a special consideration about an vital pattern that you will see a lot of times when doing CUDA programming. To run, GPU code is not allowed to access CPU memory (and vice-versa). So we will need:

copy the data we need to process from the CPU to the GPU memory
do the processing
copy results from the GPU to the CPU memory.

Final words

I know, I know! Adding two arrays with only five elements each is not an exciting example. But, in this post, we did the setup to use CUDA on our machines. Then, I helped you to move your could from sequential running on CPU to parallel on GPU. In the future, let’s return to it and get some real benefits.

Cover image: Jean Gerber

Compartilhe este insight: