
Writing lightning fast code with CUDA.



Before we dive into writing our first lightning fast application, we should cover some fundamental terminology. Additionally, you can find the CUDA installation guide and prerequisites here.
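If you want to confirm that the toolkit and driver are set up correctly before continuing, two quick checks (assuming a standard installation) are the following commands: the first prints the CUDA compiler version, the second lists the visible GPUs and the driver version.

nvcc --version
nvidia-smi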

Parallel Computing: A type of computing architecture in which many calculations or processes are executed concurrently.

GPU: Within the field of parallel computing, we refer to our GPUs as devices.

CPU: Within the field of parallel computing, we refer to our CPU as the host.

CUDA: A framework and API developed by NVIDIA that lets us build parallel applications by executing our code on an NVIDIA GPU.

Thread: A sequence of instructions executed on the device with a given index. CUDA threads are scheduled and executed in groups of 32, called warps.

Block: A block is a collection of threads.

Grid: A grid is a collection of blocks.

Kernels: Functions launched by the host and executed on the device.
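To make the thread/block/grid hierarchy concrete, here is a minimal, self-contained sketch. The kernel name whoAmI and the tiny grid sizes are purely illustrative and are not part of the program we build below.

#include <cstdio>

// A kernel: a function launched by the host and executed on the device.
__global__ void whoAmI()
{
  // Each thread can look up its own position within the grid.
  int globalIndex = blockIdx.x * blockDim.x + threadIdx.x;
  printf("block %d, thread %d, global index %d\n",
         blockIdx.x, threadIdx.x, globalIndex);
}

int main()
{
  // Launch a grid of 2 blocks, each containing 4 threads (8 threads in total).
  whoAmI<<<2, 4>>>();
  cudaDeviceSynchronize();
  return 0;
}

Compiled with nvcc, this prints one line per thread and shows how the block and thread indices combine into a global index.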


Let’s burn some GPUs!

All examples were run on an NVIDIA Tesla V100 GPU.

Hello World!

Let's start by writing a simple C++ program that adds together two arrays, each with a million elements.

#include <iostream>
#include <math.h>
#include <chrono>

void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20; // 1M elements

  float *x = new float[N];
  float *y = new float[N];

  // Initialize x and y on the host.
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Time the add function.
  auto t1 = std::chrono::high_resolution_clock::now();
  add(N, x, y);
  auto t2 = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();

  // Check for errors (every element of y should now be 3.0f).
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;
  std::cout << duration << std::endl;

  // Free memory.
  delete [] x;
  delete [] y;

  return 0;
}

This might seem like a lot at first, but let’s break it down together.

void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

All we are doing here is defining a function that adds the elements of two arrays, x and y.

int main(void)
{
  int N = 1<<20; // 1M elements

  float *x = new float[N];
  float *y = new float[N];

Here we begin our main function by defining an integer N equal to 1<<20 (roughly one million) and allocating two float arrays of that size on the host.

for (int i = 0; i < N; i++) {
  x[i] = 1.0f;
  y[i] = 2.0f;
}

This for loop initializes the x and y arrays on the host.

auto t1 = std::chrono::high_resolution_clock::now();
add(N, x, y);
auto t2 = std::chrono::high_resolution_clock::now();

Here we simply call our add function, using chrono to measure how long it takes to execute.

float maxError = 0.0f;
for (int i = 0; i < N; i++)
  maxError = fmax(maxError, fabs(y[i]-3.0f));
std::cout << "Max error: " << maxError << std::endl;
std::cout << duration << std::endl;

This checks the result for errors (every element of y should equal 3.0f) and prints the time it took to execute our function.

delete [] x;
delete [] y;

And finally we free the memory once we are done with it.

Compiling

Now that our 'Hello World' program is finished, we'll need to compile it. We can do this with a variety of tools depending on the OS. I am running this tutorial on Ubuntu and will be using the g++ utility. To compile your new C++ program, simply run:

g++ helloWorld.cpp -o helloWorld
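Note that g++ does not enable optimizations by default. If you want a faster CPU baseline you can turn them on, for example:

g++ -O3 helloWorld.cpp -o helloWorld

Whichever flags you pick, keep them consistent when comparing timings.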

To run our program, we execute the file as we would any other executable, by running:

./helloWorld

After running your helloWorld program you should see two outputs: the first, "Max error: 0", indicates that the result is correct, and the second is the execution time of your function, in my case 3860 milliseconds.

We gotta go faster!

While 3860 milliseconds is not the slowest thing on earth, we can go much faster. This is where we start using CUDA to execute our code on the device instead of the host.

__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

We do this by first adding a __global__ specifier to our add function. The specifier tells our CUDA compiler that this function should run on our device.

// Allocate unified memory, accessible from both host and device.
float *x, *y;
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&y, N*sizeof(float));

...

// Free memory
cudaFree(x);
cudaFree(y);

We also need to allocate our data in unified memory, which we do by calling cudaMallocManaged; this replaces the original calls to new. Likewise, we replace our delete[] calls with cudaFree, which frees the memory once we are done using it.
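It is also good practice to check the error codes that CUDA API calls return. A minimal sketch (this check is not part of the original snippet) could look like:

cudaError_t err = cudaMallocManaged(&x, N*sizeof(float));
if (err != cudaSuccess) {
  std::cerr << "cudaMallocManaged failed: " << cudaGetErrorString(err) << std::endl;
  return 1;
}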

add<<<1, 1>>>(N, x, y);

This launch configuration runs our add kernel on the device with a single block containing a single thread.

cudaDeviceSynchronize();

Finally, we need to add a call to the cudaDeviceSynchronize function. Kernel launches are asynchronous, so this tells the CPU to wait until the GPU has finished before accessing the results.
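Putting all of these changes together, the full CUDA version of the program looks roughly like this (the chrono timing code is dropped here, since we will time the kernel with nvprof below):

#include <iostream>
#include <math.h>

// Kernel function to add the elements of two arrays, executed on the device.
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20; // 1M elements

  // Allocate unified memory, accessible from both host and device.
  float *x, *y;
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // Initialize x and y on the host.
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Launch the kernel on the device: one block, one thread.
  add<<<1, 1>>>(N, x, y);

  // Wait for the GPU to finish before accessing the results on the host.
  cudaDeviceSynchronize();

  // Check for errors (every element of y should now be 3.0f).
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free unified memory.
  cudaFree(x);
  cudaFree(y);

  return 0;
}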

Compile and test!

nvcc helloWorld.cu -o helloWorld

In order to compile a CUDA program, we will need to save our file with a .cu extension and compile the program with nvcc, the CUDA compiler utility.

nvprof ./helloWorld

CUDA also provides us with a neat little utility called nvprof, which lets us profile our kernel's execution time.
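On recent CUDA toolkits nvprof has been deprecated in favour of Nsight Systems; if nvprof is unavailable on your setup, a roughly equivalent command is:

nsys profile --stats=true ./helloWorld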

You should get an output as follows.

[Figure: single GPU thread output]

We can see from the above output that our add function took 58.018 milliseconds! That is quite the improvement from our original 3860 milliseconds.

Parallelism

So far we have run our add function with a single thread in a single block on the device. Let's add more threads and use parallelism to speed up the add function even more.

add<<<1, 256>>>(N, x, y);

First, we update our kernel launch by changing the second parameter of the execution configuration from 1 to 256; this parameter defines the number of threads in a thread block. Since CUDA GPUs schedule threads in warps of 32, it is good practice to pick a multiple of 32.

__global__
void add(int n, float *x, float *y)
{
  int index = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

Additionally, we need the kernel to loop through the array with parallel threads. threadIdx.x contains the index of the current thread within its block and blockDim.x contains the number of threads in the block, so each thread starts at its own index and jumps ahead by the block size (here, every 256th element).

The above loop is called a stride loop (a grid-stride loop once we use multiple blocks), and you can learn more about it here.

Let’s compile our file and run it again!

[Figure: output with thread parallelism]

Another impressive time reduction from 58.018 ms to 3.166 ms.

More Blocks!

To fully utilize the GPU, which has many streaming multiprocessors, we need to launch the kernel with multiple thread blocks.

int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y);

We set blockSize to 256 and compute the number of blocks needed to cover all N elements, rounding up so that we launch at least N threads; with N = 1<<20 and a block size of 256 this gives 4096 blocks. We then pass both values into the kernel's execution configuration.

__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

Additionally, we need to update our kernel so that the stride covers the entire grid of thread blocks. Let's break this down in detail.

gridDim.x: Contains the number of blocks in the grid.

blockIdx.x: Contains the index of the current thread block within the grid.

blockIdx.x * blockDim.x + threadIdx.x: This gives each thread a unique global index by taking its block's offset into the grid and adding the thread's index within the block. For example, with 256 threads per block, thread 3 of block 2 gets index 2 * 256 + 3 = 515.


When we compile and run our code again we now get the following results!

[Figure: output with parallelism and more thread blocks]

Only a slight performance increase this time. The reason the improvement is so minor is that the cost of migrating data between the CPU and GPU is included in the kernel's run time.

How can we fix this?

Unified Memory Prefetching

Unified Memory provides a single address space that is accessible from any processor in the system. This lets the GPU access any page of that memory and migrate the data into its own memory for much higher throughput. You can find more information on Unified Memory here.

Prefetching moves the data into the device's memory before the kernel runs, saving us the on-demand migration overhead.

// Find out which device we are running on, then prefetch x and y to it.
int device = -1;
cudaGetDevice(&device);
cudaMemPrefetchAsync(x, N*sizeof(float), device, NULL);
cudaMemPrefetchAsync(y, N*sizeof(float), device, NULL);

We enable memory prefetching by adding the above code just before the kernel launch.
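As a side note (not part of the original snippet), the same call can also prefetch data back to the host: passing the built-in cudaCpuDeviceId constant as the destination device moves y back to CPU memory after the kernel finishes, which avoids page faults when the host runs the verification loop. A minimal sketch:

cudaDeviceSynchronize();
// Prefetch the results back to the CPU before the host reads them.
cudaMemPrefetchAsync(y, N*sizeof(float), cudaCpuDeviceId, NULL);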

[Figure: output with parallelism and memory prefetching]

Woah! 16.735 microseconds! That is certainly quite the improvement from the previous 2.96 milliseconds. Let’s finish by reviewing our results throughout this tutorial!

[Figure: summary of results]

I hope you found this introductory post on CUDA programming insightful, and I would encourage you to start playing around with different arithmetic functions and seeing how fast you can calculate them!

In future posts we will be covering how to run the above on multiple GPUs and even dive into solving fluid simulations with CUDA — so feel free to follow me!

And as always — let me know if you have any questions! If you followed along with this tutorial, post your add function execution times into the comments below!

