
Writing Your First GPU Kernel in Python with Numba and CUDA

Image by Author | Ideogram

 

GPUs are great for tasks where you need to do the same operation across different pieces of data. This is known as the Single Instruction, Multiple Data (SIMD) approach. Unlike CPUs, which only have a few powerful cores, GPUs have thousands of smaller ones that can run these repetitive operations all at once. You will see this pattern a lot in machine learning, for example when adding or multiplying large vectors, because each calculation is independent. This is the ideal scenario for using GPUs to speed up tasks with parallelism.

NVIDIA created CUDA as a way for developers to write programs that run on the GPU instead of the CPU. It’s based on C and lets you write special functions called kernels that can run many operations at the same time. The problem is that writing CUDA in C or C++ isn’t exactly beginner-friendly. You have to deal with things like manual memory allocation, thread coordination, and understanding how the GPU works at a low level. This can be overwhelming, especially if you’re used to writing code in Python.

This is where Numba can help you. It lets you write CUDA kernels in Python, using the LLVM (Low Level Virtual Machine) compiler infrastructure to compile your Python code directly into CUDA-compatible kernels. With just-in-time (JIT) compilation, you annotate your function with a decorator and Numba handles everything else for you.
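To get a feel for the decorator workflow before touching the GPU, here is a minimal CPU-side sketch (the function name and logic are purely illustrative); the CUDA kernel later in this article follows the same pattern with @cuda.jit:

import numpy as np
from numba import njit

@njit
def scale(values, factor):
    # Compiled to machine code by LLVM the first time it is called
    out = np.empty_like(values)
    for i in range(values.shape[0]):
        out[i] = values[i] * factor
    return out

print(scale(np.arange(5, dtype=np.float32), 2.0))  # [0. 2. 4. 6. 8.]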

In this article, we will take a common example, vector addition, and convert simple CPU code into a CUDA kernel with Numba. Vector addition is an ideal example of parallelism, as the addition at each index is independent of every other index. This is the perfect SIMD scenario: all indices can be processed simultaneously, completing the vector addition in a single parallel pass.

 

Note that you will need a CUDA-capable GPU to follow this article. You can use Colab’s free T4 GPU or a local GPU with the NVIDIA CUDA Toolkit and NVCC installed.

 

Setting Up the Environment and Installing Numba

 
Numba is available as a Python package, and you can install it with pip. Moreover, we will use numpy for vector operations. Set up the Python environment using the following commands:

python3 -m venv venv
source venv/bin/activate
pip install numba-cuda numpy
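Before moving on, it is worth confirming that Numba can actually see your GPU. One quick check, run from a Python session inside the activated environment, looks like this:

from numba import cuda

# True if a CUDA-capable GPU and driver are visible to Numba
print(cuda.is_available())

# Prints a summary of the detected devices (e.g. a Tesla T4 on Colab)
cuda.detect()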

 

Vector Addition on the CPU

 
Let’s take a simple example of vector addition. For two given vectors, we add the corresponding values at each index to get the final value. We will use numpy to generate random float32 vectors and compute the output using a for loop.

import numpy as np 

N = 10_000_000 # 10 million elements 
a = np.random.rand(N).astype(np.float32) 
b = np.random.rand(N).astype(np.float32) 
c = np.zeros_like(a) # Output array 

def vector_add_cpu(a, b, c): 
    """Add two vectors on CPU""" 
    for i in range(len(a)): 
        c[i] = a[i] + b[i]

 

Here is a breakdown of the code:

  • Initialize two vectors each with 10 million random floating-point numbers
  • We also create an empty vector c to store the result
  • The vector_add_cpu function simply loops through each index and adds the elements from a and b, storing the result in c

This is a serial operation; each addition happens one after another. While this works fine, it’s not the most efficient approach, especially for large datasets. Since each addition is independent of the others, this is a perfect candidate for parallel execution on a GPU.
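The timing comparison at the end of this article calls a small time_cpu() wrapper, mirroring the time_gpu() wrapper defined in the next section, so define it alongside the CPU function:

def time_cpu():
    c_cpu = np.zeros_like(a)
    vector_add_cpu(a, b, c_cpu)
    return c_cpu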

In the next section, you will see how to convert this same operation to run on the GPU using Numba. By distributing each element-wise addition across thousands of GPU threads, we can complete the task significantly faster.

 

Vector Addition on the GPU with Numba

 
You will now use Numba to define a Python function that can run on CUDA, and execute it from Python. We are performing the same vector addition operation, but now it can run in parallel across the indices of the numpy arrays, leading to faster execution.

Here is the code for writing the kernel:

from numba import config

# Required for newer CUDA versions to enable linking tools.
# Prevents CUDA toolkit and NVCC version mismatches.
config.CUDA_ENABLE_PYNVJITLINK = 1

from numba import cuda

@cuda.jit
def vector_add_gpu(a, b, c):
    """Add two vectors using CUDA kernel"""
    # Thread ID in the current block
    tx = cuda.threadIdx.x
    # Block ID in the grid
    bx = cuda.blockIdx.x
    # Block width (number of threads per block)
    bw = cuda.blockDim.x

    # Calculate the unique thread position
    position = tx + bx * bw

    # Make sure we don't go out of bounds
    if position < len(a):
        c[position] = a[position] + b[position]

def gpu_add(a, b, c):
    # Define the grid and block dimensions
    threads_per_block = 256
    blocks_per_grid = (N + threads_per_block - 1) // threads_per_block

    # Copy data to the device
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_c = cuda.to_device(c)

    # Launch the kernel
    vector_add_gpu[blocks_per_grid, threads_per_block](d_a, d_b, d_c)

    # Copy the result back to the host
    d_c.copy_to_host(c)

def time_gpu():
    c_gpu = np.zeros_like(a)
    gpu_add(a, b, c_gpu)
    return c_gpu

 

Let’s break down what is happening above.

 

// Understanding the GPU Function

The @cuda.jit decorator tells Numba to treat the following function as a CUDA kernel: a special function that will run in parallel across many threads on the GPU. When the function is first called, Numba compiles it to CUDA-compatible device code through its LLVM-based toolchain, so you never have to write the equivalent C or C++ yourself.

@cuda.jit
def vector_add_gpu(a, b, c):
    ...

 

This function will run on thousands of threads at the same time. But we need a way to figure out which part of the data each thread should work on. That’s what the next few lines do:

  • tx is the thread’s ID within its block
  • bx is the block’s ID within the grid
  • bw is how many threads there are in a block

We combine these to calculate a unique position, which tells each thread which element of the arrays it should add. Because we round the number of blocks up, the grid usually launches slightly more threads than there are elements, so the highest thread positions can fall past the end of the vectors. Therefore, we add a guard condition to validate the index before performing the addition, which prevents any out-of-bounds access at runtime.

Once we know the unique position, we can add the values just like we did on the CPU. The following line matches the CPU implementation:

c[position] = a[position] + b[position]
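As a side note, Numba also provides cuda.grid(), a built-in helper that computes the same absolute thread index in one call. The kernel above could therefore be written a little more compactly (the name vector_add_gpu_compact is just for illustration):

@cuda.jit
def vector_add_gpu_compact(a, b, c):
    # cuda.grid(1) == cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    position = cuda.grid(1)
    if position < len(a):
        c[position] = a[position] + b[position]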

 

// Launching the Kernel

The gpu_add function sets things up:

  • It defines how many threads and blocks to use. You can experiment with different values of block and thread sizes, and print the corresponding values in the GPU kernel. This can help you understand how the underlying GPU indexing works (see the short sketch after this list).
  • It copies the arrays a, b, and c from host (CPU) memory to device (GPU) memory, so they are accessible on the GPU.
  • It runs the GPU kernel with vector_add_gpu[blocks_per_grid, threads_per_block].
  • Finally, it copies the result back from the GPU into the c array, so we can access the values on the CPU.
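To make the launch configuration concrete, here is a quick back-of-the-envelope check (plain Python, nothing GPU-specific) of what gets launched for the 10-million-element vectors used in this article:

N = 10_000_000
threads_per_block = 256
blocks_per_grid = (N + threads_per_block - 1) // threads_per_block

print(blocks_per_grid)                           # 39063 blocks
print(blocks_per_grid * threads_per_block)       # 10000128 threads launched in total
print(blocks_per_grid * threads_per_block - N)   # 128 extra threads stopped by the guard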

 

Comparing the Implementations and Potential Speedup

 
Now that we have both the CPU and GPU versions of vector addition, it’s time to see how they compare. It is important to verify that the results match and to measure the speedup we get from CUDA parallelism.

import timeit

c_cpu = time_cpu()
c_gpu = time_gpu()

print("Results match:", np.allclose(c_cpu, c_gpu))

cpu_time = timeit.timeit("time_cpu()", globals=globals(), number=3) / 3
print(f"CPU implementation: {cpu_time:.6f} seconds")

gpu_time = timeit.timeit("time_gpu()", globals=globals(), number=3) / 3
print(f"GPU implementation: {gpu_time:.6f} seconds")

speedup = cpu_time / gpu_time
print(f"GPU speedup: {speedup:.2f}x")

 

First, we run both implementations and check whether their results match. This is important to make sure the GPU code is working correctly: its output should be the same as the CPU’s.

Next, we use Python’s built-in timeit module to measure how long each version takes. We run each function a few times and take the average to get a reliable timing. Finally, we calculate how many times faster the GPU version is compared to the CPU. You should see a big difference because the GPU can do many operations at once, while the CPU handles them one at a time in a loop.

Here is the expected output on NVIDIA’s T4 GPU on Colab. Note that the exact speedup can differ based on CUDA versions and the underlying hardware.

Results match: True
CPU implementation: 4.033822 seconds
GPU implementation: 0.047736 seconds
GPU speedup: 84.50x

 

This simple test helps demonstrate the power of GPU acceleration and why it’s so useful for tasks involving large amounts of data and parallel work.
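One nuance worth keeping in mind: the gpu_add timing above includes the host-to-device and device-to-host copies, and the very first call also pays the JIT compilation cost (which is why both functions are run once before timeit). If you want a rough idea of the kernel time alone, a simple sketch using only cuda.synchronize() and the standard library could look like this:

import time

threads_per_block = 256
blocks_per_grid = (N + threads_per_block - 1) // threads_per_block

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)

# Warm-up call triggers JIT compilation
vector_add_gpu[blocks_per_grid, threads_per_block](d_a, d_b, d_c)
cuda.synchronize()

start = time.perf_counter()
vector_add_gpu[blocks_per_grid, threads_per_block](d_a, d_b, d_c)
cuda.synchronize()  # kernel launches are asynchronous, so wait for completion
print(f"Kernel only: {time.perf_counter() - start:.6f} seconds")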

 

Wrapping Up

 
And that is it. You have now written your first CUDA kernel with Numba, without actually writing any C or CUDA code. Numba provides a simple interface for using the GPU from Python, making it much easier for Python engineers to get started with CUDA programming.

You can now use the same template to write more advanced CUDA algorithms, which are prevalent in machine learning and deep learning. If a problem fits the SIMD paradigm, it is usually a strong candidate for GPU acceleration.
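For instance, here is a hedged sketch of how the same template could be adapted to a SAXPY-style operation (alpha * x + y, a common building block in linear algebra); the kernel name and arguments are illustrative, and the launch logic from gpu_add carries over unchanged:

@cuda.jit
def saxpy_gpu(alpha, x, y, out):
    # Same indexing and guard pattern as vector_add_gpu
    position = cuda.grid(1)
    if position < len(x):
        out[position] = alpha * x[position] + y[position]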

The complete code is available as a Colab notebook that you can access here. Feel free to test it out and make small changes to get a better understanding of how CUDA indexing and execution work internally.
 
 

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
