Accelerating Physics Calculations with CUDA in Python using Numba: A Monte-Carlo Example

Description

Numba is a scientific computing framework for accelerating compute-heavy applications written in Python and NumPy. Numba achieves this acceleration by 'just-in-time' (JIT) compiling NumPy-aware Python code down to low-level machine code. Numba also supports GPU kernel compilation for NVIDIA CUDA and AMD ROCm. In this application, we explore the numba.cuda module on a CUDA-enabled NVIDIA GPU.
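
As a toy illustration of the CPU-side JIT (not part of the application below): decorating a function with numba.njit compiles it to machine code on its first call, so explicit Python loops run at native speed.

    import numpy as np
    from numba import njit

    @njit  # compiled to machine code on the first call
    def sum_squares(a):
        total = 0.0
        for value in a:          # explicit loops are fast under Numba
            total += value * value
        return total

    sum_squares(np.arange(1.0e6))  # first call compiles; later calls run at full speed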

Problem description

We consider a nonlinear pendulum in a double-well potential, driven by periodic forcing with a correlated red-noise component. We approximate the statistics of the solution using Monte Carlo simulation, running the code explored below over several ensemble sizes.
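
Schematically, and consistent with this description (the authoritative formulation is given in the link below), the model is a stochastic differential equation of the form \( \ddot{x} = -V'(x) - \gamma \dot{x} + A \cos(\omega t) + \eta(t) \), where \( V(x) \) is a double-well potential (for example, \( V(x) = x^4/4 - x^2/2 \)), \( \gamma \) is a damping coefficient, \( A \cos(\omega t) \) is the periodic forcing, and \( \eta(t) \) is red noise with a prescribed correlation time.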

For further in-depth mathematical and physical background on the problem, see the formulation here.

Getting Started

On a GPU-enabled HPC Linux machine, set up your Python environment using Anaconda (via the conda command) as follows:

  • module load anaconda
  • conda create --yes --name numba_gpu -c anaconda cudatoolkit numba
  • conda activate numba_gpu
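
Before running anything large, you can confirm that Numba can see the GPU with a quick check such as:

> python -c "from numba import cuda; print(cuda.is_available())"  # prints True on a working setup

(Running numba -s also prints a detailed system report, including CUDA device information.)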

The Code (CPU only)

Before attempting to accelerate our numerical solution code using GPU computing, it is instructive to implement a CPU solution as you normally would without access to a GPU. A CPU implementation also provides a useful baseline performance benchmark. The code assumes that the NumPy library is installed and available for use.

The CPU-based implementation in Python 3 with NumPy is available as a CodePen: see the Pen "Python Code - Main Time Step Loop (CPU)" by Derek Steinmoeller (@dsteinmo) on CodePen, and click 'Run Pen' to view it.

Notes:

  • The function 'do_steps' carries out the discrete time stepping, assuming the time derivatives of the two differential equations have first been discretized with a finite-difference approximation (a minimal sketch of such a function follows these notes).

  • Each time step begins by generating new random numbers, one for each ensemble member.

  • The correlated red noise is then calculated using the method of Bartosch [1]; this red noise is stored in a separate variable for use in the next loop iteration.

  • 32-bit (single-precision) floating-point numbers are used in all cases, for later convenience in the GPU implementation.
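
The CodePen remains the authoritative listing; the following is only a minimal sketch of what a 'do_steps' of the kind described might look like. The model form, parameter names (gamma, A, omega, tau, D), and all numerical values here are illustrative assumptions, not taken from the original code:

    import numpy as np

    def do_steps(x, v, r, n_steps, dt, gamma=0.25, A=0.3, omega=1.0,
                 tau=1.0, D=0.1):
        """Advance every ensemble member n_steps forward with forward Euler.

        x, v, r are float32 arrays of length N_ensemble holding each
        member's position, velocity, and red-noise state.
        """
        dVdx = lambda s: s**3 - s                     # V(x) = x**4/4 - x**2/2
        dt32 = np.float32(dt)
        rho = np.float32(np.exp(-dt / tau))           # red-noise memory factor [1]
        amp = np.float32(np.sqrt((1.0 - float(rho)**2) * D))
        t = np.float32(0.0)
        for _ in range(n_steps):
            # fresh white noise, one value per ensemble member
            xi = np.random.standard_normal(x.shape).astype(np.float32)
            # update the correlated red noise: rho carries the 'memory' of
            # the previous value, amp scales the new white-noise kick
            r = rho * r + amp * xi
            # acceleration: double-well force, damping, periodic forcing, noise
            a = -dVdx(x) - np.float32(gamma) * v \
                + np.float32(A * np.cos(omega * t)) + r
            x = x + dt32 * v                          # forward-Euler update
            v = v + dt32 * a
            t += dt32
        return x, v, r

A call might look like do_steps(x0, v0, r0, n_steps=100_000, dt=1.0e-3), where x0, v0, r0 are float32 arrays of length equal to the ensemble size.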

GPU Kernel Code (CUDA)

Using the fact that every simulation in the ensemble is independent, we can parallelize the solution procedure over a grid of GPU threads, with each thread representing a unit of compute work. A problem of size \( N_{ensemble} \) can be defined on a one-dimensional grid split into \( N_{blocks} \) thread-blocks with \( N_{threads} \) threads per block, such that \( N_{ensemble} = N_{blocks} \times N_{threads} \). The implementation details are given in the CodePen below.
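
For example, with an illustrative choice of 128 threads per block (not necessarily the value used in the CodePen), the grid configuration is:

    threads_per_block = 128                                             # N_threads
    blocks = (n_ensemble + threads_per_block - 1) // threads_per_block  # N_blocks, rounded up
    # launch syntax: kernel_name[blocks, threads_per_block](arguments...)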

The Python code that the Numba library compiles into a GPU kernel is available as a CodePen: see the Pen "Numba CUDA Python Example" by Derek Steinmoeller (@dsteinmo) on CodePen, and click 'Run Pen' to view it.

Notes:

  • The function 'do_steps_gpu_kernel' carries out the same discrete time stepping as the CPU code above, with the goal of obtaining results faster (a minimal sketch follows these notes).

  • The arrays that store the (independent) ensemble members' states must be copied to the GPU before the kernel is invoked.

  • A 1D grid of GPU threads is created to align with the arrays described in the previous bullet point. Each thread evolves a single ensemble member forward in time.

  • Numba provides a GPU-friendly normally distributed random-number generator (in numba.cuda.random), based on the xoroshiro128+ algorithm, which is used here.

  • While the previous (CPU) code listing uses lambda functions to represent the potential \(V(x)\) and its derivative, this is not possible within a Numba GPU kernel, since the kernel syntax is not 'lambda-aware'. Thus, we must 'unroll' these functions and inline the result.

  • We again explicitly declare all numbers and arrays as np.float32, so as to gain additional performance benefits on the GPU.
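
Again, the CodePen is the authoritative listing; the following is only a minimal sketch of what such a kernel and its host-side launch might look like, under the same assumed model and parameter names as the CPU sketch above. The red-noise factors rho and amp are precomputed on the host, and all values are illustrative:

    import math
    import numpy as np
    from numba import cuda
    from numba.cuda.random import (create_xoroshiro128p_states,
                                   xoroshiro128p_normal_float32)

    @cuda.jit
    def do_steps_gpu_kernel(x, v, r, rng_states, n_steps, dt,
                            gamma, A, omega, rho, amp):
        i = cuda.grid(1)                  # global thread index = ensemble member
        if i >= x.size:                   # guard: grid may be larger than N_ensemble
            return
        pos = x[i]
        vel = v[i]
        red = r[i]
        t = 0.0
        for _ in range(n_steps):
            # one normal deviate per thread from Numba's xoroshiro128+ generator
            xi = xoroshiro128p_normal_float32(rng_states, i)
            red = rho * red + amp * xi    # red-noise update, as on the CPU
            # dV/dx = x**3 - x inlined by hand: no lambdas inside the kernel
            acc = -(pos * pos * pos - pos) - gamma * vel \
                  + A * math.cos(omega * t) + red
            pos = pos + dt * vel          # forward-Euler update
            vel = vel + dt * acc
            t += dt
        x[i] = pos                        # write final state back to global memory
        v[i] = vel
        r[i] = red

    # Host side: copy state to the device, create per-thread RNG states, launch.
    n_ensemble = 8192
    threads_per_block = 128
    blocks = (n_ensemble + threads_per_block - 1) // threads_per_block
    x_d = cuda.to_device(np.full(n_ensemble, -1.0, dtype=np.float32))
    v_d = cuda.to_device(np.zeros(n_ensemble, dtype=np.float32))
    r_d = cuda.to_device(np.zeros(n_ensemble, dtype=np.float32))
    rng_states = create_xoroshiro128p_states(n_ensemble, seed=1234)

    dt, tau, D = 1.0e-3, 1.0, 0.1                      # illustrative values
    rho = np.float32(np.exp(-dt / tau))                # red-noise memory factor
    amp = np.float32(np.sqrt((1.0 - float(rho) ** 2) * D))
    do_steps_gpu_kernel[blocks, threads_per_block](
        x_d, v_d, r_d, rng_states, 100_000, np.float32(dt),
        np.float32(0.25), np.float32(0.3), np.float32(1.0), rho, amp)
    x_final = x_d.copy_to_host()                       # retrieve results when done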

Performance Comparison

Here, a comparison is carried out to illustrate the performance benefits of GPU-based parallelism. Note:

  • For smaller ensemble sizes, the acceleration factor is only about 2-4x.
  • For larger ensembles, the code experiences a speed-up by a factor of 8-12x.
  • This apparent discrepancy can be explained by CPU caching at the smaller problem sizes: the smaller memory allocations may fit entirely inside the local cache, improving memory-access performance [2].
Table: run time (seconds) on CPU and GPU by ensemble size.

    Ensemble size       128    1024   2048   8192   16384  32768
    CPU (NumPy)         4.26   7.27   9.99   28.03  50.82  96.86
    GPU (numba.cuda)    1.94   1.97   2.11   3.08   4.54   8.10
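
If you reproduce such timings yourself, note that CUDA kernel launches return immediately, so the device must be synchronized before stopping the clock or the GPU time will be underestimated. A minimal sketch, reusing the names from the kernel sketch above:

    import time
    from numba import cuda

    start = time.perf_counter()
    do_steps_gpu_kernel[blocks, threads_per_block](
        x_d, v_d, r_d, rng_states, 100_000, np.float32(dt),
        np.float32(0.25), np.float32(0.3), np.float32(1.0), rho, amp)
    cuda.synchronize()    # kernel launches are asynchronous; wait before timing
    print(f"GPU run time: {time.perf_counter() - start:.2f} s")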

Getting the code

The source code is available in git or as a zip archive.

Please follow the instructions in Getting Started to create an appropriate Anaconda environment for executing the code. Once the code is on a GPU-enabled machine with the Anaconda environment activated, it can be executed with:

> python stochastic_double_well_gpu.py

or,

> python stochastic_double_well_cpu.py  # CPU-only version.

Getting help

To run CUDA-based Python code in an MFCF-supported computing environment, please refer to the MFCF documentation on Speciality Research Linux Servers. The code will need to be submitted and executed using the Slurm job manager.

Contact MFCF for additional help.

References

[1] Bartosch, L. "Generation of Colored Noise." International Journal of Modern Physics C 12 (2001): 851-855.

[2] Slotin, S. "Algorithmica v3." https://en.algorithmica.org/hpc/cpu-cache/cache-lines/. Accessed 2023-11-09.
