Introduction to Heterogeneous Parallel Programming Using CUDA
Duration: 5 Days
Course Background
CUDA (Compute Unified Device Architecture) is a parallel computing platform combined with a programming model. It was developed by NVIDIA to give programmers direct access to the virtual instruction set and memory of the parallel computational elements in CUDA-capable GPUs. C/C++ programmers develop code in 'CUDA C/C++' and compile it with "nvcc", NVIDIA's LLVM-based C/C++ compiler. This intensive course is designed to get programmers up to speed with designing and writing C and C++ code that can take advantage of the parallel computation capabilities offered by CUDA. A final section of the course also considers the potential benefits of parallel programming using clusters of servers equipped with powerful CUDA-capable GPUs.
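To give a flavour of the programming model, the sketch below is a minimal CUDA C program (illustrative only, not taken from the course materials): each GPU thread prints its global index, and the single source file compiles with nvcc.

    #include <stdio.h>

    // Kernel: runs on the GPU, once per launched thread.
    __global__ void hello(void)
    {
        printf("Hello from thread %d\n", blockIdx.x * blockDim.x + threadIdx.x);
    }

    int main(void)
    {
        hello<<<2, 4>>>();        // launch 2 blocks of 4 threads each
        cudaDeviceSynchronize();  // wait for the kernel and flush its printf output
        return 0;
    }

Compile and run with, for example, "nvcc hello.cu -o hello && ./hello".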
Course Prerequisites and Target Audience
Attendees should be experienced C/C++ programmers with a sound knowledge of
- Pointers and pointer operations
- Computing using multidimensional arrays
- File I/O and file and directory manipulation
- Data structures and classes
- Collection classes such as linked lists and vectors
- Multithreading
- Memory allocation and memory management
- Computer architectures and instruction sets
- Basic techniques for code profiling and code optimisation
- Code debugging
Course Outline
- Overview of GPU computing
- Data-parallel architectures and the GPU programming model
- GPU memory model and thread cooperation
- Introduction to CUDA
- Different memory and variable types
- Control flow and synchronisation
- GPU memory management, simple CUDA kernels, shared memory and constant memory
- Launching a kernel, copying data to/from the graphics card, error checking and printing from kernel code (see the first code sketch after this outline)
- Asynchronous operations, CUDA features, debugging, and an overview of CUFFT, CUBLAS and Thrust
- Warp shuffles and reduction/scan operations (see the warp-level sketch after this outline)
- Multiple GPUs
- Introduction to Optimisation
- Arithmetic optimisations, occupancy calculations and memory access patterns
- CUDA profiling
- Resource management, latency and occupancy
- Memory performance optimisations
- Asynchronous operations
- Profiling applications
- OpenACC
- Case studies and labs
- Monte Carlo simulation using NVIDIA's CURAND library
- Constant memory, random number generation, kernel timing, minimising device memory bandwidth requirements
- 3D Laplace and ADI finite difference solvers
- Thread block size optimisation - dynamic shared memory, thread synchronisation and reduction
- Working with the CUBLAS and CUFFT libraries
- Solving tri-diagonal equations - libraries, templates and g++
- Scan operations and recurrence relations
- Pattern matching
- Auto tuning
- Deploying CUDA systems in the cloud
- Coarse-grained vs. fine-grained parallelisation
- CUDA and OpenFOAM - an overview
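Illustrative Code Sketches
The kernel-launch topic above revolves around one host-side pattern: allocate device memory, copy inputs to the card, launch the kernel, check for errors, and copy results back. The sketch below shows that pattern for vector addition (the kernel name, array size and block size are illustrative assumptions, not course material).

    #include <stdio.h>

    #define N 1024

    // Kernel: each thread adds one pair of elements.
    __global__ void add(const float *a, const float *b, float *c)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) c[i] = a[i] + b[i];
    }

    int main(void)
    {
        static float h_a[N], h_b[N], h_c[N];
        for (int i = 0; i < N; i++) { h_a[i] = i; h_b[i] = 2.0f * i; }

        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, N * sizeof(float));
        cudaMalloc(&d_b, N * sizeof(float));
        cudaMalloc(&d_c, N * sizeof(float));

        // Copy inputs host-to-device, launch enough blocks to cover N, copy back.
        cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, N * sizeof(float), cudaMemcpyHostToDevice);
        add<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c);

        // Launch errors are reported via cudaGetLastError().
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess)
            fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

        cudaMemcpy(h_c, d_c, N * sizeof(float), cudaMemcpyDeviceToHost);
        printf("h_c[100] = %f\n", h_c[100]);  // expect 300.0

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }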
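The warp shuffles topic covers exchanging data between the threads of a warp without touching shared memory. The sketch below sums the values held by the 32 threads of one warp using __shfl_down_sync (available from CUDA 9 onwards; the kernel name is an illustrative assumption).

    #include <stdio.h>

    // Kernel: a single 32-thread warp reduces its threads' values to a sum.
    __global__ void warp_sum(int *out)
    {
        int val = threadIdx.x;  // each thread contributes its lane index
        // Halve the shuffle offset each step; after 5 steps lane 0 holds the total.
        for (int offset = 16; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);
        if (threadIdx.x == 0)
            *out = val;
    }

    int main(void)
    {
        int *d_out, h_out;
        cudaMalloc(&d_out, sizeof(int));
        warp_sum<<<1, 32>>>(d_out);
        cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
        printf("warp sum = %d (expected 496)\n", h_out);  // 0 + 1 + ... + 31
        cudaFree(d_out);
        return 0;
    }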