PL00, Parallel computing
List of posts to read before reading this article
Contents
- Introduction to CUDA Programming
- CUDA Memory Management
- CUDA Thread Programming
- Kernel Execution Model and Optimization Strategies
    - Kernel execution with CUDA streams
    - Pipelining the GPU execution
    - The CUDA callback function
    - CUDA streams with priority
    - Kernel execution time estimation using CUDA events
    - CUDA dynamic parallelism
    - Grid-level cooperative groups
    - CUDA kernel calls with OpenMP
    - Multi-Process Service
    - Kernel execution overhead comparison
- CUDA Application Profiling and Debugging
- Scalable Multi-GPU Programming
- Parallel Programming Patterns in CUDA
- Programming with Libraries and Other Languages
- GPU Programming Using OpenACC
- Deep Learning Acceleration with CUDA
Introduction to CUDA Programming
| Hardware | Bus standard | 
|---|---|
| CPU | PCIe | 
| GPU | NVLink | 
The history of high-performance computing

Heterogeneous computing

Low latency versus higher throughput

GPU architecture

| Hardware | Software | 
|---|---|
| CUDA core/SIMD lane | CUDA thread | 
| Streaming multiprocessor | CUDA block | 
| GPU device | GRID/kernel | 
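To make this mapping concrete, here is a minimal sketch (not taken from the book; the kernel name `scale` and the launch configuration are illustrative assumptions): each CUDA thread handles one array element and executes on a CUDA core, the threads of a block are scheduled together on a streaming multiprocessor, and all blocks of the launch form the grid executed by the GPU device.

```cuda
// scale.cu: one thread per element; blocks run on SMs; all blocks form the grid.
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
    // Global index = block offset + thread offset within the block.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    float *data;
    cudaMalloc(&data, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));

    int block = 256;                      // threads per block (CUDA block)
    int grid  = (n + block - 1) / block;  // number of blocks (the grid)
    scale<<<grid, block>>>(data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```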
CUDA Memory Management
 

Global memory/device memory
Coalesced versus uncoalesced global memory access
 
 
 
 
 

Shared memory
Bank conflicts and their effect on shared memory
 
 
 

Read-only data/cache

Registers in GPU

Pinned memory
Unified memory

Understanding unified memory page allocation and transfer
- First, new pages are allocated on the GPU and the CPU (on a first-touch basis). If the requested page is not present on the device and is mapped to the other processor's memory, a device page-table fault occurs. When *x, which resides in page 2 and is currently mapped to CPU memory, is accessed on the GPU, a page fault is raised. Take a look at the following diagram:
- In the next step, the old page on the CPU is unmapped, as shown in the following diagram:
- Next, the data is copied from the CPU to the GPU, as shown in the following diagram:
- Finally, the new pages are mapped on the GPU, while the old pages are freed on the CPU, as shown in the following diagram. A minimal managed-memory code sketch follows this list.
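The same sequence can be reproduced with a few lines of managed-memory code. This is a minimal sketch, assuming a Pascal or later GPU where on-demand page faulting is available; the variable names are illustrative.

```cuda
// um_migration.cu: unified memory page allocation and on-demand migration.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *x)
{
    // First device access faults, unmaps the CPU page, copies it to the GPU,
    // and maps it into the GPU page table (the steps pictured above).
    *x += 1;
}

int main(void)
{
    int *x;
    cudaMallocManaged(&x, sizeof(int));  // single managed allocation

    *x = 41;                  // first touch on the CPU: the page lives in host memory

    increment<<<1, 1>>>(x);   // GPU access raises a device page fault and migrates the page
    cudaDeviceSynchronize();  // wait before the CPU touches the page again

    printf("%d\n", *x);       // CPU access migrates the page back on demand
    cudaFree(x);
    return 0;
}
```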
CUDA Thread Programming
CUDA threads, blocks, and the GPU
 
 

Understanding parallel reduction

Naive parallel reduction using global memory

Reducing kernels using shared memory

Minimizing the CUDA warp divergence effect

Determining divergence as a performance bottleneck
Interleaved addressing

Sequential addressing

Performance modeling and balancing the limiter
The Roofline model

Warp-level primitive programming
 
 

Parallel reduction with warp primitives
 

Cooperative Groups for flexible thread handling
Cooperative Groups in a CUDA thread block

Benefits of Cooperative Groups
Modularity

Atomic operations

Low/mixed precision operations
Dot product operations and accumulation for 8-bit integers and 16-bit data (DP4A and DP2A)

Kernel Execution Model and Optimization Strategies
Kernel execution with CUDA streams
The usage of CUDA streams
 

Stream-level synchronization

Working with the default stream

Pipelining the GPU execution
Concept of GPU pipelining
 
 

Building a pipelining execution
 

The CUDA callback function

CUDA streams with priority
Stream execution with priorities

Kernel execution time estimation using CUDA events
Using CUDA events

CUDA dynamic parallelism
Usage of dynamic parallelism

Grid-level cooperative groups
Understanding grid-level cooperative groups

CUDA kernel calls with OpenMP
CUDA kernel calls with OpenMP

Multi-Process Service

Enabling MPS

Profiling an MPI application and understanding MPS operation
 
 

Kernel execution overhead comparison
Comparison of three executions

CUDA Application Profiling and Debugging
Scalable Multi-GPU Programming
Solving a linear equation using Gaussian elimination
Single GPU hotspot analysis of Gaussian elimination

GPUDirect peer to peer
 
 
 

Single node – multi-GPU Gaussian elimination

GPUDirect RDMA
 

CUDA-aware MPI

Multinode – multi-GPU Gaussian elimination

CUDA streams
Application 1 – using multiple streams to overlap data transfers with kernel execution
 

Application 2 – using multiple streams to run kernels on multiple devices

Additional tricks
Collective communication acceleration using NCCL
 

Parallel Programming Patterns in CUDA
Matrix multiplication optimization
 

Performance analysis of the tiling approach

Convolution
Convolution operation in CUDA

Optimization strategy
 

Prefix sum (scan)
 

Building a global size scan

Compact and split
 
 
 

N-body
Implementing an N-body simulation on GPU

Histogram calculation
Understanding a parallel histogram

Quicksort and CUDA dynamic parallelism
Quicksort in CUDA using dynamic parallelism
 

Radix sort
Programming with Libraries and Other Languages
Linear algebra operations using cuBLAS
| Level in cuBLAS | Operation | 
|---|---|
| Level 1 | vector-vector | 
| Level 2 | matrix-vector | 
| Level 3 | matrix-matrix | 
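As a concrete illustration of a level-3 (matrix-matrix) call, here is a minimal cuBLAS SGEMM sketch computing C = alpha*A*B + beta*C; the matrix size and values are arbitrary assumptions, and error checking is omitted for brevity.

```cuda
// sgemm_sketch.cu: level-3 cuBLAS call, C = alpha * A * B + beta * C.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main(void)
{
    const int n = 4;  // small square matrices for brevity
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS uses column-major storage; the leading dimension is n for all matrices.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```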
GPU Programming Using OpenACC
OpenACC directives
Parallel and loop directives
 

Data directive
 

Asynchronous programming in OpenACC

Applying the unstructured data and async directives to merge image code

Additional important directives and clauses
Gang/vector/worker

Deep Learning Acceleration with CUDA
List of posts followed by this article
- Jaegeun Han, Bharatkumar Sharma, *Learn CUDA Programming: A Beginner's Guide to GPU Programming and Parallel Computing with CUDA 10.x and C/C++*, Packt Publishing (2019)
Reference