PL00, Parallel computing
Contents
- Introduction to CUDA Programming
- CUDA Memory Management
- CUDA Thread Programming
- Kernel Execution Model and Optimization Strategies
- Kernel execution with CUDA streams
- Pipelining the GPU execution
- The CUDA callback function
- CUDA streams with priority
- Kernel execution time estimation using CUDA events
- CUDA dynamic parallelism
- Grid-level cooperative groups
- CUDA kernel calls with OpenMP
- Multi-Process Service
- Kernel execution overhead comparison
- CUDA Application Profiling and Debugging
- Scalable Multi-GPU Programming
- Parallel Programming Patterns in CUDA
- Programming with Libraries and Other Languages
- GPU Programming Using OpenACC
- Deep Learning Acceleration with CUDA
Introduction to CUDA Programming
Hardware | Bus standard |
---|---|
CPU | PCIe |
GPU | NVLink |
The history of high-performance computing
Heterogeneous computing
Low latency versus higher throughput
GPU architecture
Hardware | Software |
---|---|
CUDA Core/SIMD code | CUDA thread |
Streaming multiprocessor | CUDA block |
GPU device | GRID/kernel |
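The table above maps the software hierarchy onto the hardware. The following minimal kernel-launch sketch (not taken from the book; the kernel name `scale` and the sizes are arbitrary) shows how threads, blocks, and the grid appear in code:

```cuda
#include <cuda_runtime.h>

// Each CUDA thread (executed on a CUDA core/SIMD lane) handles one element.
__global__ void scale(float *data, float factor, int n)
{
    // grid -> block -> thread hierarchy exposed through built-in variables
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Blocks of the grid are distributed across the streaming multiprocessors
    // of the GPU device; the threads of one block run on the cores of one SM.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocksPerGrid, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```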
CUDA Memory Management
Global memory/device memory
Coalesced versus uncoalesced global memory access
Shared memory
Bank conflicts and their effect on shared memory
Read-only data/cache
Registers in GPU
Pinned memory
Unified memory
Understanding unified memory page allocation and transfer
When a kernel accesses a managed page that is currently resident in CPU memory, the following sequence takes place:

- First, new pages are allocated on the GPU and the CPU on a first-touch basis. If the page is not present on the device and is mapped elsewhere, a device page-table fault occurs. For example, when `*x`, which resides in page 2 and is currently mapped to CPU memory, is accessed on the GPU, a page fault is raised.
- In the next step, the old page on the CPU is unmapped.
- Next, the data is copied from the CPU to the GPU.
- Finally, the new page is mapped on the GPU, while the old page is freed on the CPU.
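A minimal sketch of this behavior (not from the book; the variable names and sizes are placeholders) using a `cudaMallocManaged` allocation that is first touched on the CPU and then accessed by the GPU, which triggers the fault-and-migrate sequence described above:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *x, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        x[idx] += 1;   // first GPU touch: page fault, then migration from CPU
}

int main()
{
    const int n = 1 << 16;
    int *x;
    // One allocation visible to both CPU and GPU; pages migrate on demand.
    cudaMallocManaged(&x, n * sizeof(int));

    // First touch on the CPU: the pages are resident in CPU memory.
    for (int i = 0; i < n; i++)
        x[i] = i;

    // GPU access faults on the CPU-resident pages; the driver unmaps them on
    // the CPU, copies the data over, and maps the new pages on the GPU.
    increment<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();

    // Touching the data on the CPU again migrates the pages back.
    printf("x[0] = %d, x[n-1] = %d\n", x[0], x[n - 1]);

    cudaFree(x);
    return 0;
}
```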
CUDA Thread Programming
CUDA threads, blocks, and the GPU
Understanding parallel reduction
Naive parallel reduction using global memory
Reducing kernels using shared memory
Minimizing the CUDA warp divergence effect
Determining divergence as a performance bottleneck
Interleaved addressing
Sequential addressing
Performance modeling and balancing the limiter
The Roofline model
Warp-level primitive programming
Parallel reduction with warp primitives
Cooperative Groups for flexible thread handling
Cooperative Groups in a CUDA thread block
Benefits of Cooperative Groups
Modularity
Atomic operations
Low/mixed precision operations
Dot product operations and accumulation for 8-bit integers and 16-bit data (DP4A and DP2A)
Kernel Execution Model and Optimization Strategies
Kernel execution with CUDA streams
The usage of CUDA streams
Stream-level synchronization
Working with the default stream
Pipelining the GPU execution
Concept of GPU pipelining
Building a pipelining execution
The CUDA callback function
CUDA streams with priority
Stream execution with priorities
Kernel execution time estimation using CUDA events
Using CUDA events
CUDA dynamic parallelism
Usage of dynamic parallelism
Grid-level cooperative groups
Understanding grid-level cooperative groups
CUDA kernel calls with OpenMP
CUDA kernel calls with OpenMP
Multi-Process Service
Enabling MPS
Profiling an MPI application and understanding MPS operation
Kernel execution overhead comparison
Comparison of three executions
CUDA Application Profiling and Debugging
Scalable Multi-GPU Programming
Solving a linear equation using Gaussian elimination
Single GPU hotspot analysis of Gaussian elimination
GPUDirect peer to peer
Single node – multi-GPU Gaussian elimination
GPUDirect RDMA
CUDA-aware MPI
Multinode – multi-GPU Gaussian elimination
CUDA streams
Application 1 – using multiple streams to overlap data transfers with kernel execution
Application 2 – using multiple streams to run kernels on multiple devices
Additional tricks
Collective communication acceleration using NCCL
Parallel Programming Patterns in CUDA
Matrix multiplication optimization
Performance analysis of the tiling approach
Convolution
Convolution operation in CUDA
Optimization strategy
Prefix sum (scan)
Building a global size scan
Compact and split
N-body
Implementing an N-body simulation on GPU
Histogram calculation
Understanding a parallel histogram
Quicksort and CUDA dynamic parallelism
Quicksort in CUDA using dynamic parallelism
Radix sort
Programming with Libraries and Other Languages
Linear algebra operation using cuBLAS
Level in cuBLAS | Operation |
---|---|
Level 1 | vector-vector |
Level 2 | matrix-vector |
Level 3 | matrix-matrix |
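As an illustration of a Level-3 (matrix-matrix) call, here is a minimal SGEMM sketch (not from the book; the matrix size and fill values are arbitrary). It computes C = alpha * A * B + beta * C, with matrices stored in column-major order as cuBLAS expects; link against cuBLAS, e.g. `nvcc example.cu -lcublas`.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int n = 4;                       // square matrices for brevity
    std::vector<float> h_A(n * n, 1.0f);   // A: all ones
    std::vector<float> h_B(n * n, 2.0f);   // B: all twos
    std::vector<float> h_C(n * n, 0.0f);

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, n * n * sizeof(float));
    cudaMalloc(&d_B, n * n * sizeof(float));
    cudaMalloc(&d_C, n * n * sizeof(float));
    cudaMemcpy(d_A, h_A.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Level-3 (matrix-matrix) routine: C = alpha * A * B + beta * C
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);

    cudaMemcpy(h_C.data(), d_C, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("C[0] = %.1f (expected %.1f)\n", h_C[0], 2.0f * n);

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}
```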
GPU Programming Using OpenACC
OpenACC directives
Parallel and loop directives
Data directive
Asynchronous programming in OpenACC
Applying the unstructured data and async directives to merge image code
Additional important directives and clauses
Gang/vector/worker
Deep Learning Acceleration with CUDA
List of posts followed by this article
- Jaegeun Han, Bharatkumar Sharma, *Learn CUDA Programming: A beginner's guide to GPU programming and parallel computing with CUDA 10.x and C/C++*, Packt Publishing (2019)