
PL00, Parallel computing



Contents


Introduction to CUDA Programming

Hardware bus standards

  • CPU: PCIe
  • GPU: NVLink




The history of high-performance computing






Heterogeneous computing






Low latency versus higher throughput






GPU architecture


| Hardware | Software |
|---|---|
| CUDA core / SIMD code | CUDA thread |
| Streaming multiprocessor (SM) | CUDA block |
| GPU device | Grid/kernel |
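
As a minimal sketch of this mapping (the `vector_add` kernel and all sizes here are illustrative, not taken from the book), each CUDA thread computes one output element, threads are grouped into blocks scheduled on SMs, and the blocks together form the grid of one kernel launch:

```c++
#include <cstdio>
#include <cuda_runtime.h>

// One CUDA thread per element; blocks of threads run on SMs, and the
// whole grid of blocks constitutes one kernel launch on the device.
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                        // CUDA threads per block
    int blocks = (n + threads - 1) / threads; // blocks per grid
    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %.1f\n", h_c[0]);  // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}
```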





CUDA Memory Management





Global memory/device memory

Coalesced versus uncoalesced global memory access

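A hedged sketch of the difference (kernel names are illustrative): in the coalesced version, consecutive threads in a warp touch consecutive addresses, so the hardware combines the warp's loads into a few wide transactions; in the strided version the accesses scatter across many memory segments and waste bandwidth:

```c++
// Coalesced: thread i reads element i, so a warp's 32 loads fall in
// contiguous memory and merge into few transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Uncoalesced: thread i reads element i * stride, scattering the warp's
// accesses across many memory segments.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}
```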





Shared memory

Bank conflicts and their effect on shared memory

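A common fix is to pad each shared-memory row by one element. A minimal transpose sketch (assuming the matrix width is a multiple of the 32×32 tile; not the book's exact code):

```c++
#define TILE 32

// Without padding, column accesses tile[x][y] hit the same shared
// memory bank for every thread in a warp (a 32-way bank conflict).
// Padding each row by one element shifts columns across banks.
__global__ void transpose_padded(const float *in, float *out, int width) {
    __shared__ float tile[TILE][TILE + 1];  // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced load
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;  // transposed block offset
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free read
}
```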





Read-only data/cache






Registers in GPU






Pinned memory
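
A minimal sketch of pinned (page-locked) host memory; pinned buffers are what makes `cudaMemcpyAsync` truly asynchronous:

```c++
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;
    float *h_pinned, *d_data;

    // cudaMallocHost allocates page-locked (pinned) host memory, which
    // the GPU's DMA engine can transfer without an intermediate staging
    // copy, and which is required for asynchronous copies to overlap.
    cudaMallocHost(&h_pinned, n * sizeof(float));
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_data, h_pinned, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);  // can overlap host work
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_pinned);  // pinned memory has its own free call
    return 0;
}
```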





Unified memory





Understanding unified memory page allocation and transfer

  1. First, new pages are allocated on the GPU and the CPU (on a first-touch basis). If a requested page is not present on the device and is mapped elsewhere, a device page-table fault occurs. For example, when *x, which resides in page 2 and is currently mapped to CPU memory, is accessed on the GPU, it triggers a page fault.

  2. In the next step, the old page on the CPU is unmapped.

  3. Next, the data is copied from the CPU to the GPU.

  4. Finally, the new pages are mapped on the GPU, while the old pages are freed on the CPU.
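
A minimal sketch of this behavior with `cudaMallocManaged` (the increment kernel is illustrative):

```c++
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *x) { *x += 1; }

int main() {
    int *x;
    // A single managed allocation is visible to both CPU and GPU;
    // pages migrate on demand via the fault mechanism described above.
    cudaMallocManaged(&x, sizeof(int));

    *x = 41;                  // first touch on the CPU maps the page there
    increment<<<1, 1>>>(x);   // GPU access faults, migrates, then maps the page
    cudaDeviceSynchronize();  // required before the CPU touches x again

    printf("%d\n", *x);       // prints 42; the page migrates back on CPU access
    cudaFree(x);
    return 0;
}
```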






CUDA Thread Programming

CUDA threads, blocks, and the GPU






Understanding parallel reduction





Naive parallel reduction using global memory

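A sketch of the naive approach (assuming n is a power of two for brevity; not the book's exact code): every partial sum goes through global memory, so the host relaunches the kernel with a halving stride:

```c++
// One reduction step entirely in global memory: element i accumulates
// element i + stride. Steps are ordered by the stream between launches.
__global__ void reduce_global_step(float *data, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < stride)
        data[i] += data[i + stride];
}

// Host-side driver: relaunch until a single value remains in data[0].
void reduce_global(float *d_data, int n) {
    for (int stride = n / 2; stride > 0; stride /= 2) {
        int threads = 256;
        int blocks = (stride + threads - 1) / threads;
        reduce_global_step<<<blocks, threads>>>(d_data, stride);
    }
}
```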





Reducing kernels using shared memory

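A sketch of a block-level reduction that keeps intermediate sums in shared memory (launched with `threads * sizeof(float)` bytes of dynamic shared memory; the per-block partial sums in `out` still need a final pass):

```c++
// Each block loads its slice into shared memory, reduces it on-chip,
// and writes one partial sum back to global memory.
__global__ void reduce_shared(const float *in, float *out, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block; only shared memory traffic here.
    // This uses sequential addressing, which also avoids warp divergence.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = s[0];  // one partial result per block
}
```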





Minimizing the CUDA warp divergence effect





Determining divergence as a performance bottleneck

Interleaved addressing

Sequential addressing





Performance modeling and balancing the limiter

The Roofline model






Warp-level primitive programming





Parallel reduction with warp primitives

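A sketch using the warp shuffle primitive (`out` is assumed zero-initialized before the launch): lanes exchange registers directly, so the final 32-to-1 stage needs neither shared memory nor `__syncthreads()`:

```c++
// Warp-level reduction with __shfl_down_sync: each lane adds the value
// held by the lane `offset` positions above it.
__device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 holds the warp's sum
}

__global__ void reduce_warp(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (i < n) ? in[i] : 0.0f;
    val = warp_reduce_sum(val);
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(out, val);  // one atomic per warp combines partial sums
}
```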





Cooperative Groups for flexible thread handling

Cooperative Groups in a CUDA thread block

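A minimal sketch (assuming `out` is zero-initialized): Cooperative Groups makes thread groupings explicit objects, here the whole thread block and a 32-thread tile of it, each with its own rank and synchronization scope:

```c++
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void tile_reduce(const float *in, float *out, int n) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int i = blockIdx.x * blockDim.x + block.thread_rank();
    float v = (i < n) ? in[i] : 0.0f;

    // Tile-scoped shuffle: no explicit mask, no shared memory needed.
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        v += tile.shfl_down(v, offset);

    if (tile.thread_rank() == 0)
        atomicAdd(out, v);  // one atomic per 32-thread tile
}
```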




Benefits of Cooperative Groups

Modularity





Atomic operations

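A minimal illustrative sketch: without an atomic, concurrent read-modify-write updates from different threads would lose increments:

```c++
// atomicAdd makes the read-modify-write indivisible, so concurrent
// updates from different threads are never lost; only conflicting
// updates are serialized.
__global__ void count_positive(const float *data, int n, unsigned int *count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f)
        atomicAdd(count, 1u);
}
```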





Low/mixed precision operations

Dot product operations and accumulation for 8-bit integers and 16-bit data (DP4A and DP2A)

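A sketch of DP4A via the `__dp4a` intrinsic (requires compute capability 6.1 or later; the kernel and accumulation scheme are illustrative):

```c++
// __dp4a treats each 32-bit operand as four packed 8-bit integers,
// multiplies them pairwise, sums the four products, and adds the result
// to a 32-bit accumulator, all in a single instruction.
__global__ void dot_int8(const int *a, const int *b, int *acc, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int partial = __dp4a(a[i], b[i], 0);  // 4-way int8 dot product
        atomicAdd(acc, partial);
    }
}
```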






Kernel Execution Model and Optimization Strategies

Kernel execution with CUDA streams

The usage of CUDA streams

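A sketch of multi-stream usage (`dummy_kernel`, the chunk sizes, and the stream count are illustrative): each non-default stream copies and processes its own chunk, and independent streams may overlap copies with kernel execution:

```c++
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20, n_streams = 4;
    const int chunk = n / n_streams;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));  // pinned, required for async copies
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t streams[n_streams];
    for (int s = 0; s < n_streams; s++)
        cudaStreamCreate(&streams[s]);

    // Work issued to different streams is independent, so chunk s+1's
    // host-to-device copy can overlap chunk s's kernel.
    for (int s = 0; s < n_streams; s++) {
        int off = s * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        dummy_kernel<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();  // wait for all streams to drain

    for (int s = 0; s < n_streams; s++)
        cudaStreamDestroy(streams[s]);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}
```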




Stream-level synchronization





Working with the default stream






Pipelining the GPU execution

Concept of GPU pipelining





Building a pipelining execution






The CUDA callback function






CUDA streams with priority

Stream execution with priorities

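A minimal sketch of creating prioritized streams (note that in CUDA, numerically lower values mean higher priority):

```c++
#include <cuda_runtime.h>

int main() {
    // Query the valid priority range for this device.
    int least, greatest;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t high_prio, low_prio;
    cudaStreamCreateWithPriority(&high_prio, cudaStreamNonBlocking, greatest);
    cudaStreamCreateWithPriority(&low_prio, cudaStreamNonBlocking, least);

    // Blocks launched into high_prio are preferred by the scheduler
    // over blocks queued in low_prio, e.g.:
    // kernel<<<grid, block, 0, high_prio>>>(...);

    cudaStreamDestroy(high_prio);
    cudaStreamDestroy(low_prio);
    return 0;
}
```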





Kernel execution time estimation using CUDA events

Using CUDA events

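A minimal timing sketch (the `work` kernel is illustrative): the elapsed time between two recorded events brackets exactly the GPU work enqueued between them:

```c++
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                 // recorded into the default stream
    work<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);             // wait until the stop event completes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop); // GPU-side elapsed time
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```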





CUDA dynamic parallelism

Usage of dynamic parallelism

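A minimal sketch of a device-side launch (illustrative kernels; compile with relocatable device code, e.g. `nvcc -rdc=true`): with dynamic parallelism, a kernel can launch child kernels itself, which suits recursive or data-dependent work:

```c++
#include <cstdio>

__global__ void child(int depth) {
    printf("child at depth %d, thread %d\n", depth, threadIdx.x);
}

__global__ void parent() {
    // Only one thread launches, to avoid a flood of child grids.
    if (threadIdx.x == 0)
        child<<<1, 4>>>(1);  // device-side kernel launch
}

int main() {
    parent<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```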





Grid-level cooperative groups

Understanding grid-level cooperative groups






CUDA kernel calls with OpenMP


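A sketch of the common pattern (illustrative kernel and sizes; compile with `nvcc -Xcompiler -fopenmp`): one OpenMP host thread per GPU, each selecting its own device and launching work independently:

```c++
#include <cuda_runtime.h>
#include <omp.h>

__global__ void kernel_work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    int n_gpus = 0;
    cudaGetDeviceCount(&n_gpus);
    const int n = 1 << 20;

    // Each OpenMP thread owns one GPU for the duration of the region.
    #pragma omp parallel num_threads(n_gpus)
    {
        int gpu = omp_get_thread_num();
        cudaSetDevice(gpu);

        float *d;
        cudaMalloc(&d, n * sizeof(float));
        kernel_work<<<(n + 255) / 256, 256>>>(d, n);
        cudaDeviceSynchronize();  // per-thread, per-device wait
        cudaFree(d);
    }
    return 0;
}
```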





Multi-Process Service





Enabling MPS





Profiling an MPI application and understanding MPS operation






Kernel execution overhead comparison

Comparison of three executions






CUDA Application Profiling and Debugging





Scalable Multi-GPU Programming

Solving a linear equation using Gaussian elimination

Single GPU hotspot analysis of Gaussian elimination






GPUDirect peer to peer

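A minimal P2P sketch (assumes at least two GPUs on a P2P-capable interconnect): once peer access is enabled, device-to-device copies go over PCIe/NVLink without staging through host memory:

```c++
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Check whether GPU 0 can access GPU 1's memory directly.
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);
    if (!can_access) {
        printf("P2P not supported between GPU 0 and GPU 1\n");
        return 0;
    }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  // device 0 may now map device 1 memory

    const int n = 1 << 20;
    float *d0, *d1;
    cudaSetDevice(0); cudaMalloc(&d0, n * sizeof(float));
    cudaSetDevice(1); cudaMalloc(&d1, n * sizeof(float));

    // Direct device-to-device copy over the interconnect.
    cudaMemcpyPeer(d0, 0, d1, 1, n * sizeof(float));
    cudaDeviceSynchronize();

    cudaFree(d1);
    cudaSetDevice(0); cudaFree(d0);
    return 0;
}
```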




Single node – multi-GPU Gaussian elimination






GPUDirect RDMA





CUDA-aware MPI





Multinode – multi-GPU Gaussian elimination






CUDA streams

Application 1 – using multiple streams to overlap data transfers with kernel execution





Application 2 – using multiple streams to run kernels on multiple devices






Additional tricks

Collective communication acceleration using NCCL

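A single-process, multi-GPU all-reduce sketch (buffer contents and sizes are illustrative; link with `-lnccl`): every GPU's buffer ends up holding the element-wise sum over all devices:

```c++
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    int n_gpus = 0;
    cudaGetDeviceCount(&n_gpus);
    if (n_gpus > 8) n_gpus = 8;  // fixed-size arrays below
    const int n = 1 << 20;

    // One communicator per GPU within this single process;
    // a NULL device list uses devices 0..n_gpus-1.
    ncclComm_t comms[8];
    ncclCommInitAll(comms, n_gpus, nullptr);

    float *bufs[8];
    cudaStream_t streams[8];
    for (int g = 0; g < n_gpus; g++) {
        cudaSetDevice(g);
        cudaMalloc(&bufs[g], n * sizeof(float));
        cudaMemset(bufs[g], 0, n * sizeof(float));
        cudaStreamCreate(&streams[g]);
    }

    // Group calls batch the per-GPU submissions so NCCL can schedule
    // them as one collective.
    ncclGroupStart();
    for (int g = 0; g < n_gpus; g++)
        ncclAllReduce(bufs[g], bufs[g], n, ncclFloat, ncclSum,
                      comms[g], streams[g]);
    ncclGroupEnd();

    for (int g = 0; g < n_gpus; g++) {
        cudaSetDevice(g);
        cudaStreamSynchronize(streams[g]);
        ncclCommDestroy(comms[g]);
    }
    return 0;
}
```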





Parallel Programming Patterns in CUDA

Matrix multiplication optimization

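A sketch of the tiling technique analyzed in the next subsection (assuming square matrices with width divisible by the tile size, for brevity): each block stages tiles of A and B in shared memory, so each global element is loaded once per tile pass instead of once per output element:

```c++
#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int width) {
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < width / TILE; t++) {
        // Cooperative, coalesced staging of one tile of A and B.
        sA[threadIdx.y][threadIdx.x] = A[row * width + t * TILE + threadIdx.x];
        sB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * width + col];
        __syncthreads();  // tile fully staged before use

        for (int k = 0; k < TILE; k++)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();  // all reads done before the next tile overwrites
    }
    C[row * width + col] = acc;
}
```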




Performance analysis of the tiling approach






Convolution

Convolution operation in CUDA





Optimization strategy






Prefix sum (scan)

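A block-level inclusive scan sketch in the Hillis–Steele style (assumes a single block with n ≤ blockDim.x, launched as `scan_inclusive<<<1, n, n * sizeof(float)>>>`; a global-size scan combines such block scans, as the next subsection's title suggests):

```c++
// At step `offset`, each element adds the value `offset` positions to
// its left: O(n log n) work but only log n steps.
__global__ void scan_inclusive(const float *in, float *out, int n) {
    extern __shared__ float tmp[];
    int tid = threadIdx.x;
    tmp[tid] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {
        // Read the neighbor before anyone overwrites it, then sync.
        float val = (tid >= offset) ? tmp[tid - offset] : 0.0f;
        __syncthreads();
        tmp[tid] += val;
        __syncthreads();
    }
    if (tid < n)
        out[tid] = tmp[tid];
}
```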




Building a global size scan






Compact and split






N-body

Implementing an N-body simulation on GPU






Histogram calculation

Understanding a parallel histogram

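A privatized-histogram sketch (256 byte-valued bins; `bins` assumed zero-initialized): each block accumulates into a shared-memory copy and flushes once, cutting contention on global atomics:

```c++
#define NUM_BINS 256

__global__ void histogram_shared(const unsigned char *data, int n,
                                 unsigned int *bins) {
    __shared__ unsigned int local[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    // Grid-stride loop; shared-memory atomics are much cheaper than
    // global ones under heavy contention.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    // One global atomic per bin per block to merge the private copies.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);
}
```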





Quicksort and CUDA dynamic parallelism

Quicksort in CUDA using dynamic parallelism






Radix sort






Programming with Libraries and Other Languages

Linear algebra operation using cuBLAS

| Level | cuBLAS operation |
|---|---|
| Level 1 | vector-vector |
| Level 2 | matrix-vector |
| Level 3 | matrix-matrix |
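
A minimal Level-3 sketch with `cublasSgemm` (tiny illustrative matrices; link with `-lcublas`; note that cuBLAS assumes column-major storage, as in Fortran BLAS):

```c++
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 4;  // small square matrices for illustration
    float *A, *B, *C;
    cudaMallocManaged(&A, n * n * sizeof(float));
    cudaMallocManaged(&B, n * n * sizeof(float));
    cudaMallocManaged(&C, n * n * sizeof(float));
    for (int i = 0; i < n * n; i++) { A[i] = 1.0f; B[i] = 2.0f; }

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Level-3 operation: C = alpha * A * B + beta * C.
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();

    printf("C[0] = %.1f\n", C[0]);  // expect 8.0 for these inputs
    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```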









GPU Programming Using OpenACC

OpenACC directives

Parallel and loop directives





Data directive






Asynchronous programming in OpenACC





Applying the unstructured data and async directives to merge image code






Additional important directives and clauses

Gang/vector/worker






Deep Learning Acceleration with CUDA





List of references followed by this article

  • Jaegeun Han and Bharatkumar Sharma, *Learn CUDA Programming: A beginner's guide to GPU programming and parallel computing with CUDA 10.x and C/C++*, Packt Publishing, 2019
