PL00, Parallel computing
List of posts to read before reading this article
Contents
- Introduction to CUDA Programming
- CUDA Memory Management
- CUDA Thread Programming
- Kernel Execution Model and Optimization Strategies
    - Kernel execution with CUDA streams
    - Pipelining the GPU execution
    - The CUDA callback function
    - CUDA streams with priority
    - Kernel execution time estimation using CUDA events
    - CUDA dynamic parallelism
    - Grid-level cooperative groups
    - CUDA kernel calls with OpenMP
    - Multi-Process Service
    - Kernel execution overhead comparison
- CUDA Application Profiling and Debugging
- Scalable Multi-GPU Programming
- Parallel Programming Patterns in CUDA
- Programming with Libraries and Other Languages
- GPU Programming Using OpenACC
- Deep Learning Acceleration with CUDA
Introduction to CUDA Programming
| Hardware | Bus standard | 
|---|---|
| CPU | PCIe | 
| GPU | NVLink | 
The history of high-performance computing

Heterogeneous computing

Low latency versus higher throughput

GPU architecture

| Hardware | Software | 
|---|---|
| CUDA core/SIMD lane | CUDA thread | 
| Streaming multiprocessor | CUDA block | 
| GPU device | GRID/kernel | 
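To make this mapping concrete, here is a minimal sketch (not taken from the book; the kernel name `scale` and the launch configuration are illustrative assumptions): each CUDA thread handles one array element and executes on a CUDA core, the threads of a block are scheduled together on a streaming multiprocessor, and all blocks of the launch form the grid executed by the GPU device.

```cuda
// scale.cu: one thread per element; blocks run on SMs; all blocks form the grid.
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
    // Global index = block offset + thread offset within the block.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    float *data;
    cudaMalloc(&data, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));

    int block = 256;                      // threads per block (CUDA block)
    int grid  = (n + block - 1) / block;  // number of blocks (the grid)
    scale<<<grid, block>>>(data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```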
CUDA Memory Management
 

Global memory/device memory
Coalesced versus uncoalesced global memory access
 
 
 
 
 

Shared memory
Bank conflicts and their effect on shared memory
 
 
 

Read-only data/cache

Registers in GPU

Pinned memory
Unified memory

Understanding unified memory page allocation and transfer
- First, new pages are allocated on the GPU and the CPU (on a first-touch basis). If the requested page is not present on the device and is mapped to the other processor's memory, a device page-table fault occurs. When *x, which resides in page 2 and is currently mapped to CPU memory, is accessed on the GPU, a page fault is raised. Take a look at the following diagram:
- In the next step, the old page on the CPU is unmapped, as shown in the following diagram:
- Next, the data is copied from the CPU to the GPU, as shown in the following diagram:
- Finally, the new pages are mapped on the GPU, while the old pages are freed on the CPU, as shown in the following diagram. A minimal managed-memory code sketch follows this list.
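The same sequence can be reproduced with a few lines of managed-memory code. This is a minimal sketch, assuming a Pascal or later GPU where on-demand page faulting is available; the variable names are illustrative.

```cuda
// um_migration.cu: unified memory page allocation and on-demand migration.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *x)
{
    // First device access faults, unmaps the CPU page, copies it to the GPU,
    // and maps it into the GPU page table (the steps pictured above).
    *x += 1;
}

int main(void)
{
    int *x;
    cudaMallocManaged(&x, sizeof(int));  // single managed allocation

    *x = 41;                  // first touch on the CPU: the page lives in host memory

    increment<<<1, 1>>>(x);   // GPU access raises a device page fault and migrates the page
    cudaDeviceSynchronize();  // wait before the CPU touches the page again

    printf("%d\n", *x);       // CPU access migrates the page back on demand
    cudaFree(x);
    return 0;
}
```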
CUDA Thread Programming
CUDA threads, blocks, and the GPU
 
 

Understanding parallel reduction

Naive parallel reduction using global memory

Reducing kernels using shared memory

Minimizing the CUDA warp divergence effect

Determining divergence as a performance bottleneck
Interleaved addressing

Sequential addressing

Performance modeling and balancing the limiter
The Roofline model

Warp-level primitive programming
 
 

Parallel reduction with warp primitives
 

Cooperative Groups for flexible thread handling
Cooperative Groups in a CUDA thread block

Benefits of Cooperative Groups
Modularity

Atomic operations

Low/mixed precision operations
Dot product operations and accumulation for 8-bit integers and 16-bit data (DP4A and DP2A)

Kernel Execution Model and Optimization Strategies
Kernel execution with CUDA streams
The usage of CUDA streams
 

Stream-level synchronization

Working with the default stream

Pipelining the GPU execution
Concept of GPU pipelining
 
 

Building a pipelining execution
 

The CUDA callback function

CUDA streams with priority
Stream execution with priorities

Kernel execution time estimation using CUDA events
Using CUDA events

CUDA dynamic parallelism
Usage of dynamic parallelism

Grid-level cooperative groups
Understanding grid-level cooperative groups

CUDA kernel calls with OpenMP
CUDA kernel calls with OpenMP

Multi-Process Service

Enabling MPS

Profiling an MPI application and understanding MPS operation
 
 

Kernel execution overhead comparison
Comparison of three executions

CUDA Application Profiling and Debugging
Scalable Multi-GPU Programming
Solving a linear equation using Gaussian elimination
Single GPU hotspot analysis of Gaussian elimination

GPUDirect peer to peer
 
 
 

Single node – multi-GPU Gaussian elimination

GPUDirect RDMA
 

CUDA-aware MPI

Multinode – multi-GPU Gaussian elimination

CUDA streams
Application 1 – using multiple streams to overlap data transfers with kernel execution
 

Application 2 – using multiple streams to run kernels on multiple devices

Additional tricks
Collective communication acceleration using NCCL
 

Parallel Programming Patterns in CUDA
Matrix multiplication optimization
 

Performance analysis of the tiling approach

Convolution
Convolution operation in CUDA

Optimization strategy
 

Prefix sum (scan)
 

Building a global size scan

Compact and split
 
 
 

N-body
Implementing an N-body simulation on GPU

Histogram calculation
Understanding a parallel histogram

Quicksort and CUDA dynamic parallelism
Quicksort in CUDA using dynamic parallelism
 

Radix sort
Programming with Libraries and Other Languages
Linear algebra operations using cuBLAS
| Level in cuBLAS | Operation | 
|---|---|
| Level 1 | vector-vector | 
| Level 2 | matrix-vector | 
| Level 3 | matrix-matrix | 
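As a concrete illustration of a level-3 (matrix-matrix) call, here is a minimal cuBLAS SGEMM sketch computing C = alpha*A*B + beta*C; the matrix size and values are arbitrary assumptions, and error checking is omitted for brevity.

```cuda
// sgemm_sketch.cu: level-3 cuBLAS call, C = alpha * A * B + beta * C.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main(void)
{
    const int n = 4;  // small square matrices for brevity
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS uses column-major storage; the leading dimension is n for all matrices.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```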
GPU Programming Using OpenACC
OpenACC directives
Parallel and loop directives
 

Data directive
 

Asynchronous programming in OpenACC

Applying the unstructured data and async directives to merge image code

Additional important directives and clauses
Gang/vector/worker

Deep Learning Acceleration with CUDA
List of posts followed by this article
- Jaegeun Han, Bharatkumar Sharma, *Learn CUDA Programming: A Beginner's Guide to GPU Programming and Parallel Computing with CUDA 10.x and C/C++*, Packt Publishing (2019)
Reference