HeadlinesBriefing.com

CUDA Kernel Compilation Pipeline Revealed: From PTX to GPU Execution

Hacker News

A deep dive into CUDA vector addition traces the full journey from source code to GPU execution. A simple kernel that computes c[i] = a[i] + b[i] for a million floats triggers tens of millions of CPU instructions and hundreds of device interactions before it delivers results.
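For reference, a kernel of the kind described above can be sketched as follows. This is a minimal illustration, not the code from the original write-up: the names `vector_add` and `run_vector_add`, the block size of 256, and the input values are all assumptions, and `n` is assumed to be positive.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread adds one pair of elements; the bounds check covers the
// last block when n is not a multiple of the block size.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Hypothetical host helper: runs the kernel on n elements (n > 0) and
// returns true if every GPU result equals the sum computed on the CPU.
bool run_vector_add(int n) {
    std::vector<float> ha(n), hb(n), hc(n);
    for (int i = 0; i < n; ++i) {
        ha[i] = float(i);
        hb[i] = 2.0f * float(i);
    }

    size_t bytes = size_t(n) * sizeof(float);
    float *da = nullptr, *db = nullptr, *dc = nullptr;

    // Allocate device buffers and copy the inputs over, stopping at the first error.
    cudaError_t err = cudaMalloc(&da, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&db, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dc, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        int threads = 256;                         // assumed block size
        int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
        vector_add<<<blocks, threads>>>(da, db, dc, n);
        err = cudaGetLastError();
    }

    // This blocking copy also waits for the kernel to finish.
    if (err == cudaSuccess) err = cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost);

    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (hc[i] != ha[i] + hb[i]) return false;
    }
    return true;
}
```

Calling `run_vector_add(1 << 20)` from a `main` function covers the million-float case. Every runtime call in the helper is one of the host-side steps that contribute to the CPU instruction count mentioned above.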

The nvcc compiler driver orchestrates multiple stages: host code goes to the system compiler, while device code passes through cicc (LLVM-based) to generate PTX, and ptxas then translates the PTX into SASS for a specific GPU architecture. PTX serves as a virtual ISA with an unlimited supply of virtual registers, while SASS contains the actual machine instructions for the target hardware, here an RTX 4090.
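One way to look at each stage yourself is sketched below. The file name `vecadd_kernel.cu` is hypothetical, and the commands in the comments assume the CUDA toolkit is on the PATH and that the target is an RTX 4090 (compute capability 8.9, hence `sm_89`).

```cuda
// vecadd_kernel.cu (hypothetical file name)
//
// Commands for inspecting the stages, under the assumptions stated above:
//
//   nvcc --dryrun -arch=sm_89 -c vecadd_kernel.cu
//       Print the sub-commands nvcc would run, without running them.
//   nvcc --keep -arch=sm_89 -c vecadd_kernel.cu
//       Compile normally but keep the intermediate files.
//   nvcc -arch=sm_89 -ptx vecadd_kernel.cu -o vecadd_kernel.ptx
//       Stop after PTX generation.
//   nvcc -arch=sm_89 -cubin vecadd_kernel.cu -o vecadd_kernel.cubin
//       Continue through ptxas to a device binary containing SASS.
//   cuobjdump -sass vecadd_kernel.cubin
//       Disassemble the SASS for reading.

// extern "C" keeps the kernel's symbol name unmangled, which makes it
// easier to find in the PTX and SASS listings.
extern "C" __global__ void vector_add(const float* a, const float* b,
                                      float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
```

In the PTX listing, registers appear as numbered virtual names declared per kernel. The SASS listing uses the fixed physical register names of the target architecture.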

The compilation reveals interesting optimizations. Three PTX instructions for address calculation collapse into a single IMAD.WIDE operation in SASS. Special registers such as SR_CTAID.X and SR_TID.X are read with S2R instructions and feed into the index arithmetic, while kernel arguments are packed into constant memory, which broadcasts efficiently to all threads.
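To make the special registers and the address arithmetic visible at the source level, the kernel can be rewritten with hand-written inline PTX. This is an illustrative sketch only: the PTX that nvcc generates for the plain kernel may differ, and the names `global_index_ptx`, `byte_offset_ptx`, `vector_add_ptx`, and `run_vector_add_ptx` are hypothetical.

```cuda
#include <cuda_runtime.h>

// Global thread index built from the PTX special registers %ctaid.x,
// %ntid.x and %tid.x (blockIdx.x, blockDim.x and threadIdx.x in CUDA C++).
__device__ unsigned global_index_ptx() {
    unsigned ctaid, ntid, tid, idx;
    asm("mov.u32 %0, %%ctaid.x;" : "=r"(ctaid));
    asm("mov.u32 %0, %%ntid.x;"  : "=r"(ntid));
    asm("mov.u32 %0, %%tid.x;"   : "=r"(tid));
    // idx = ctaid * ntid + tid (low 32 bits of the product).
    asm("mad.lo.u32 %0, %1, %2, %3;" : "=r"(idx) : "r"(ctaid), "r"(ntid), "r"(tid));
    return idx;
}

// Byte offset of element i in a float array: mul.wide.s32 widens the
// 32-bit index to 64 bits and scales it by sizeof(float) in one step.
__device__ long long byte_offset_ptx(int i) {
    long long off;
    asm("mul.wide.s32 %0, %1, 4;" : "=l"(off) : "r"(i));
    return off;
}

__global__ void vector_add_ptx(const float* a, const float* b, float* c, int n) {
    int i = (int)global_index_ptx();
    if (i < n) {
        long long off = byte_offset_ptx(i);
        // base pointer + byte offset, written out explicitly
        const float* pa = (const float*)((const char*)a + off);
        const float* pb = (const float*)((const char*)b + off);
        float* pc = (float*)((char*)c + off);
        *pc = *pa + *pb;
    }
}

// Hypothetical host helper using managed memory to keep the sketch short.
// Assumes n > 0; returns true if every result matches the CPU-side sum.
bool run_vector_add_ptx(int n) {
    float* buf = nullptr;
    if (cudaMallocManaged(&buf, 3 * size_t(n) * sizeof(float)) != cudaSuccess) {
        return false;
    }
    float* a = buf;
    float* b = buf + n;
    float* c = buf + 2 * size_t(n);
    for (int i = 0; i < n; ++i) {
        a[i] = float(i);
        b[i] = 2.0f * float(i);
        c[i] = 0.0f;
    }

    vector_add_ptx<<<(n + 255) / 256, 256>>>(a, b, c, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < n; ++i) {
        ok = (c[i] == a[i] + b[i]);
    }
    cudaFree(buf);
    return ok;
}
```

Disassembling this kernel and the plain version with `cuobjdump -sass` lets you compare how the widening multiply and the base-pointer add end up in the SASS.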

The fatbinary format bundles both SASS and PTX, enabling forward compatibility. When the pre-compiled SASS doesn't match available hardware, the driver JIT-compiles the PTX into fresh machine code, ensuring the same executable runs across different NVIDIA GPU generations.
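The bundling can be sketched with a build line and a small runtime query. The `-gencode` options in the comment are standard nvcc syntax, but the choice of architecture 8.9, the file name `vecadd.cu`, and the helper name `report_kernel_image` are assumptions for illustration.

```cuda
// Build line that embeds both SASS and PTX in the fat binary
// (vecadd.cu is a hypothetical file name):
//
//   nvcc -gencode arch=compute_89,code=sm_89 \
//        -gencode arch=compute_89,code=compute_89 \
//        -c vecadd.cu
//
// code=sm_89 embeds machine code for that architecture;
// code=compute_89 embeds the PTX for the driver to JIT-compile.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Hypothetical helper: prints the compute capability of device 0 and the
// versions the runtime reports for the loaded kernel image.
// Returns false if either query fails.
bool report_kernel_image() {
    cudaDeviceProp prop;
    cudaFuncAttributes attr;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;
    if (cudaFuncGetAttributes(&attr, vector_add) != cudaSuccess) return false;

    std::printf("device: %s (compute capability %d.%d)\n",
                prop.name, prop.major, prop.minor);
    // Both fields are encoded as major * 10 + minor, e.g. 89 for 8.9.
    std::printf("kernel: binaryVersion=%d ptxVersion=%d\n",
                attr.binaryVersion, attr.ptxVersion);
    return true;
}
```

If the binary carries only PTX for an older virtual architecture, `ptxVersion` should report that older value while the kernel still runs on a newer GPU, which is the JIT path described above.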