HeadlinesBriefing.com

CODA Turns Transformer Operators Into GEMM Kernels for Faster Training

Hacker News

Transformer training is bottlenecked by memory-bound operators such as normalization, residual updates, and activations, which move large tensors through global memory while doing little arithmetic. Researchers introduced CODA, a GPU kernel abstraction that rewrites these computations as GEMM-plus-epilogue programs: the extra work runs while GEMM output tiles are still on chip, before they are written to global memory.
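To make the data-movement argument concrete, here is a minimal pure-Python sketch of a residual add done as a separate pass versus inside a GEMM epilogue. It is a toy model, not CODA's interface: the function names (`gemm_tile`, `tiles`, `gemm_then_residual`, `gemm_fused_residual`) and the `traffic` counter are hypothetical, and the counter only tallies element reads and writes of output-sized tensors as a rough stand-in for global-memory traffic.

```python
# Toy model only: plain Python standing in for a GPU kernel. Not CODA's API.

def gemm_tile(A, B, rows, cols):
    """Compute one output tile of A @ B, keyed by (row, col)."""
    K = len(B)
    return {(i, j): sum(A[i][k] * B[k][j] for k in range(K))
            for i in rows for j in cols}

def tiles(M, N, tile):
    """Yield the row and column ranges of each output tile."""
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            yield range(i0, min(i0 + tile, M)), range(j0, min(j0 + tile, N))

def gemm_then_residual(A, B, R, tile=2):
    """Unfused: the GEMM writes C, then a second pass rereads C to add R."""
    M, N = len(A), len(B[0])
    C = [[0] * N for _ in range(M)]
    traffic = 0  # reads/writes of output-sized tensors (A and B reads excluded)
    for rows, cols in tiles(M, N, tile):
        for (i, j), v in gemm_tile(A, B, rows, cols).items():
            C[i][j] = v
            traffic += 1          # write C
    out = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            out[i][j] = C[i][j] + R[i][j]
            traffic += 3          # read C, read R, write out
    return out, traffic

def gemm_fused_residual(A, B, R, tile=2):
    """Fused: the residual is added while the tile is still 'on chip'."""
    M, N = len(A), len(B[0])
    out = [[0] * N for _ in range(M)]
    traffic = 0
    for rows, cols in tiles(M, N, tile):
        for (i, j), v in gemm_tile(A, B, rows, cols).items():
            out[i][j] = v + R[i][j]   # epilogue runs before the only write
            traffic += 2              # read R, write out
    return out, traffic
```

Both functions return the same matrix. For a 4×4 output, this toy counter reports 64 element accesses for the unfused version and 32 for the fused one, because the intermediate `C` is never written out and reread.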

CODA keeps the GEMM mainloop fixed and exposes a small set of composable epilogue primitives covering scaling, reductions, pairwise transformations, and accumulation. This constrained interface preserves the performance of expert-written GEMMs while expressing nearly all non-attention computation in the forward and backward passes of standard Transformer blocks.
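The idea of a fixed mainloop with composable epilogue steps can be sketched in a few lines of plain Python. This is an illustration under assumptions, not CODA's actual primitives: `scale`, `add_tensor`, `row_sum_squares`, `compose`, and `gemm_with_epilogue` are hypothetical names, and only three of the four primitive categories mentioned above (scaling, pairwise transformation, reduction) are shown.

```python
# Illustrative sketch: hypothetical epilogue primitives, not CODA's real API.

def scale(alpha):
    """Scaling step: multiply every element of the tile by alpha."""
    def step(tile, origin):
        return [[alpha * v for v in row] for row in tile]
    return step

def add_tensor(T):
    """Pairwise step: add the matching region of another tensor T."""
    def step(tile, origin):
        i0, j0 = origin
        return [[v + T[i0 + di][j0 + dj] for dj, v in enumerate(row)]
                for di, row in enumerate(tile)]
    return step

def row_sum_squares(buf):
    """Reduction step: accumulate per-row sums of squares into buf."""
    def step(tile, origin):
        i0, _ = origin
        for di, row in enumerate(tile):
            buf[i0 + di] += sum(v * v for v in row)
        return tile  # the tile itself passes through unchanged
    return step

def compose(*steps):
    """Chain epilogue steps; each one sees the previous step's tile."""
    def epilogue(tile, origin):
        for s in steps:
            tile = s(tile, origin)
        return tile
    return epilogue

def gemm_with_epilogue(A, B, epilogue, tile=2):
    """Fixed mainloop: only the epilogue varies between kernels."""
    M, K, N = len(A), len(B), len(B[0])
    out = [[0] * N for _ in range(M)]
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            rows = range(i0, min(i0 + tile, M))
            cols = range(j0, min(j0 + tile, N))
            acc = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in cols]
                   for i in rows]
            acc = epilogue(acc, (i0, j0))  # runs before the output write
            for di, i in enumerate(rows):
                for dj, j in enumerate(cols):
                    out[i][j] = acc[di][dj]
    return out
```

For example, with `A = [[1, 2], [3, 4]]`, `B` the 2×2 identity, `R` a 2×2 matrix of ones, and `sumsq = [0, 0]`, calling `gemm_with_epilogue(A, B, compose(scale(2), add_tensor(R), row_sum_squares(sumsq)))` returns `[[3, 5], [7, 9]]` and leaves `sumsq` as `[34, 130]`. The reduction buffer is the kind of side output a later normalization step could consume.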

Both human- and LLM-authored CODA kernels achieve high performance across representative Transformer workloads. The result suggests that GEMM-epilogue programming can combine framework-level productivity with hardware-level efficiency.