HeadlinesBriefing.com

GPU-Resident Top-K CUDA Kernel Eliminates PCIe Bottleneck in Agentic RAG Pipelines

Towards Data Science

Anubhab Banerjee built a CUDA kernel that keeps vector similarity search resident in GPU memory, removing the PCIe transfer bottleneck that slows agentic RAG systems. The 343-line implementation runs the retrieval step without CPU involvement and achieves up to an 8.6x speedup on a GTX 1080 over traditional CPU-based approaches.

Standard RAG pipelines copy query embeddings from the GPU to host-side Python, compute dot products against millions of corpus rows on the CPU, then return the top-K results. This round trip adds avoidable latency when agents make multiple retrieval calls per reasoning step. Banerjee's approach uploads the corpus to VRAM once and keeps scoring and selection entirely on-device.
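To make the baseline concrete, here is a minimal pure-Python sketch of the CPU-side retrieval step described above: one dot product per corpus row, then a top-K selection. The function name `cpu_topk` and the toy data are assumptions made for illustration; they do not come from the article, and a real baseline would most likely use a vectorized library rather than plain Python loops.

```python
import heapq

def cpu_topk(query, corpus, k):
    """Hypothetical CPU baseline: score every row, keep the k best.

    Returns a list of (score, row_index) pairs, highest score first.
    """
    scored = (
        (sum(q * x for q, x in zip(query, row)), idx)
        for idx, row in enumerate(corpus)
    )
    # heapq.nlargest only tracks k candidates while scanning all rows.
    return heapq.nlargest(k, scored)

# Toy corpus of four 2-dimensional "embeddings" (illustrative values only).
corpus = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
    [-1.0, 0.0],
]
query = [1.0, 0.25]

top2 = cpu_topk(query, corpus, 2)
print([idx for _score, idx in top2])
```

In the pipeline the article describes, every call to a function like this would be preceded by copying the query off the GPU, which is the per-call cost the GPU-resident design is meant to avoid.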

Benchmarks show consistent wins across 45 configurations, with corpus sizes N ranging from 10k to 1M rows. At K=8, the GPU-resident approach outperforms the CPU baselines by 2.43x to 8.57x and does best on larger corpora. The implementation consists of three kernels: row_dot_scores_kernel, partial_topk_block_kernel, and merge_partial_topk_kernel.
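The three kernel names suggest a score, per-block select, then merge decomposition. The sketch below is a CPU reference model of that structure in plain Python, not the author's CUDA code: the function names echo the kernel names from the article, but their bodies, the `block_size` parameter, and the `staged_topk` driver are assumptions inferred from the names alone.

```python
import heapq

def row_dot_scores(query, corpus):
    """Stage 1 (assumed): one dot product per corpus row."""
    return [sum(q * x for q, x in zip(query, row)) for row in corpus]

def partial_topk_block(scores, start, stop, k):
    """Stage 2 (assumed): top-k (score, row) pairs inside one block of rows."""
    return heapq.nlargest(k, ((scores[i], i) for i in range(start, stop)))

def merge_partial_topk(partials, k):
    """Stage 3 (assumed): reduce all per-block candidates to a global top-k."""
    return heapq.nlargest(k, (cand for block in partials for cand in block))

def staged_topk(query, corpus, k, block_size):
    """Hypothetical driver that chains the three stages on the CPU."""
    scores = row_dot_scores(query, corpus)
    partials = [
        partial_topk_block(scores, s, min(s + block_size, len(scores)), k)
        for s in range(0, len(scores), block_size)
    ]
    return merge_partial_topk(partials, k)

# Small integer-valued corpus so every score is exact (illustrative data).
corpus = [[(i * 7) % 11 - 5, (i * 3) % 5 - 2] for i in range(20)]
query = [2, 1]

print(staged_topk(query, corpus, k=3, block_size=4))
```

The decomposition is exact because any row in the global top-K must also be in the top-K of its own block, so merging the per-block winners loses nothing. On a GPU the per-block stage would presumably run in parallel across thread blocks, which is the usual motivation for splitting selection this way.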

This work builds on Banerjee's Production-Grade Agentic Inference series, which previously addressed redundant prefill and multi-agent GPU sharing. Drawing on 5G RAN engineering experience, where beam selection poses a similar problem, the author shows that keeping retrieval on-device reframes what looked like a compute problem as a data-movement problem.