HeadlinesBriefing.com

RTX 3090 Runs Llama 70B via NVMe Direct Streaming

Hacker News

A developer has released NTransformer, an open-source C++/CUDA inference engine that runs Llama 3.1 70B on a single consumer-grade RTX 3090 GPU. The project does this with a 3-tier adaptive caching system that places model layers across VRAM, system RAM, and NVMe storage and streams them to the GPU as needed. Layers read from NVMe go directly to GPU memory, bypassing the CPU.
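To make the tiering idea concrete, here is a minimal sketch of how layers might be assigned to the three tiers. It is an illustration only, not NTransformer's actual placement logic: the greedy first-fit policy, the `place_layers` helper, the `Placement` type, and the budget figures are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Placement:
    """Layer indices assigned to each tier (hypothetical structure)."""
    vram: list = field(default_factory=list)
    ram: list = field(default_factory=list)
    nvme: list = field(default_factory=list)


def place_layers(layer_sizes, vram_budget, ram_budget):
    """Greedy first-fit: fill VRAM first, then RAM, and leave the rest on NVMe.

    Sizes and budgets only need to share a unit (bytes, MB, ...).
    """
    placement = Placement()
    vram_left, ram_left = vram_budget, ram_budget
    for index, size in enumerate(layer_sizes):
        if size <= vram_left:
            placement.vram.append(index)
            vram_left -= size
        elif size <= ram_left:
            placement.ram.append(index)
            ram_left -= size
        else:
            # Nothing faster has room, so this layer is streamed from NVMe.
            placement.nvme.append(index)
    return placement


# Llama 3.1 70B has 80 transformer layers; the per-layer size and the
# budgets below are made-up units chosen to keep the arithmetic simple.
layers = [500] * 80
p = place_layers(layers, vram_budget=20_000, ram_budget=15_000)
print(len(p.vram), len(p.ram), len(p.nvme))  # prints: 40 30 10
```

An adaptive system would presumably revise this assignment at run time; the sketch only shows a static split.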

The engine generates 0.2 tokens per second on the 70B model while using 23.1 GB of VRAM, a 33x speedup over a conventional memory-mapped approach. The project implements its own suite of CUDA kernels and drops external dependencies such as PyTorch to make fuller use of the hardware, while remaining compatible with several quantization formats.
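For a sense of scale, the two reported figures (0.2 tokens per second and a 33x speedup) can be turned into per-token latencies. The short calculation below uses only those two numbers; the implied baseline is derived from them, not separately reported, and the helper name `seconds_per_token` is introduced just for this example.

```python
def seconds_per_token(tokens_per_second):
    """Convert a throughput figure into latency per generated token."""
    return 1.0 / tokens_per_second


reported_tps = 0.2      # from the article
reported_speedup = 33   # from the article

# Throughput the memory-mapped baseline would need for the 33x claim to hold.
implied_baseline_tps = reported_tps / reported_speedup

print(f"streaming: {seconds_per_token(reported_tps):.0f} s per token")
print(f"baseline:  {seconds_per_token(implied_baseline_tps):.0f} s per token")
```

In other words, roughly 5 seconds per token with streaming versus about 165 seconds per token for the implied baseline.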

The project makes a 70B-parameter model runnable on hardware that previously fell short of it, where an expensive multi-GPU setup was the usual requirement. NVMe direct I/O streams model weights straight into GPU memory, though throughput remains limited by PCIe bandwidth. The result presents consumer hardware as a workable, if slow, option for running large models without enterprise-level investment.
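The PCIe limit can be reasoned about with a simple bound: if some volume of weights has to cross the link for every generated token, throughput cannot exceed link bandwidth divided by that volume. The sketch below encodes that bound. The function name and both example inputs are illustrative assumptions, not measurements from the project.

```python
GB = 10**9  # decimal gigabyte, in bytes


def max_tokens_per_second(streamed_bytes_per_token, link_bytes_per_second):
    """Upper bound on throughput when weights must cross a link for each token."""
    if streamed_bytes_per_token <= 0:
        raise ValueError("streamed_bytes_per_token must be positive")
    return link_bytes_per_second / streamed_bytes_per_token


# Illustrative inputs only: 20 GB streamed per token over a 5 GB/s link.
bound = max_tokens_per_second(20 * GB, 5 * GB)
print(bound)  # prints: 0.25
```

The bound also shows why caching more layers in VRAM or RAM helps: every layer that stays resident reduces the bytes streamed per token.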