HeadlinesBriefing.com

Dual RTX 5080 + 3090 Setup Hits 80+ Tokens/Second on Qwen 3.6

Hacker News

A developer paired an RTX 5080 with a refurbished RTX 3090 for local LLM inference and reached 80-90 tokens per second on Qwen 3.6 27B at Q8 quantization. The second card addresses a VRAM shortfall: 16GB alone was not enough for larger quantizations of the model.
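The VRAM argument can be checked with rough arithmetic. The sketch below is an estimate under stated assumptions, not the author's calculation: it assumes the Q8 build uses llama.cpp's Q8_0 layout (32 int8 weights plus one fp16 scale per block, so 8.5 bits per weight), takes 27B as the nominal parameter count, and assumes the stock 16GB (RTX 5080) and 24GB (RTX 3090) memory sizes. It ignores the KV cache, activations, and runtime overhead, all of which add to the total.

```python
# Q8_0 stores weights in blocks of 32: 32 int8 values plus one fp16 scale = 34 bytes.
Q8_0_BITS_PER_WEIGHT = 34 * 8 / 32  # 8.5 bits per weight

# Assumptions, not figures from the article:
N_PARAMS = 27e9          # nominal parameter count for a "27B" model
VRAM_5080_GB = 16.0      # stock RTX 5080 memory
VRAM_3090_GB = 24.0      # stock RTX 3090 memory


def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(required_gb: float, available_gb: float) -> bool:
    """True if the weights alone fit; real usage needs extra headroom."""
    return required_gb <= available_gb


weights_gb = weight_size_gb(N_PARAMS, Q8_0_BITS_PER_WEIGHT)
fits_on_5080 = fits(weights_gb, VRAM_5080_GB)
fits_on_both = fits(weights_gb, VRAM_5080_GB + VRAM_3090_GB)

print(f"weights: ~{weights_gb:.1f} GB")
print(f"fits on the 5080 alone: {fits_on_5080}")
print(f"fits across both cards: {fits_on_both}")
```

Under these assumptions the weights alone exceed a single 16GB card but leave headroom in the combined 40GB, which is consistent with the motivation for adding the second GPU.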

The Asus Prime X570-Pro motherboard supports splitting the PCIe x16 link into two x8 connections, which is what allows both GPUs to run at the same time. In the BIOS, CSM must be disabled and Above 4G Decoding enabled together with Re-Size BAR Support. The author also found that Nvidia's driver documentation is scattered across Tesla URLs, which complicated installation.
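To put the x16-to-x8 split in perspective, here is a small sketch of the theoretical per-direction bandwidth of each link. It assumes both cards negotiate PCIe 4.0 (16 GT/s per lane with 128b/130b encoding); the summary does not state the link generation, so treat that as an assumption. The numbers are theoretical ceilings, not measurements from this build.

```python
def pcie_bandwidth_gb_s(gt_per_s: float, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s for PCIe 3.0 and later.

    Each lane carries gt_per_s gigatransfers (bits) per second, and
    128b/130b encoding leaves 128 payload bits out of every 130.
    """
    return gt_per_s * (128 / 130) * lanes / 8


PCIE4_GT_PER_S = 16.0  # assumption: links run at PCIe 4.0 speed

x16_gb_s = pcie_bandwidth_gb_s(PCIE4_GT_PER_S, 16)
x8_gb_s = pcie_bandwidth_gb_s(PCIE4_GT_PER_S, 8)

print(f"x16 link: ~{x16_gb_s:.2f} GB/s per direction")
print(f"x8 link:  ~{x8_gb_s:.2f} GB/s per direction")
```

Each card gets half the bandwidth of a full x16 link. Whether that matters in practice depends on how much data the two GPUs exchange per token, which the summary does not quantify.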

For software, llama.cpp had to be built with explicit CUDA architecture flags (86 for Ampere, 120 for Blackwell), and NCCL was disabled for best performance. The model runs with tensor parallelism across both cards and uses speculative decoding via MTP (multi-token prediction). Command-line flags include ngram-mod,draft-mtp for acceleration, along with settings for VRAM allocation.
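The build step can be sketched as a small helper that assembles the CMake configure command. This is a minimal illustration, not the author's exact invocation: `GGML_CUDA` and `CMAKE_CUDA_ARCHITECTURES` are the options llama.cpp's build documentation describes for CUDA builds, while the helper name, the dictionary keys, and the `build` directory are hypothetical. The NCCL switch is deliberately left out because the summary does not give its option name.

```python
import shlex

# Compute capabilities named in the article; the dictionary keys are illustrative labels.
CUDA_ARCH = {
    "rtx3090_ampere": "86",
    "rtx5080_blackwell": "120",
}


def cmake_configure_args(archs, build_dir="build"):
    """Return a cmake configure command targeting every listed CUDA architecture."""
    # De-duplicate and sort numerically so the output is stable.
    arch_list = ";".join(sorted(set(archs), key=int))
    return [
        "cmake",
        "-B", build_dir,
        "-DGGML_CUDA=ON",
        f"-DCMAKE_CUDA_ARCHITECTURES={arch_list}",
    ]


configure_args = cmake_configure_args(CUDA_ARCH.values())

# shlex.join quotes the argument containing ';' so it can be pasted into a shell.
print(shlex.join(configure_args))
```

Listing both architectures in one build produces kernels for each card, so a single llama.cpp binary can drive the Ampere and Blackwell GPUs together.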

Throughput stays consistently above 80 tokens/second, with peaks near 90. The configuration suggests that a carefully chosen hardware combination can match cloud performance for local inference, though mixing GPU generations limits some advanced features, such as open-gpu-kernel-modules compatibility.