HeadlinesBriefing.com

Lucebox releases hand‑tuned LLM kernels for RTX 3090

Hacker News

Lucebox has opened its code hub, delivering hand‑tuned inference kernels for large language models on a single consumer GPU. The first release packs the 0.8‑billion‑parameter Qwen 3.5 model into a megakernel that runs all 24 layers in one CUDA launch, achieving 1.87 tokens per joule on an RTX 3090, roughly double the efficiency of Apple's latest silicon.
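The tokens-per-joule figure is straightforward to sanity-check: energy efficiency is throughput divided by power draw, since one watt is one joule per second. A minimal sketch, using hypothetical wattage and throughput numbers (not measured Lucebox figures):

```python
def tokens_per_joule(tokens_per_second, watts):
    # 1 watt = 1 joule/second, so (tok/s) / (J/s) yields tokens per joule.
    return tokens_per_second / watts

# Illustrative: a board drawing 350 W while sustaining 654.5 tok/s
# lands exactly on the quoted 1.87 tok/J.
print(round(tokens_per_joule(654.5, 350.0), 2))  # → 1.87
```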

A second project targets the 27‑billion‑parameter Qwen 3.5‑27B using DFlash speculative decoding and a DDTree verification structure. With a Q4_K_M GGUF file that fits in 24 GB of VRAM, the pipeline reaches 207 tokens per second in demo mode and 129.5 t/s on HumanEval, more than five times faster than conventional autoregressive decoding.
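The core idea behind speculative decoding is that a cheap draft model proposes a block of tokens and the expensive target model verifies them in one pass, accepting the longest agreeing prefix. The toy sketch below illustrates the greedy variant with stand-in deterministic "models"; it is not Lucebox's DFlash/DDTree implementation, and all functions and constants here are hypothetical:

```python
def target_next(ctx):
    # Stand-in for the large target model (deterministic toy rule).
    return (sum(ctx) * 31 + 7) % 50

def draft_next(ctx):
    # Stand-in for the small draft model: agrees with the target most of
    # the time, but diverges when the context sum is divisible by 5.
    guess = target_next(ctx)
    return (guess + 1) % 50 if sum(ctx) % 5 == 0 else guess

def generate_autoregressive(prompt, n):
    # Baseline: one target-model step per generated token.
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out

def generate_speculative(prompt, n, k=4):
    # Draft k tokens at a time; the target accepts the agreeing prefix.
    out, goal = list(prompt), len(prompt) + n
    while len(out) < goal:
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        for t in proposal:
            nxt = target_next(out)
            out.append(nxt)       # the target's token is kept either way
            if t != nxt:
                break             # draft diverged: discard the rest
        else:
            out.append(target_next(out))  # all accepted: one bonus token
    return out[:goal]

# Greedy speculative decoding reproduces the target's output exactly;
# the win is fewer sequential target-model passes, not different text.
assert generate_speculative([1, 2, 3], 20) == generate_autoregressive([1, 2, 3], 20)
```

The acceptance rule is what preserves output equivalence: every token appended comes from the target's own greedy choice, so the draft model only changes how many tokens each verification pass yields.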

The repo supplies scripts for cloning, building with CUDA 12 and CMake 3.18, and streaming weights directly from HuggingFace, eliminating CPU‑GPU round‑trips. By exposing per‑chip optimizations as reproducible benchmarks, Lucebox argues that local AI can run on existing desktop hardware without vendor‑locked services. The hub now hosts two self‑contained projects and promises quarterly updates for additional GPUs and model families.
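The article does not detail how the streaming scripts work; a minimal, hypothetical sketch of the general pattern is reading a large weight blob in fixed-size chunks rather than loading it wholesale, so memory use stays bounded regardless of file size. Chunk size and the simulated blob below are illustrative, not Lucebox's actual download path:

```python
import io

def stream_chunks(fileobj, chunk_size=1 << 20):
    # Yield successive fixed-size chunks from a binary file-like object,
    # so only one chunk is resident in memory at a time.
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

# Usage: consume a simulated weight blob chunk by chunk and verify
# the reassembled bytes match the original.
blob = bytes(range(256)) * 10
parts = list(stream_chunks(io.BytesIO(blob), chunk_size=512))
assert b"".join(parts) == blob
```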

Roadmap entries list RTX 3090 kernel work through Q1 2026, followed by Ryzen AI MAX+ optimizations and heterogeneous CPU‑GPU latency reductions slated for later 2026. By publishing MIT‑licensed code and detailed benchmarks, Lucebox positions itself as a community‑driven alternative to monolithic frameworks that sacrifice hardware potential for generic support.