HeadlinesBriefing.com

Kog AI pushes GPU LLM decoding to 3,000 tokens/s

Hacker News

Kog AI unveiled a tech preview of its Kog Inference Engine, delivering 3,000 output tokens per second on a single request using eight AMD MI300X GPUs and 2,100 tokens/s on eight NVIDIA H200 cards. The engine runs a 2‑billion‑parameter model in FP16 without speculative decoding, and a live playground lets developers test the speed instantly. The benchmark targets enterprise AI labs seeking latency‑critical services.

According to the post, single‑request decoding is limited by memory bandwidth, not raw FLOPS, because each generated token requires reading the full set of weights from GPU memory. An eight‑GPU node supplies roughly 30 TB/s of effective bandwidth; streaming the model’s 4 GB of weights (2 billion parameters at 2 bytes each in FP16) once per token gives a theoretical ceiling of 7,000–8,400 tokens per second. Kog’s stack co‑designs model architecture, runtime and low‑level kernels to keep the pipeline saturated. This approach also reduces CPU‑GPU synchronization overhead.
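
The bandwidth ceiling can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming a simplified model in which each decoded token streams every weight exactly once and nothing else (KV cache reads, activations, inter‑GPU traffic) consumes bandwidth. The function and constant names are illustrative and do not come from Kog's post; the input figures are the ones quoted above.

```python
def decode_ceiling_tokens_per_s(params: float,
                                bytes_per_param: float,
                                bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-request decode speed, assuming every token
    requires streaming all model weights from memory exactly once."""
    weight_bytes = params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


# Figures quoted in the article (names are hypothetical, for illustration).
PARAMS = 2e9            # 2-billion-parameter model
BYTES_FP16 = 2          # FP16 stores each parameter in 2 bytes
EFFECTIVE_BW = 30e12    # roughly 30 TB/s across the eight-GPU node

weight_gb = PARAMS * BYTES_FP16 / 1e9
ceiling = decode_ceiling_tokens_per_s(PARAMS, BYTES_FP16, EFFECTIVE_BW)

# Fraction of the simplified ceiling reached by the reported speeds.
reported = {"8x AMD MI300X": 3_000, "8x NVIDIA H200": 2_100}
utilization = {name: rate / ceiling for name, rate in reported.items()}

print(f"weights: {weight_gb:.0f} GB, ceiling: {ceiling:,.0f} tokens/s")
for name, frac in utilization.items():
    print(f"{name}: {frac:.0%} of ceiling")
```

With exactly 30 TB/s this simplified estimate gives 7,500 tokens per second, inside the 7,000–8,400 range cited in the post. By the same estimate, the reported 3,000 and 2,100 tokens/s correspond to roughly 40% and 28% of that ceiling.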

The speed boost matters for autonomous AI agents that iterate through planning, coding and testing loops. Generating 50,000 tokens at 100 tokens/s takes more than eight minutes (500 seconds), whereas at 3,000 tokens/s it finishes in under twenty seconds, dramatically expanding feasible workflows. Such latency gains could also enable real‑time code suggestions within IDEs. Kog presents the preview as evidence that standard datacenter GPUs can match dedicated inference cards without proprietary lock‑in.
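
The agent‑loop timing follows directly from dividing token count by decode rate. The sketch below assumes a steady decode rate and ignores prompt processing, network latency and tool‑execution time; the helper name is hypothetical, and the 2,100 tokens/s row uses the H200 figure reported earlier in the article.

```python
def generation_seconds(tokens: int, tokens_per_s: float) -> float:
    """Wall-clock time to emit `tokens` at a constant decode rate."""
    return tokens / tokens_per_s


TOKENS = 50_000  # output budget used in the article's example

# Baseline rate from the article, plus the two reported Kog figures.
timings = {rate: generation_seconds(TOKENS, rate) for rate in (100, 2_100, 3_000)}

for rate, secs in timings.items():
    print(f"{rate:>5} tokens/s -> {secs:6.1f} s ({secs / 60:.1f} min)")

speedup = timings[100] / timings[3_000]
print(f"speedup at 3,000 tokens/s vs 100 tokens/s: {speedup:.0f}x")
```

Under these assumptions the 50,000‑token run drops from 500 seconds to about 17 seconds, a 30x reduction.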