HeadlinesBriefing.com

LLM Workloads: Offline, Online, and Semi-Online

Source: Hacker News front page

The era of flat per-token API pricing for LLMs is ending. Engineers must now understand three distinct workload types—offline, online, and semi-online—to properly architect systems. This shift is driven by open-source models and inference engines like vLLM and SGLang, which erode the benefits of proprietary APIs and demand deeper technical optimization.

Offline workloads, like batch data processing, prioritize throughput over latency. The challenge is maximizing output per dollar, which plays to GPUs' strength at parallel work. vLLM is recommended here for its scheduling of mixed batches: it combines prefill and decode phases across multiple requests in the same step to keep compute resources saturated.
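The mixed-batch idea can be illustrated with a toy scheduler. This is a simplified sketch of continuous batching, not vLLM's actual implementation; the `Request` fields, the token budget, and the function name are all hypothetical:

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    rid: int
    prompt_len: int   # tokens to prefill
    gen_len: int      # tokens to decode
    decoded: int = 0

def continuous_batching_steps(requests, max_batch_tokens=8):
    """Toy continuous batching: each step packs in-flight decodes and
    pending prefills into one batch under a shared token budget.
    Returns the number of engine steps taken."""
    waiting = deque(requests)
    running = []
    steps = 0
    while waiting or running:
        budget = max_batch_tokens
        # Decode phase: every running request consumes one token of budget.
        for r in running:
            budget -= 1
            r.decoded += 1
        running = [r for r in running if r.decoded < r.gen_len]
        # Prefill phase: admit waiting requests with the leftover budget
        # (force-admit one if nothing is running, so we always progress).
        while waiting and (waiting[0].prompt_len <= budget or not running):
            r = waiting.popleft()
            budget -= r.prompt_len
            running.append(r)
        steps += 1
    return steps
```

Three requests that would take eight steps run sequentially (one prefill step plus one step per generated token each) finish in three steps here, because decodes and prefills share each batch: that sharing is what lets a throughput-oriented engine saturate the GPU.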

Conversely, online workloads demand low latency for interactive, human-facing use. For these, the recommendation is SGLang with speculative decoding on Hopper/Blackwell GPUs.

Semi-online workloads, which handle bursty streams of requests from other systems, require flexible infrastructure that can autoscale rapidly, using either engine to manage the variable load each replica sees.
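For the online case, the speculative-decoding idea can be sketched in miniature. This toy uses greedy agreement between a cheap draft model and the target model, whereas real implementations (including SGLang's) use probabilistic acceptance over logits; the function names and deterministic "models" are illustrative assumptions:

```python
def speculative_decode(target, draft, prompt, n_tokens, k=4):
    """Toy speculative decoding: the draft proposes k tokens, the target
    verifies them in one batched call, and the longest agreeing prefix
    (plus the target's correction) is accepted.
    Returns (generated_tokens, target_calls)."""
    seq = list(prompt)
    target_calls = 0
    while len(seq) - len(prompt) < n_tokens:
        # Draft proposes k tokens autoregressively (assumed cheap).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # One batched target call verifies all k positions at once.
        target_calls += 1
        accepted, ctx = [], list(seq)
        for t in proposal:
            expected = target(ctx)
            if t != expected:
                accepted.append(expected)  # keep the target's token, stop
                break
            accepted.append(t)
            ctx.append(t)
        seq.extend(accepted)
    return seq[len(prompt):], target_calls
```

When the draft mostly agrees with the target, each expensive target call yields up to k tokens instead of one, which is why speculative decoding cuts per-token latency for interactive traffic.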
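For the semi-online case, a rough capacity model shows why per-replica load drives autoscaling. This is a back-of-the-envelope sizing sketch based on Little's law (in-flight requests ≈ arrival rate × service time); the function name, parameters, and headroom factor are hypothetical, not any engine's API:

```python
import math

def replicas_needed(req_per_s, avg_latency_s, concurrency_per_replica,
                    headroom=0.3):
    """Estimate replica count for a bursty stream: Little's law gives
    expected in-flight requests; divide by what one replica can hold
    concurrently, then add headroom for bursts."""
    in_flight = req_per_s * avg_latency_s
    raw = in_flight / concurrency_per_replica
    return max(1, math.ceil(raw * (1 + headroom)))
```

At 50 req/s with 2 s average latency and 16 concurrent requests per replica, this calls for 9 replicas; when the burst subsides to a trickle it drops back to 1, which is the rapid scale-up/scale-down behavior the infrastructure must support.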