HeadlinesBriefing.com

Token‑Saving Tactics for Agentic AI Deployments

Towards Data Science

Running large language models in production quickly drains budgets. A single agent may start with a 500‑token system prompt but balloon to tens of thousands of tokens per call: leaked Claude system prompts run about 24 k tokens, and GPT‑5's roughly 15 k. Users have reported sending over 150 k input tokens to Gemini 3.1 Pro to get back just a handful of output tokens, inflating monthly spend to roughly $996 for enterprises.
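The arithmetic behind those figures is simple enough to sketch. The function below is a back‑of‑envelope cost model; the per‑million‑token prices are illustrative assumptions, not any vendor's actual rates.

```python
# Back-of-envelope cost model for agent token spend.
# Prices are illustrative assumptions, not current vendor rates.

def monthly_cost(input_tokens_per_call: int,
                 output_tokens_per_call: int,
                 calls_per_day: int,
                 price_in_per_mtok: float = 3.00,    # $ per 1M input tokens (assumed)
                 price_out_per_mtok: float = 15.00,  # $ per 1M output tokens (assumed)
                 days: int = 30) -> float:
    """Estimate monthly API spend in dollars."""
    cost_per_call = (input_tokens_per_call * price_in_per_mtok
                     + output_tokens_per_call * price_out_per_mtok) / 1_000_000
    return cost_per_call * calls_per_day * days

# An agent shipping a 150k-token context 100 times a day: input dwarfs output cost.
print(round(monthly_cost(150_000, 500, 100), 2))
```

Note how the input side dominates: at these assumed rates, the 150 k‑token context costs sixty times more per call than the 500 output tokens, which is why the tactics below all target the input side.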

To curb that bill, the article proposes four engineering principles, the first two centered on reusing tokens. Prompt caching stores pre‑computed key/value tensors for static prompt sections, letting subsequent calls skip the expensive prefill step, but it only fires on exact prefix matches. Semantic caching relaxes that constraint with a similarity layer, returning stored responses for near‑duplicate queries. Self‑hosted stacks like vLLM offer prefix‑caching flags, while OpenAI and Anthropic expose API‑level controls.

Routing requests to smaller models when possible and delegating subtasks to specialized sub‑agents further trim token counts. Pruning stale tool definitions and compacting conversation history keep unnecessary payloads from persisting across turns. The author’s interactive calculators show that applying these tactics can shave hundreds of thousands of input tokens per day, turning a five‑figure monthly bill into a modest expense.
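A minimal sketch of the routing‑plus‑compaction idea, with heavy assumptions: the model names are hypothetical, complexity is proxied by word count and a few keyword markers, and real routers typically use a classifier or the cheap model itself to triage. Likewise, production compaction usually replaces the elided span with an LLM‑written summary rather than a placeholder.

```python
def route_model(prompt: str, cheap: str = "small-model", strong: str = "large-model") -> str:
    """Send short, simple requests to a cheaper model; escalate long or hard ones.
    Model names and the word-count heuristic are illustrative assumptions."""
    hard_markers = ("prove", "refactor", "multi-step", "analyze")
    if len(prompt.split()) > 200 or any(m in prompt.lower() for m in hard_markers):
        return strong
    return cheap

def compact_history(turns: list[str], keep_last: int = 4) -> list[str]:
    """Drop middle turns, keeping the first (system/context) and the most recent.
    Production systems usually swap the dropped span for an LLM-written summary."""
    if len(turns) <= keep_last + 1:
        return turns
    elided = len(turns) - keep_last - 1
    return [turns[0], f"[{elided} earlier turns elided]"] + turns[-keep_last:]

print(route_model("translate 'hello' to French"))        # simple -> cheap model
print(compact_history([f"turn {i}" for i in range(10)])) # 10 turns -> 6 entries
```

The payoff compounds: every turn routed to a smaller model or sent with a compacted history avoids re‑billing the full context, which is where the calculators' six‑figure daily token savings come from.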