HeadlinesBriefing.com

Cactus-Compute's 26M Function-Calling Model Runs on Budget Devices

Hacker News

Cactus-Compute's Needle model, a 26M-parameter function-calling AI, achieves prefill speeds of 6000 tokens/second and decode speeds of 1200 tokens/second on consumer hardware such as smartphones and smartwatches. Unlike conventional transformer models, its Simple Attention Network architecture eliminates multi-layer perceptron (MLP) blocks entirely, relying solely on attention and gating mechanisms. This design choice proves effective for tool-use tasks, which hinge on retrieval-and-assembly operations rather than complex reasoning.
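Needle's exact layer layout isn't reproduced in this summary, but the core idea of an MLP-free block is easy to sketch. Below is a minimal PyTorch illustration in which a learned sigmoid gate, rather than an MLP, modulates the attention output; the class name, dimensions, and gating placement are assumptions for illustration, not the published design.

```python
import torch
import torch.nn as nn

class GatedAttentionBlock(nn.Module):
    """Hypothetical MLP-free block: self-attention whose output is
    scaled by a per-channel sigmoid gate before the residual add.
    Illustrative only; not Needle's published architecture."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)  # gating in place of an MLP

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + torch.sigmoid(self.gate(h)) * attn_out  # gated residual


x = torch.randn(1, 16, 256)   # (batch, sequence, channels)
block = GatedAttentionBlock()
print(block(x).shape)         # torch.Size([1, 16, 256])
```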

The model's training involved 200B tokens on 16 TPU v6e chips (27 hours), followed by 2B tokens of synthetic function-calling data generated via Gemini (45 minutes). It outperforms FunctionGemma-270M, Qwen-0.6B, and LFM2.5-350M in single-shot tool-use scenarios while remaining compatible with external knowledge systems such as retrieval-augmented generation (RAG). The architecture's efficiency stems from its focus on cross-attention primitives rather than on memorizing structured knowledge.
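The schema of the Gemini-generated data isn't published in this summary, but a single-shot function-calling sample typically pairs a tool signature, a user request, and the target call. A hypothetical example, with every name and field invented for illustration:

```python
# Hypothetical shape of one synthetic function-calling training sample;
# the actual schema of the Gemini-generated data is not specified here.
sample = {
    "tools": [{
        "name": "set_thermostat",
        "description": "Set the target temperature of a room thermostat.",
        "parameters": {"room": "string", "celsius": "number"},
    }],
    "user": "Make the bedroom 21 degrees.",
    "target": {
        "name": "set_thermostat",
        "arguments": {"room": "bedroom", "celsius": 21},
    },
}
```

The single-shot task is then exactly the retrieval-and-assembly operation described above: match the request to a tool and fill in its arguments.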

Needle's open-source release includes MIT-licensed weights on Hugging Face and full documentation on GitHub. Developers can try the model in a web UI playground or fine-tune it locally using the provided CLI tools. The model integrates with Cactus, an inference engine optimized for edge devices, enabling practical applications in smart-home systems, navigation, and messaging assistants.
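Since the weights are hosted on Hugging Face, they can presumably be fetched with the standard Hub client; the repo id below is a guess for illustration and should be checked against Cactus-Compute's actual Hugging Face organization.

```python
from huggingface_hub import snapshot_download

# Hypothetical repo id; verify the real one on Hugging Face before running.
local_dir = snapshot_download(repo_id="cactus-compute/needle")
print(local_dir)  # local path to the downloaded MIT-licensed checkpoint
```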

This development challenges assumptions about model size requirements for agentic systems, demonstrating that lean architectures can deliver robust tool interaction capabilities without sacrificing performance. The project's GitHub repository includes detailed technical documentation and TPU management guides for researchers exploring similar efficiency-focused AI implementations.