HeadlinesBriefing.com

Pure C Speech-to-Text: Mistral Voxtral Realtime 4B CPU-Only Inference

Hacker News: Front Page

A developer has released voxtral.c, a pure C implementation of Mistral AI's Voxtral Realtime 4B speech-to-text model that can run entirely on CPU with no external dependencies. The project provides both a standalone C inference engine and a simple Python reference implementation. The author's motivation is accessibility: in their view, Mistral limited the model's reach by restricting inference to its vLLM partnership. On Apple Silicon, the implementation also supports Metal GPU acceleration for faster processing, while the core remains dependency-free.

The C implementation features memory-mapped weights for near-instant loading, chunked audio processing with overlapping windows to bound memory usage, and a rolling KV cache that is compacted automatically once it exceeds 8192 positions. Users can transcribe audio files, capture live microphone input on macOS, or convert other formats with ffmpeg and pipe the result in. The streaming C API accepts audio incrementally and returns tokens as they become available, making it suitable for real-time applications.
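The rolling-cache idea can be illustrated with a minimal sketch. This is not code from voxtral.c: the `rolling_cache` type, the `rc_*` functions, and the policy of keeping only the newest `keep` positions are all assumptions made for illustration. The article only states that compaction happens past 8192 positions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical rolling cache. Every name here is illustrative and is
   NOT taken from the voxtral.c API. */
typedef struct {
    float *data;     /* max_pos * dim floats, one row per cached position */
    size_t dim;      /* floats stored per position */
    size_t max_pos;  /* capacity that triggers compaction, e.g. 8192 */
    size_t keep;     /* rows retained after compaction (assumed policy) */
    size_t len;      /* rows currently stored */
    size_t dropped;  /* total rows discarded so far */
} rolling_cache;

/* Allocate the backing buffer. Returns 0 on success, -1 on failure. */
int rc_init(rolling_cache *c, size_t dim, size_t max_pos, size_t keep) {
    assert(dim > 0 && keep < max_pos);
    c->data = malloc(max_pos * dim * sizeof(float));
    if (!c->data) return -1;
    c->dim = dim;
    c->max_pos = max_pos;
    c->keep = keep;
    c->len = 0;
    c->dropped = 0;
    return 0;
}

/* Discard the oldest rows so that only the newest `keep` rows remain. */
static void rc_compact(rolling_cache *c) {
    size_t drop = c->len - c->keep;
    memmove(c->data, c->data + drop * c->dim,
            c->keep * c->dim * sizeof(float));
    c->len = c->keep;
    c->dropped += drop;
}

/* Append one row; compact first if the append would exceed max_pos. */
void rc_push(rolling_cache *c, const float *row) {
    if (c->len == c->max_pos) rc_compact(c);
    memcpy(c->data + c->len * c->dim, row, c->dim * sizeof(float));
    c->len++;
}

void rc_free(rolling_cache *c) {
    free(c->data);
    c->data = NULL;
}
```

With `max_pos` set to 8192, memory stays bounded no matter how long the audio stream runs, at the cost of one `memmove` each time the buffer fills. A real decoder would also need to account for the discarded positions when computing attention offsets, which this sketch only tracks through the `dropped` counter.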

Performance varies significantly with the processing interval setting (the -I flag): lower values make transcription more responsive but add GPU overhead. The author notes that the project needs more testing before production use, although the core inference pipeline is in place. For developers, the project offers a self-contained, open-source alternative to vLLM-dependent setups that remains compatible with the original model weights.
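The interval trade-off can be made concrete with a small sketch. It assumes that the interval controls how much audio is accumulated before each processing pass, and it uses a 16 kHz sample rate in the usage note. The `interval_feeder` type and `feeder_*` functions are hypothetical and do not come from voxtral.c.

```c
#include <stddef.h>

/* Hypothetical accumulator: buffers incoming samples and counts how many
   processing passes an interval setting would trigger. Illustrative only;
   not the voxtral.c API. */
typedef struct {
    size_t interval_samples; /* samples to accumulate before one pass */
    size_t pending;          /* samples waiting for the next pass */
    size_t passes;           /* passes triggered so far */
} interval_feeder;

void feeder_init(interval_feeder *f, double interval_sec,
                 unsigned sample_rate) {
    f->interval_samples = (size_t)(interval_sec * sample_rate);
    if (f->interval_samples == 0) f->interval_samples = 1; /* avoid a zero-length interval */
    f->pending = 0;
    f->passes = 0;
}

/* Feed n new samples. A real engine would run the model once per pass;
   here each pass is only counted. */
void feeder_feed(interval_feeder *f, size_t n) {
    f->pending += n;
    while (f->pending >= f->interval_samples) {
        f->pending -= f->interval_samples;
        f->passes++; /* placeholder for one inference pass */
    }
}
```

Under these assumptions, ten seconds of 16 kHz audio triggers 5 passes at an interval of 2.0 seconds and 20 passes at 0.5 seconds. Each pass carries a fixed launch cost, which is one plausible reason shorter intervals trade throughput for latency.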