HeadlinesBriefing.com

Microsoft brings VibeVoice ASR to the Transformers library

Hacker News

Microsoft's open-source VibeVoice suite landed in Hugging Face's Transformers library this week, letting developers run the ASR model through a single API call. The release bundles VibeVoice-ASR, a unified speech-to-text model that ingests up to 60 minutes of audio in one pass and returns structured output (speaker, timestamps and transcript) for rapid prototyping and evaluation.
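To make the idea of structured output concrete, here is a minimal Python sketch of how speaker, timestamp and transcript records could be represented and summarized. The `Segment` class, its field names, the `speaking_time` helper and the sample values are hypothetical illustrations, not the model's actual output schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed span (hypothetical schema, for illustration only)."""
    speaker: str   # who spoke
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # what was said


def speaking_time(segments):
    """Sum the seconds spoken by each speaker."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


# Made-up sample data standing in for a diarized transcript.
segments = [
    Segment("Speaker 0", 0.0, 4.5, "Welcome to the weekly sync."),
    Segment("Speaker 1", 4.5, 9.0, "Thanks, let's start with the roadmap."),
    Segment("Speaker 0", 9.0, 12.0, "Sounds good."),
]

print(speaking_time(segments))  # {'Speaker 0': 7.5, 'Speaker 1': 4.5}
```

Keeping speaker, time span and text in one record is what makes downstream evaluation simple: per-speaker statistics or subtitle export become a single loop over the segments.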

First announced in January, VibeVoice-ASR supports more than 50 languages and lets users inject custom hotwords to improve domain-specific accuracy. The codebase now includes vLLM-compatible inference, which cuts latency for long recordings and enables batch processing on GPUs, and a detailed technical report explains the continuous speech tokenizers that run at 7.5 Hz. The same repository also hosts VibeVoice-Realtime-0.5B for streaming text-to-speech.
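The 7.5 Hz tokenizer rate helps explain why an hour of audio can fit in a single pass. The back-of-the-envelope Python sketch below assumes one token per tokenizer frame and ignores any text or prompt tokens; the function name `audio_frames` is a hypothetical helper, not part of the VibeVoice codebase.

```python
def audio_frames(minutes: float, rate_hz: float = 7.5) -> int:
    """Estimate tokenizer frames for a recording of the given length.

    Assumes one frame per 1/rate_hz seconds and nothing else in the
    sequence, so this is a rough lower bound on sequence length.
    """
    return round(minutes * 60 * rate_hz)


# A 60-minute recording at 7.5 frames per second.
print(audio_frames(60))  # 27000

# A single minute of audio, for scale.
print(audio_frames(1))   # 450
```

Under these assumptions, a full hour of speech comes to 27,000 frames, a sequence length that a long-context language model backbone can plausibly process at once.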

The suite also includes VibeVoice-TTS, a multi-speaker model that synthesizes conversations of up to 90 minutes with four distinct voices, though Microsoft removed its code after misuse concerns. All models inherit biases from the underlying Qwen2.5 1.5B backbone, prompting the team to stress responsible deployment and to advise against commercial use without further testing. The open-source release gives the speech community a rare end-to-end toolkit and should make speech research easier to reproduce.