HeadlinesBriefing.com

Apple's VSSFlow AI Generates Speech and Sound from Silent Videos

9to5Mac

Apple researchers have developed VSSFlow, an AI model that can generate both sound effects and speech from silent videos using a single unified system. The model addresses a long-standing split in audio generation: video-to-sound models typically struggle with speech, while text-to-speech models fail at producing non-speech sounds. Unlike previous attempts that trained sound and speech generation separately, VSSFlow demonstrates that joint training actually improves performance on both tasks.
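One way a single model can train jointly on such different data is to use one batch format for every source and mask out whatever condition a sample lacks. The sketch below is purely illustrative (the names, the zero-vector placeholders, and the concatenation scheme are assumptions, not details from the paper); it only shows how video-only, video-plus-transcript, and text-only samples can share one conditioning interface.

```python
import numpy as np

# Hypothetical sketch, not Apple's code: a condition that is absent for a
# given sample is replaced by a placeholder embedding, so one model can
# train on video+sound, talking video+transcript, and text-to-speech data.

D = 8                       # toy embedding width (assumption)
NULL_VIDEO = np.zeros(D)    # placeholder for "no video"
NULL_TEXT = np.zeros(D)     # placeholder for "no transcript"

def make_condition(video_emb=None, text_emb=None):
    """Concatenate video and transcript embeddings, masking missing ones."""
    v = video_emb if video_emb is not None else NULL_VIDEO
    t = text_emb if text_emb is not None else NULL_TEXT
    return np.concatenate([v, t])

c_sound = make_condition(video_emb=np.ones(D))                      # video only
c_talk = make_condition(video_emb=np.ones(D), text_emb=np.ones(D))  # both
c_tts = make_condition(text_emb=np.ones(D))                         # text only

# All three sample types produce the same condition shape.
assert c_sound.shape == c_talk.shape == c_tts.shape == (2 * D,)
```

Because every sample yields a condition of the same shape, the same network weights see all three data types in one training run.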

The architecture is built on flow matching, a generative technique that learns to transform random noise into the desired audio signal. A 10-layer system blends video and transcript signals directly into audio generation, allowing the model to handle both sound effects and speech within one framework. During training, VSSFlow learned from three data types: silent videos with environmental sounds, silent talking videos with transcripts, and text-to-speech data.
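The flow-matching idea the article mentions can be summarized in a few lines. In the standard formulation (the linear interpolation path and the names below are generic flow-matching conventions, not details from Apple's paper), the model is trained to predict the velocity that moves a noisy point toward the real audio:

```python
import numpy as np

# Illustrative flow-matching training target, not Apple's implementation.
rng = np.random.default_rng(0)

def flow_matching_target(audio, noise, t):
    """Linear path x_t = (1 - t) * noise + t * audio.
    The regression target is the path's constant velocity:
    d x_t / dt = audio - noise."""
    x_t = (1.0 - t) * noise + t * audio
    velocity = audio - noise
    return x_t, velocity

audio = rng.normal(size=16)   # stand-in for a real audio latent
noise = rng.normal(size=16)   # the random starting point
t = rng.uniform()             # random time in [0, 1]

x_t, v_target = flow_matching_target(audio, noise, t)
# A conditioned network v_theta(x_t, t, video, transcript) would then be
# trained with loss = mean((v_theta(...) - v_target) ** 2).
```

Once trained, the network defines a velocity field that can carry any noise sample toward plausible audio.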

To generate audio from a silent video, VSSFlow starts with random noise and uses visual cues sampled at 10 frames per second to shape ambient sounds while a transcript guides the generated voice. When tested against task-specific models, VSSFlow delivered competitive results across both tasks despite using a single unified system. The researchers have open-sourced the code on GitHub and are working to release the model's weights, with plans for an inference demo.
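The inference step described above amounts to numerically integrating the learned velocity field from noise (t = 0) to audio (t = 1). A minimal sketch, assuming simple Euler integration and a toy velocity field standing in for the real conditioned network (the step count and field are arbitrary choices, not from the paper):

```python
import numpy as np

def sample(velocity, noise, steps=50):
    """Euler-integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1."""
    x = noise.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x

# Toy velocity field that flows toward a fixed target; in VSSFlow the
# field would instead be conditioned on video frames and a transcript.
target = np.full(16, 0.5)
velocity = lambda x, t: target - x

out = sample(velocity, np.zeros(16))
# With this toy field the sample moves from the origin toward `target`.
```

In the real system, swapping the video or transcript conditions changes the velocity field, which is how one sampler produces both ambient sound and speech.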

Quick Fact: VSSFlow uses a 10-layer architecture to blend video and transcript signals into audio generation.