HeadlinesBriefing.com

Two Brothers Open-Source Text-to-Video Model

Hacker News: Front Page
Sahil and Manu, two brothers, spent two years building a text-to-video model from scratch and are releasing it under Apache 2.0. Their 2B-parameter model generates 2-5 seconds of footage at 360p or 720p. They claim better motion quality and aesthetics than Alibaba's comparable 1.3B model, positioning it as a stepping stone toward state-of-the-art results.

Their first model in early 2024 was a 180p GIF bot built on Stable Diffusion XL, which revealed the limitations of image VAEs for temporal coherence. For this version, they used a DiT-variant backbone trained with flow matching, T5 for text encoding, and a third-party VAE. The core challenge was building effective data curation pipelines, including hand-labeling and fine-tuning VLMs for filtering.
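For readers unfamiliar with flow matching, the training objective the brothers describe can be illustrated with a minimal sketch. This is a generic rectified-flow-style formulation, not their actual code: the model here is a stand-in callable, and NumPy replaces a real DiT backbone and optimizer.

```python
import numpy as np

def flow_matching_loss(model, x1, rng):
    """One conditional flow-matching training objective (rectified-flow variant).

    model: callable (x_t, t) -> predicted velocity, same shape as x_t.
           (In the real system this would be the DiT backbone.)
    x1:    batch of data samples, shape (B, D) -- e.g. video latents from a VAE.
    rng:   numpy Generator for noise and timestep sampling.
    """
    b, d = x1.shape
    x0 = rng.standard_normal((b, d))           # noise endpoint of the path
    t = rng.uniform(size=(b, 1))               # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1                 # linear interpolation between noise and data
    v_target = x1 - x0                         # constant velocity along the linear path
    v_pred = model(xt, t)
    return float(np.mean((v_pred - v_target) ** 2))  # MSE regression on velocity
```

At sampling time, the trained velocity field is integrated from pure noise at t=0 toward data at t=1 with an ODE solver, which is what makes this family of models comparatively simple to train and sample from.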

The brothers argue that owning the model's development is essential for building the product they want, as existing models like Sora or Veo can't support specific user needs like character consistency or camera controls. Their roadmap includes post-training for physics, distillation for speed, adding audio, and scaling the model. They've published their lab notes and are soliciting feedback from the developer community.