HeadlinesBriefing.com

ByteDance Drops 3B-Parameter Multimodal Model for Image and Video

Hacker News

ByteDance released Lance, a 3B active-parameter native unified multimodal model built for image and video understanding, generation, and editing within a single framework. The transformer backbone was trained entirely from scratch on fewer than 128 A100 GPUs, with only the ViT and VAE encoders carried over from existing work.

Lance supports six task modes: text-to-video, text-to-image, image editing, video editing, image understanding, and video understanding. A unified CLI handles all inference tasks, and the team provides example prompts for each mode. The project requires Python 3.10+, CUDA 12.4+, and a GPU with at least 40GB VRAM for inference.
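The stated requirements can be checked before attempting inference. The following is a minimal sketch, assuming only the three minimums quoted above; the function and constant names are hypothetical, it is not part of the Lance project, and it does not call the Lance CLI.

```python
import sys

# Minimums quoted in the article. All names here are illustrative
# assumptions, not identifiers from the Lance codebase.
MIN_PYTHON = (3, 10)
MIN_CUDA = (12, 4)
MIN_VRAM_GB = 40


def parse_version(text):
    """Turn a string such as '12.4' or '12.4.1' into a (major, minor) tuple."""
    return tuple(int(part) for part in text.split(".")[:2])


def unmet_requirements(python_version, cuda_version, vram_gb):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python 3.10 or newer is required")
    if parse_version(cuda_version) < MIN_CUDA:
        problems.append("CUDA 12.4 or newer is required")
    if vram_gb < MIN_VRAM_GB:
        problems.append("at least 40 GB of GPU memory is required")
    return problems


if __name__ == "__main__":
    # The CUDA version and VRAM figure are supplied by hand here; the
    # values below are placeholders, not measurements.
    for problem in unmet_requirements(sys.version_info, "12.4", 80):
        print(problem)
```

The CUDA version and memory figure are passed in as plain values so the check runs without a GPU. On a machine with PyTorch installed, they could be read from `torch.version.cuda` and `torch.cuda.get_device_properties(0).total_memory`, though whether Lance itself depends on PyTorch is an assumption here.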

What makes Lance noteworthy is its training efficiency. With 3B active parameters and a backbone trained on fewer than 128 A100 GPUs, it delivers strong results across image generation, image editing, and video benchmarks. ByteDance has open-sourced the code on GitHub, published the paper on arXiv, and made the model weights available on Hugging Face.