HeadlinesBriefing.com

Google's Gemini Embeddings 2 Preview: Multi-Modal AI Breakthrough

Towards Data Science

Google has unveiled Gemini Embeddings 2 Preview, an embedding model that processes text, PDFs, images, audio, and video within a single model. Traditional embedding models typically handle only text and documents, so this multi-modal support is a significant step forward. Because every supported media type is embedded into the same vector space, the preview model enables cross-modal retrieval-augmented generation (RAG) workflows, such as retrieving an image in response to a text query.

RAG pipelines rely on an embedding model to convert content into vectors, which are stored in a vector database for similarity search. At query time, the user's query is embedded with the same model and compared against the stored vectors, typically using cosine similarity, to find the most relevant matches. Because Google's new model covers all of these media types, developers no longer need a separate embedding system for each one, which simplifies AI development pipelines and reduces complexity.
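To make the retrieval step concrete, here is a minimal sketch of cosine-similarity search over a tiny in-memory index, using only the Python standard library. The item names and the three-dimensional vectors are invented for illustration; real embeddings would come from the model, have far more dimensions, and would normally live in a vector database rather than a dict. The helper names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Rank stored items by cosine similarity to the query, highest first."""
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical index mixing media types in one vector space.
# These toy vectors are made up; they are NOT real model output.
index = {
    "image:dolphin.jpg": [0.9, 0.1, 0.0],
    "audio:fishing_story.mp3": [0.2, 0.9, 0.1],
    "pdf:tax_form.pdf": [0.0, 0.1, 0.95],
}
query_vec = [0.85, 0.2, 0.05]  # stand-in for an embedded text query

print(top_k(query_vec, index, k=1))
```

Because text, images, and audio share one vector space, a single similarity function ranks all of them; with separate per-media models, their vectors would not be directly comparable.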

In the preview, inputs are limited to 8,192 tokens of text, six images per request, and two minutes of video. In tests reported in the article, the model correctly matched the text query "yellow animal" to dolphin images and matched an audio clip telling a "fishing story" to related visual content.

Developers can access the model through Google AI Studio and call it from Python using the google-genai library. By handling every supported media type in a single embedding space, this unified approach could simplify how AI systems index and retrieve information across diverse content formats.
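Since requests that exceed the preview limits will not be processed as intended, a pipeline might check inputs before sending them. The sketch below is a hypothetical client-side pre-flight check built only on the limits quoted above; the class and function names are invented, the token count is assumed to come from the caller (for example, a tokenizer), and the service itself remains the authority on what it accepts.

```python
from dataclasses import dataclass

# Preview limits as reported in the article; these may change, so
# confirm them against Google's current documentation.
MAX_TEXT_TOKENS = 8_192
MAX_IMAGES_PER_REQUEST = 6
MAX_VIDEO_SECONDS = 120  # two minutes


@dataclass
class EmbedRequest:
    """Hypothetical summary of one embedding request's inputs."""
    text_tokens: int = 0        # assumed to be counted by the caller
    image_count: int = 0
    video_seconds: float = 0.0


def limit_violations(req: EmbedRequest) -> list[str]:
    """Return a reason for each preview limit the request exceeds."""
    problems = []
    if req.text_tokens > MAX_TEXT_TOKENS:
        problems.append(f"text has {req.text_tokens} tokens (max {MAX_TEXT_TOKENS})")
    if req.image_count > MAX_IMAGES_PER_REQUEST:
        problems.append(f"{req.image_count} images (max {MAX_IMAGES_PER_REQUEST})")
    if req.video_seconds > MAX_VIDEO_SECONDS:
        problems.append(f"video is {req.video_seconds:g} s (max {MAX_VIDEO_SECONDS} s)")
    return problems


print(limit_violations(EmbedRequest(text_tokens=9_000, image_count=3, video_seconds=150)))
```

Requests that fail the check could be split up first, for example by chunking long text or trimming video into clips of two minutes or less.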