
Proxy‑Pointer RAG Delivers Grounded Multimodal Answers Without Embeddings

Towards Data Science

Proxy-Pointer RAG changes how multimodal answers are assembled by treating a document as a hierarchical tree of semantic blocks instead of fragmented chunks. Because retrieval operates on whole sections of this tree, figures and tables stay in context. Engineers can store images as referenced artifacts and let the LLM decide their relevance, with no multimodal embeddings required.
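The article describes this structure in prose only; as a rough sketch, the tree might look like the Python below, where SectionNode, Artifact, and render are illustrative names rather than the author's actual implementation. Each node holds its own text plus pointer placeholders for extracted figures, so retrieving a section surfaces everything in it at once.

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    """A figure or table kept as a file reference; its pixels are never embedded."""
    artifact_id: str
    path: str      # e.g. a PNG extracted from the source PDF
    caption: str

@dataclass
class SectionNode:
    """One semantic block in the document tree (section, subsection, ...)."""
    title: str
    text: str = ""
    artifacts: list[Artifact] = field(default_factory=list)
    children: list["SectionNode"] = field(default_factory=list)

    def render(self) -> str:
        """Serialize the whole section, replacing each image with a text
        pointer so figures and tables travel with their surrounding prose."""
        parts = [self.title, self.text]
        parts += [f"[IMAGE:{a.artifact_id}] {a.caption}" for a in self.artifacts]
        parts += [child.render() for child in self.children]
        return "\n".join(p for p in parts if p)
```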

In a prototype built from five CC‑BY research papers, the team extracted 270 figures and tables using the Adobe PDF Extract API. Text embeddings came from Gemini-embedding-001, trimmed to 1536 dimensions, while the synthesis and re‑ranking stages ran on Gemini-3.1-flash-lite-preview. FAISS indexed the resulting vectors for fast similarity search.
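None of the indexing code appears in the summary, but a standard FAISS setup for 1536-dimensional vectors takes only a few lines. The snippet below is a minimal sketch: the random vectors stand in for real chunk embeddings, and cosine similarity via inner product is an assumption, since the article does not state which metric FAISS was configured with.

```python
import faiss
import numpy as np

DIM = 1536  # embedding width after truncation, per the article

# Stand-in data: in the prototype, each row would be a Gemini text
# embedding for one chunk of the section tree.
rng = np.random.default_rng(0)
chunk_vectors = rng.random((1000, DIM), dtype=np.float32)

# Cosine similarity via inner product over L2-normalized vectors
# (an assumed choice; the article does not specify the metric).
faiss.normalize_L2(chunk_vectors)
index = faiss.IndexFlatIP(DIM)
index.add(chunk_vectors)

query = rng.random((1, DIM), dtype=np.float32)
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 most similar chunks
print(ids[0], scores[0])
```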

Key to the approach is that the LLM never sees raw images; it only knows that an image exists within a fully retrieved section. Because the document's breadcrumb path (e.g., 'Galore > 3.1 Zero Convolution') is injected into each chunk, the model can judge relevance purely from textual context. This mirrors how a human reads a paper and removes the noise of cross-modal similarity scoring.
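As a concrete illustration of that injection step, a helper like the one below could prefix each chunk with its breadcrumb before embedding and re-ranking. The function name and bracket format are assumptions; only the 'Galore > 3.1 Zero Convolution' breadcrumb style comes from the article.

```python
def with_breadcrumb(breadcrumb: list[str], chunk_text: str) -> str:
    """Prefix a chunk with its location in the document tree so the
    model can judge relevance from textual context alone."""
    return f"[{' > '.join(breadcrumb)}]\n{chunk_text}"

print(with_breadcrumb(
    ["Galore", "3.1 Zero Convolution"],
    "Example chunk text extracted from this subsection.",
))
# [Galore > 3.1 Zero Convolution]
# Example chunk text extracted from this subsection.
```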

The prototype suggests that multimodal RAG can scale without costly multimodal embeddings, cutting both memory footprint and inference latency. Because sections are treated as atomic units, every returned image is grounded in the source document, something enterprise chatbots have long struggled to guarantee. The result is a practical path toward richer, more trustworthy AI assistants.