HeadlinesBriefing.com

kapa.ai's Efficient Image Indexing for RAG Pipelines

Hacker News

kapa.ai builds AI assistants that answer questions from technical documentation, processing millions of images including screenshots and architecture diagrams. Rather than sending images to models at query time, they describe each image once during indexing using a cheap vision model. These text descriptions are stored and retrieved alongside regular text chunks, keeping per-query overhead between 1% and 6% above text-only systems.
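The describe-once idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not kapa.ai's implementation: `describe` is a hypothetical stand-in for whatever vision model is called, the `Chunk` type and its field names are invented for the example, and using a content hash to guarantee one call per distinct image is an assumption the article does not confirm.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Chunk:
    doc_id: str
    kind: str  # "text" or "image_caption" (hypothetical labels)
    text: str


def index_document(
    doc_id: str,
    text_chunks: List[str],
    images: List[bytes],
    describe: Callable[[bytes], str],  # hypothetical vision-model wrapper
    caption_cache: Dict[str, str],
) -> List[Chunk]:
    """Build the chunks for one document, captioning each distinct image once."""
    chunks = [Chunk(doc_id, "text", t) for t in text_chunks]
    for image in images:
        # Assumption: identical bytes mean the same image, so reuse its caption.
        key = hashlib.sha256(image).hexdigest()
        if key not in caption_cache:
            caption_cache[key] = describe(image)  # the only vision-model call
        # Each caption is stored as its own chunk next to the text chunks.
        chunks.append(Chunk(doc_id, "image_caption", caption_cache[key]))
    return chunks
```

At query time only the stored text is retrieved, so no image is sent to a model when a question arrives.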

Their analysis of real customer questions revealed two types of images in technical docs: illustrative ones that clarify existing text, and load-bearing ones that contain unique information like wiring diagrams or spec tables. Testing showed image context significantly improved answer quality across customer projects, with users getting specific paths and screenshots instead of generic instructions.

Query-time multimodal processing proved economically infeasible, adding 27% to 51% to per-query costs while also hitting payload limits. Microsoft's research team independently reached the same conclusion: describe images at ingestion and store the descriptions as separate chunks. This approach works because the expensive vision processing happens once per image, not on every query.
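The amortization argument can be made concrete with a small cost model. Only the overhead percentages (27% at the low end for query-time processing, 6% at the high end for index-time descriptions) come from the article; the base cost per query, the one-time captioning cost, and the query volume below are hypothetical placeholders chosen purely for illustration.

```python
def total_cost(queries: int, base_cost: float, overhead: float,
               one_time: float = 0.0) -> float:
    """Cost of serving `queries` questions with a fractional per-query overhead."""
    return one_time + queries * base_cost * (1.0 + overhead)


def break_even_queries(one_time: float, base_cost: float,
                       query_time_overhead: float,
                       index_time_overhead: float) -> float:
    """Number of queries after which the one-time captioning cost has paid off."""
    saved_per_query = base_cost * (query_time_overhead - index_time_overhead)
    if saved_per_query <= 0:
        raise ValueError("index-time overhead must be below query-time overhead")
    return one_time / saved_per_query


# Hypothetical inputs, not figures from the article.
BASE_COST = 0.01        # cost of one text-only query
CAPTIONING_COST = 50.0  # one-time cost of describing every image
QUERIES = 100_000

# Least favourable comparison for indexing: 27% query-time vs 6% index-time.
query_time = total_cost(QUERIES, BASE_COST, 0.27)
index_time = total_cost(QUERIES, BASE_COST, 0.06, one_time=CAPTIONING_COST)
break_even = break_even_queries(CAPTIONING_COST, BASE_COST, 0.27, 0.06)
```

Under these made-up inputs the index-time approach is cheaper once query volume passes the break-even point, and the gap keeps widening because the captioning cost never recurs.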

Production implementation required careful filtering, since most images are junk such as logos and decorative banners. A zero-shot classifier using multimodal embeddings removes clear junk while tolerating ambiguous cases. Small vision models produce captions nearly identical to those from expensive ones, making them the practical choice at scale. Storing captions as separate chunks rather than inline reduces costs while improving relevance.
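One way such a filter could work is sketched below. It assumes the image and a set of label prompts have already been embedded into the same vector space by some multimodal embedding model (not shown); the label names, the `margin` value, and the decision rule are hypothetical and not taken from the article. An image is dropped only when its best junk label beats its best keep label by a clear margin, so ambiguous images pass through.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_clear_junk(
    image_vec: Sequence[float],
    junk_label_vecs: Dict[str, Sequence[float]],  # e.g. "a company logo"
    keep_label_vecs: Dict[str, Sequence[float]],  # e.g. "an architecture diagram"
    margin: float = 0.1,                          # hypothetical threshold
) -> bool:
    """Return True only when the image is confidently closer to a junk label."""
    best_junk = max(cosine(image_vec, v) for v in junk_label_vecs.values())
    best_keep = max(cosine(image_vec, v) for v in keep_label_vecs.values())
    # Ambiguous images (scores within the margin) are kept, not discarded.
    return best_junk - best_keep > margin
```

Erring toward keeping borderline images costs a few extra captions, whereas discarding a load-bearing diagram would lose information that exists nowhere else in the text.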