HeadlinesBriefing.com

RAG Question Parsing: Why Structure Beats Raw Search Queries

Towards Data Science

Most RAG tutorials treat the user's question as the search query: embed it verbatim and retrieve the top-k chunks. That approach loses nuance. A user who asks about both premium amounts and renewal deadlines may get a partial answer because the system silently drops one of the two parts. The author argues that questions deserve structured treatment before they reach the retrieval layer.

The proposed solution is question_df, a relational schema with five typed columns: keywords, scope, shape, decomposition, and clarification. It mirrors the document-side structure (line_df, toc_df, span_df), and that symmetry lets questions be joined precisely against documents. Instead of growing a prompt template with special-case clauses, teams add a capability by adding a column, which the author argues keeps complexity linear rather than quadratic.
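As a rough illustration of the five-column schema, one row of question_df could be modeled as a typed record. This is a minimal sketch, not the author's implementation: the field types, the example values ("list", "independent"), and the use of a plain list of dicts in place of a dataframe are all assumptions made for the example. Only the five column names come from the article.

```python
from dataclasses import dataclass, asdict
from typing import Optional, Tuple


@dataclass(frozen=True)
class QuestionRow:
    """One row of a question_df-style table. Field types are assumptions."""
    keywords: Tuple[str, ...]      # terms to search for
    scope: Optional[str]           # where to look; None means the whole corpus
    shape: str                     # expected form of the answer (hypothetical values)
    decomposition: str             # how a compound question splits into parts
    clarification: Optional[str]   # follow-up to ask the user, if any


# Hypothetical parse of a question about premium amounts and renewal deadlines.
row = QuestionRow(
    keywords=("premium amount", "renewal deadline"),
    scope=None,
    shape="list",
    decomposition="independent",
    clarification=None,
)

# A list of dicts stands in for the dataframe so the sketch needs no third-party library.
question_df = [asdict(row)]
print(question_df[0])
```

Because both parts of the question sit in the keywords column, a downstream step can check that each one was retrieved instead of relying on a single embedded string.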

Two derived briefs guide downstream processing: the Retrieval Query focuses on actionable elements, while the Generation Brief specifies output format and exclusions. For compound questions such as "Does the indemnification clause survive termination, and if so, for how long?", the system tags the decomposition pattern (conditional, independent, sequential, or unified) so that no part is silently omitted.
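The sketch below shows one way a decomposition tag could drive the two briefs. The four tag names come from the article; everything else is hypothetical, including the `ParsedQuestion` type, the function names, the hand-written sub-questions, and the rule that only a "unified" question is retrieved as a single query.

```python
from dataclasses import dataclass
from typing import List

# Tag names from the article; how each one is handled below is an assumption.
DECOMPOSITION_TAGS = {"conditional", "independent", "sequential", "unified"}


@dataclass
class ParsedQuestion:
    text: str
    decomposition: str
    parts: List[str]  # sub-questions, written by hand here for illustration


def retrieval_queries(q: ParsedQuestion) -> List[str]:
    """Return one retrieval query per part so no part is dropped."""
    if q.decomposition not in DECOMPOSITION_TAGS:
        raise ValueError(f"unknown decomposition tag: {q.decomposition}")
    if q.decomposition == "unified":
        return [q.text]
    return list(q.parts)


def generation_brief(q: ParsedQuestion) -> str:
    """Build a short instruction for the generation step."""
    lines = [f"Answer all {len(q.parts)} part(s) of the question."]
    if q.decomposition == "conditional":
        lines.append("Answer later parts only if the first part holds; otherwise say so.")
    elif q.decomposition == "sequential":
        lines.append("Answer the parts in order.")
    return "\n".join(lines)


question = ParsedQuestion(
    text="Does the indemnification clause survive termination, and if so, for how long?",
    decomposition="conditional",
    parts=[
        "Does the indemnification clause survive termination?",
        "How long does the indemnification clause survive termination?",
    ],
)

queries = retrieval_queries(question)
brief = generation_brief(question)
print(queries)
print(brief)
```

Keeping the parts explicit means the duration sub-question gets its own retrieval pass, which is the omission the tagging is meant to prevent.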

An expert-maintained dictionary, concept_keywords_df, replaces embedding-based synonym matching by mapping user terminology directly to corpus vocabulary. Context windows are sized dynamically in lines, not characters or pages, based on the detected answer shape. The approach targets the failures that appear in production RAG systems when questions reach retrieval without structure.
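Both ideas in this paragraph can be sketched in a few lines: a dictionary lookup from user terms to corpus terms, and a line-based window whose size depends on the answer shape. The dictionary entries, shape names, and window sizes below are invented for illustration. The article does not give concrete values.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for concept_keywords_df: user term -> corpus vocabulary.
CONCEPT_KEYWORDS: Dict[str, List[str]] = {
    "cost": ["premium"],
    "due date": ["renewal deadline"],
}

# Hypothetical window sizes, measured in lines, per answer shape.
WINDOW_LINES_BY_SHAPE: Dict[str, int] = {"single_value": 3, "list": 15, "explanation": 40}
DEFAULT_WINDOW_LINES = 10


def expand_keywords(user_terms: List[str]) -> List[str]:
    """Map user terms to corpus terms; unknown terms pass through unchanged."""
    expanded: List[str] = []
    for term in user_terms:
        for corpus_term in CONCEPT_KEYWORDS.get(term.lower(), [term]):
            if corpus_term not in expanded:
                expanded.append(corpus_term)
    return expanded


def context_window(hit_line: int, shape: str, total_lines: int) -> Tuple[int, int]:
    """Return a half-open [start, end) range of line indices around a hit."""
    size = WINDOW_LINES_BY_SHAPE.get(shape, DEFAULT_WINDOW_LINES)
    start = max(0, hit_line - size // 2)
    end = min(total_lines, start + size)
    return start, end


print(expand_keywords(["Cost", "coverage"]))
print(context_window(100, "single_value", 500))
```

A lookup table is deterministic and auditable: when a user term fails to match, an expert adds a row, with no retraining or re-embedding involved.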