HeadlinesBriefing.com

Docling: IBM's Local PDF Parser for Enterprise RAG Systems

Towards Data Science

IBM Research released Docling, an open-source document parser that processes PDFs entirely on local machines without cloud uploads. The tool extracts tables with full cell structure, performs OCR on scanned documents, and recovers text from figures while keeping sensitive data in-house. This addresses compliance barriers that prevent enterprises from using cloud-based document intelligence services.

Unlike cloud services that charge per-page fees, Docling requires only a one-time model download and runs offline thereafter. Its pipeline combines layout detection, TableFormer for table structure recognition, and optional OCR through EasyOCR, PaddleOCR, Tesseract, or RapidOCR. The system produces relational table output in the same format as the Azure Document Intelligence and PyMuPDF parsers, which keeps it compatible with existing RAG pipelines.
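A minimal sketch of what local table extraction with Docling might look like, assuming a recent `docling` release installed via pip and its default pipeline options. The `DocumentConverter` call and `export_to_dataframe` come from Docling's Python API; the helper names `to_table_dict` and `extract_tables_locally`, and the `page` / `columns` / `rows` keys, are hypothetical and chosen only for illustration, since the article does not spell out its table format.

```python
from pathlib import Path
from typing import Any


def to_table_dict(columns: list[str], rows: list[list[Any]], page: int) -> dict[str, Any]:
    """Build one table record. The keys are hypothetical, not Docling's own schema."""
    return {
        "page": page,
        "columns": list(columns),
        # One dict per row, keyed by column name, so rows read like relational records.
        "rows": [dict(zip(columns, row)) for row in rows],
    }


def extract_tables_locally(pdf_path: str) -> list[dict[str, Any]]:
    """Parse a PDF on the local machine and return its tables as dicts."""
    # Lazy import: docling is a third-party package and may not be installed everywhere.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()  # models are fetched on first use, then cached
    result = converter.convert(Path(pdf_path))

    tables = []
    for table in result.document.tables:
        frame = table.export_to_dataframe()  # pandas DataFrame of the recognized cells
        # Assumption: the first provenance entry carries the page number.
        page = table.prov[0].page_no if table.prov else -1
        columns = [str(c) for c in frame.columns]
        tables.append(to_table_dict(columns, frame.values.tolist(), page))
    return tables
```

Calling `extract_tables_locally("policy.pdf")` (a placeholder file name) would run layout detection and table structure recognition on the local machine. Selecting a specific OCR backend is done through Docling's pipeline options and is left out of this sketch.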

For insurance contracts, medical records, M&A data rooms, and other confidential documents, Docling eliminates the legal and residency constraints that block cloud processing. Enterprises trade per-page costs and compliance reviews for CPU compute and initial setup time. The tradeoff favors organizations handling sensitive data at scale.

The parser fits into multi-engine architectures through a consistent table dictionary format. Code wrappers abstract the underlying engine, so downstream components can process outputs without knowing whether Docling, PyMuPDF (imported as fitz), or Azure Document Intelligence produced them. This engine-agnostic approach simplifies switching between local and cloud processing based on document sensitivity.
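The engine-agnostic wrapper idea can be sketched as follows. This is an illustration under stated assumptions, not the article's actual code: `TableEngine`, `StubEngine`, `ParserRouter`, and the required keys are all hypothetical names, and the stub stands in for real Docling, PyMuPDF, or Azure wrappers so the example runs without any of them installed.

```python
from typing import Any, Protocol

# Hypothetical shared table format: every engine must return dicts with these keys.
REQUIRED_KEYS = {"page", "columns", "rows"}


class TableEngine(Protocol):
    """Interface each engine wrapper is assumed to implement."""
    name: str

    def extract_tables(self, pdf_path: str) -> list[dict[str, Any]]: ...


class StubEngine:
    """Stand-in for a real wrapper; returns canned tables tagged with its name."""

    def __init__(self, name: str, tables: list[dict[str, Any]]):
        self.name = name
        self._tables = tables

    def extract_tables(self, pdf_path: str) -> list[dict[str, Any]]:
        return [dict(table, engine=self.name) for table in self._tables]


class ParserRouter:
    """Send sensitive documents to the local engine, everything else to the cloud one."""

    def __init__(self, local: TableEngine, cloud: TableEngine):
        self.local = local
        self.cloud = cloud

    def extract_tables(self, pdf_path: str, sensitive: bool) -> list[dict[str, Any]]:
        engine = self.local if sensitive else self.cloud
        tables = engine.extract_tables(pdf_path)
        # Check the shared format so downstream code never depends on the engine.
        for table in tables:
            missing = REQUIRED_KEYS - set(table)
            if missing:
                raise ValueError(f"{engine.name} table is missing keys: {sorted(missing)}")
        return tables
```

Downstream code calls only `ParserRouter.extract_tables`, so swapping a stub for a real wrapper, or changing the sensitivity rule, does not touch anything that consumes the tables.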