HeadlinesBriefing.com

OCRBase: Open-source PDF to Markdown/JSON API

Hacker News: Front Page

A new open-source project called OCRBase converts PDFs into structured Markdown or JSON using PaddleOCR and LLM-powered parsing. The tool offers a TypeScript SDK with React hooks, real-time WebSocket updates, and a queue-based system for processing thousands of documents. It targets developers who need to extract structured data from scanned files at scale.
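To make the queue and real-time tracking idea concrete, here is a minimal TypeScript sketch of how a client might fold job status events (such as those pushed over a WebSocket) into local state. It does not use the OCRBase SDK: the `JobStatus` values, the `JobEvent` shape, and the `applyEvent` function are all hypothetical names chosen for illustration, and the project's actual event schema may differ.

```typescript
// Hypothetical job lifecycle; OCRBase's real status names may differ.
type JobStatus = "queued" | "processing" | "completed" | "failed";

// Hypothetical shape of a status message a server might push to clients.
interface JobEvent {
  jobId: string;
  status: JobStatus;
  markdown?: string; // extracted output, present once a job completes
  error?: string;    // failure reason, present when a job fails
}

type JobState = Omit<JobEvent, "jobId">;

// Transitions a client accepts. Terminal states accept nothing further,
// so late or duplicated events cannot move a finished job backwards.
const ALLOWED: Record<JobStatus, JobStatus[]> = {
  queued: ["processing", "failed"],
  processing: ["completed", "failed"],
  completed: [],
  failed: [],
};

// Pure reducer: returns updated state, or the input unchanged when the
// event is not a valid transition from the job's current status.
function applyEvent(
  jobs: Record<string, JobState>,
  event: JobEvent,
): Record<string, JobState> {
  const { jobId, ...update } = event;
  const current = jobs[jobId];
  if (current !== undefined && ALLOWED[current.status].indexOf(update.status) === -1) {
    return jobs;
  }
  return { ...jobs, [jobId]: update };
}

// Usage: replay a stream of events, including one that arrives out of order.
let state: Record<string, JobState> = {};
const events: JobEvent[] = [
  { jobId: "doc-1", status: "queued" },
  { jobId: "doc-1", status: "processing" },
  { jobId: "doc-1", status: "completed", markdown: "# Invoice" },
  { jobId: "doc-1", status: "processing" }, // stale event, ignored
];
for (const e of events) {
  state = applyEvent(state, e);
}
console.log(state["doc-1"].status); // "completed"
```

Keeping the update logic in a pure function like this means the same code could sit behind a React hook or a plain WebSocket `onmessage` handler, and it stays easy to test without a running server.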

OCRBase is self-hostable and built with modern tools like Bun, but deployment requires a CUDA GPU with at least 12GB of VRAM. Self-hosting gives teams full control over their document processing pipeline and avoids cloud service dependencies and costs. The architecture prioritizes type safety and real-time job tracking for production workflows.

The project arrives amid growing demand for automated document processing in enterprise applications. By combining open-weight OCR models with LLM parsing, it offers a flexible alternative to proprietary APIs. Developers can build custom data extraction pipelines without vendor lock-in, though the GPU requirement may rule it out for some teams.