HeadlinesBriefing favicon HeadlinesBriefing.com

Baidu Releases Unlimited-OCR for Advanced Long-Context Document Parsing

Hacker News •
×

Baidu has released Unlimited-OCR on GitHub, introducing a model designed for one-shot long-horizon document parsing. The system aims to advance beyond existing Deepseek-OCR capabilities, handling complex document layouts with extended context windows. Available through Model Scope and HuggingFace, it targets researchers and developers working with multi-page documents.

The implementation supports both single-image and multi-page PDF processing through HuggingFace transformers on NVIDIA GPUs. Users can choose between 'gundam' and 'base' configurations with different resolution settings. Single images support both modes while PDFs use the base configuration. The model processes up to 32,768 tokens with specialized n-gram repetition controls.

For production deployment, Unlimited-OCR integrates with SGLang inference serving, offering OpenAI-compatible API endpoints with custom logit processors. The repository includes batch processing scripts for handling image directories or PDF files concurrently. Installation requires Python 3.12.3 with CUDA 12.9 support and specific package versions.

The project builds on foundations from Deepseek-OCR and Paddle OCR, representing ongoing progress in document AI. A research paper is available on arXiv with full citation details. This release demonstrates how open-source OCR models continue evolving toward more sophisticated document understanding tasks.