HeadlinesBriefing.com

Smart PDFs Deliver Clean Markdown to LLMs While Preserving Visual Layout

Hacker News

Most PDFs online carry no structural markup, which causes trouble when LLMs try to parse them. The format stores glyph positions and font sizes, so text extractors must guess where headings end and paragraphs begin. That was fine when humans were the only readers, but ChatGPT and Claude now ingest PDFs regularly and struggle with broken line wraps and flattened tables.
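To make the guessing problem concrete, here is a minimal sketch of the kind of font-size heuristic an extractor might fall back on. The `(text, font_size)` tuples, the `guess_structure` name, and the `heading_ratio` threshold are all illustrative assumptions, not the behavior of any particular library.

```python
from statistics import median

def guess_structure(lines, heading_ratio=1.2):
    """Label each line 'heading' or 'body' by comparing its font size to the median."""
    body_size = median(size for _, size in lines)
    labelled = []
    for text, size in lines:
        kind = "heading" if size >= body_size * heading_ratio else "body"
        labelled.append((kind, text))
    return labelled

# Hypothetical extractor output: one (text, font_size) pair per visual line.
sample = [
    ("Experience", 16.0),
    ("Built a parser for", 11.0),
    ("scanned invoices.", 11.0),
    ("Skills", 11.0),  # a bold heading set at body size
    ("Python, Rust", 11.0),
]

for kind, text in guess_structure(sample):
    print(f"{kind:8} {text}")
```

The heuristic catches the large "Experience" heading but labels "Skills" as body text, and it cannot tell that the two wrapped lines belong to one sentence. That is the sort of inference error an explicit structure layer is meant to remove.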

A developer used a little-known PDF 1.4 feature to solve this. The spec allows replacement text to be attached to marked content, a mechanism originally meant for ligatures and non-Unicode characters. With structured Markdown attached to spans of the content stream, extractors such as PyMuPDF and Poppler return a clean hierarchy while PDF viewers display the normal formatting. One file serves both audiences without conversion steps.
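As a rough illustration of the mechanism, the sketch below wraps ordinary drawing operators in a marked-content span that carries replacement text. It assumes the feature in question is the `/ActualText` property on a `/Span` marked-content sequence; the summary does not name the key, and the helper `actual_text_span`, the font resource `/F1`, and the coordinates are hypothetical. It is not the author's implementation.

```python
def actual_text_span(visible_ops: bytes, markdown: str) -> bytes:
    """Wrap content-stream operators in a marked-content span with replacement text."""
    # PDF text strings may be UTF-16BE with a leading byte-order mark (FE FF).
    # Writing the value as a hex string avoids escaping parentheses and backslashes.
    hex_text = ("\ufeff" + markdown).encode("utf-16-be").hex().upper()
    header = f"/Span << /ActualText <{hex_text}> >> BDC\n".encode("ascii")
    return header + visible_ops + b"\nEMC\n"

# Visible layer: draw "Experience" at 16 pt (assumes a font resource named /F1).
heading_ops = b"BT /F1 16 Tf 72 700 Td (Experience) Tj ET"

# Machine layer: the same heading expressed as Markdown.
stream = actual_text_span(heading_ops, "## Experience\n")
print(stream.decode("ascii"))
```

A viewer still paints the glyphs from the `Tj` operator, while an extractor that honors the replacement text would report `## Experience` for that span. Producing a complete file would also require the usual page, font, and cross-reference objects, which are omitted here.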

Testing showed single-digit size overhead across resumes, textbooks, and research papers. Token counts remained roughly identical, but structure became explicit rather than inferred. Both ChatGPT and Claude returned markdown with proper headings and bullet points when fed smart PDFs, matching the embedded layer exactly.

The result is adaptive documentation: humans see familiar layouts while machines receive parseable structure. No separate versions, no manual tagging. The author plans a Google Docs extension to automate this process.