HeadlinesBriefing.com

ICLR 2026 Affiliation Pipeline Delivers Clean Dataset and Treemap

Hacker News

Dmytro Lopushanskyy released an end‑to‑end pipeline that converts the PDFs of all 5,356 papers accepted at ICLR 2026 into a clean institutional‑affiliation dataset. By extracting names and affiliations from each paper's title block instead of from OpenReview profiles, the tool avoids the “current job” drift that mislabels papers when a profile reflects where an author works now rather than the affiliation printed on the paper. The output includes CSV and XLSX files and a treemap ranking the top 50 contributors.

The repository ships several CSVs: iclr2026_public with authors, normalized institutions, countries and OpenReview URLs; iclr2026_institutions_ranked_unique listing each institution’s unique‑paper count; and a fractional‑credit version for sensitivity analysis. A side‑by‑side spreadsheet shows how different counting methods affect rankings, revealing that the top institutions remain stable across the board.
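The difference between the two counting methods can be illustrated with a minimal Python sketch. It is not taken from the repository: the sample data, the function names, and in particular the fractional rule (one unit of credit per paper, split equally among its authors) are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: paper ID -> list of (author, normalized institution).
papers = {
    "paper_a": [("Author 1", "MIT"), ("Author 2", "MIT"), ("Author 3", "Stanford University")],
    "paper_b": [("Author 4", "Stanford University")],
}

def unique_paper_counts(papers):
    """Credit an institution once per paper that lists at least one of its authors."""
    counts = defaultdict(int)
    for authors in papers.values():
        # A set ensures two MIT authors on one paper still count as one paper.
        for institution in {inst for _, inst in authors}:
            counts[institution] += 1
    return dict(counts)

def fractional_counts(papers):
    """Assumed rule: each paper carries one unit of credit, split equally per author."""
    credit = defaultdict(float)
    for authors in papers.values():
        if not authors:
            continue  # skip papers with no parsed authors
        share = 1.0 / len(authors)
        for _, institution in authors:
            credit[institution] += share
    return dict(credit)

unique = unique_paper_counts(papers)
fractional = fractional_counts(papers)
```

In this toy sample the unique count gives Stanford University two papers and MIT one, while the fractional version gives Stanford 4/3 and MIT 2/3 of a paper. Comparing the two orderings across a full dataset is the kind of sensitivity check the side‑by‑side spreadsheet reports.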

Parsing succeeds on 96% of papers by handling four common layout patterns in ICLR templates, then normalizing about 250 abbreviation variants, for example collapsing MIT and Massachusetts Institute of Technology into a single entry. The resulting treemap, available in PNG and SVG, offers a visual snapshot of who is shaping AI research now, with industry and academia distinguished by shading.
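The normalization step can be pictured as a lookup from cleaned-up raw strings to one canonical name. The sketch below is an assumption about how such a step might look, not the project's actual code: the `ALIASES` table holds only the one example named above, and `normalize_institution` is a hypothetical helper.

```python
import re

# Hypothetical alias table: lowercase variant -> canonical institution name.
# The project reportedly handles about 250 variants; only one pair is shown here.
ALIASES = {
    "mit": "Massachusetts Institute of Technology",
    "massachusetts institute of technology": "Massachusetts Institute of Technology",
}

def normalize_institution(raw):
    """Return the canonical name for a raw affiliation string, if one is known."""
    # Collapse runs of whitespace, trim, and lowercase before the lookup.
    key = re.sub(r"\s+", " ", raw).strip().lower()
    # Drop stray punctuation that often trails an affiliation in a title block.
    key = key.strip(".,;")
    # Unknown strings fall through unchanged (apart from trimming).
    return ALIASES.get(key, raw.strip())

names = [normalize_institution(s) for s in ["MIT", " Massachusetts  Institute of Technology ", "MIT."]]
```

All three inputs map to the same entry, so they would be counted as one institution. Strings without an alias are returned as written, which keeps unrecognized affiliations visible for manual review rather than silently merging them.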

Users can regenerate the chart with a single command, or rebuild the entire pipeline in under two hours of network time and roughly 5 GB of disk space for the PDFs. The project is MIT‑licensed, encouraging reuse for other conferences, and demonstrates how open‑source tooling can produce high‑quality, publication‑ready research metadata.