HeadlinesBriefing.com

Study gauges ChatGPT‑4's econometrics code accuracy

Towards Data Science

Researchers publishing in Health Economics Review tested ChatGPT‑4.0 Pro on three core causal‑inference techniques in Python, R and Stata. Using problem sets from Scott Cunningham’s *Causal Inference: The Mixtape*, they asked the model to code a Difference‑in‑Differences analysis of abortion law changes, an inverse probability of treatment weighting (IPTW) estimator, and a regression‑discontinuity design. The goal was to see whether AI‑generated code can stand in for hand‑written econometrics code.
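For readers unfamiliar with the first of these techniques, the following is a minimal sketch of a two‑group, two‑period Difference‑in‑Differences estimate. It is not the study's code or the Mixtape's abortion‑law exercise, which would typically use a regression on real data; the function name `did_estimate` and the toy rows are hypothetical and chosen only for illustration.

```python
from statistics import mean


def did_estimate(rows):
    """Return the 2x2 Difference-in-Differences estimate.

    rows is a list of (treated, post, outcome) tuples, where treated and
    post are 0/1 indicators. Hypothetical helper for illustration only.
    """
    def cell_mean(treated, post):
        values = [y for (t, p, y) in rows if t == treated and p == post]
        if not values:
            raise ValueError(f"no observations for treated={treated}, post={post}")
        return mean(values)

    # Change over time in the treated group minus the change in the control group.
    treated_change = cell_mean(1, 1) - cell_mean(1, 0)
    control_change = cell_mean(0, 1) - cell_mean(0, 0)
    return treated_change - control_change


# Toy data (invented, not from the study): (treated, post, outcome)
toy_rows = [
    (1, 0, 10.0), (1, 0, 12.0),   # treated group, before the policy change
    (1, 1, 15.0), (1, 1, 17.0),   # treated group, after
    (0, 0, 8.0),  (0, 0, 10.0),   # control group, before
    (0, 1, 9.0),  (0, 1, 11.0),   # control group, after
]

estimate = did_estimate(toy_rows)
print(estimate)
```

In this toy example the treated group's mean rises by 5 and the control group's by 1, so the estimate attributes the remaining 4 to the treatment, under the usual parallel‑trends assumption.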

Unlike earlier, subjective assessments, the study ran the generated scripts and compared their outputs to the textbook benchmarks. Accuracy was measured against reference results produced in R 3.6.0, Stata 18 and Python 3.13. Including Stata mattered because many health‑economics scholars still rely on it, yet it is rarely examined in AI‑coding research.
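The summary does not say how outputs were matched to the benchmarks, so the sketch below shows only one plausible way to automate such a check: comparing each generated estimate to a reference value within a relative tolerance. The function `matches_benchmark`, the tolerance, and all numbers are assumptions for illustration, not values from the study.

```python
import math


def matches_benchmark(generated, benchmark, rel_tol=1e-4):
    """Compare generated estimates to reference values, name by name.

    Returns a dict mapping each benchmark name to True/False.
    The relative tolerance is an assumed value, not one from the study.
    """
    report = {}
    for name, expected in benchmark.items():
        actual = generated.get(name)
        report[name] = (
            actual is not None
            and math.isclose(actual, expected, rel_tol=rel_tol)
        )
    return report


# Hypothetical numbers, invented for this example.
benchmark = {"did_coef": -0.125, "std_err": 0.040}
generated = {"did_coef": -0.125001, "std_err": 0.052}

report = matches_benchmark(generated, benchmark)
print(report)
```

A check like this flags a script that runs without error but produces a different standard error, which is the kind of discrepancy that reading the code alone can miss.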

Prompt design involved four seasoned econometricians who crafted standardized, context‑rich requests for each language. They first gave simple problem statements, then expanded to full workflows including data cleaning, variable creation, model fitting and figure generation. This tiered approach mimics real research pipelines, exposing whether the model can handle end‑to‑end coding without manual intervention.

Results showed mixed performance: the model reproduced benchmark outputs for the Difference‑in‑Differences task in Python and R, but faltered in Stata, generating syntax errors that required manual edits. IPTW and regression‑discontinuity scripts displayed lower accuracy across all three languages, suggesting current LLMs still need expert oversight for complex econometric code.