HeadlinesBriefing.com

DiffuJudge‑AV: Calibrated AV Video Evaluation with Tweedie Denoising

Towards Data Science

DiffuJudge‑AV is a new framework that treats LLM‑judge scores as noisy sensor readings and applies a one‑step Tweedie denoiser to recover calibrated estimates. It exposes each judge to seven controlled perturbations, including order swaps, rubric paraphrases, temperature changes, and frame shuffles, then aggregates 22 noisy observations per scenario. The target use case is safety‑critical evaluation of autonomous‑vehicle (AV) video.
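To make the "noisy sensor" framing concrete, here is a minimal sketch of collecting repeated judge scores under named perturbations. It is not the framework's own code: `stub_judge`, `collect_observations`, the scenario dictionary, the 1 to 5 score scale, and the noise level are all hypothetical. Only the four perturbations named above are listed, and the repeat count is illustrative rather than a reconstruction of the 22 observations per scenario.

```python
import random
from statistics import mean, pstdev

# The four perturbation types named in the summary (the remaining three are not listed there).
PERTURBATIONS = ["order_swap", "rubric_paraphrase", "temperature_change", "frame_shuffle"]


def stub_judge(scenario, perturbation, rng):
    """Hypothetical stand-in for an LLM judge: latent quality plus Gaussian noise.

    A real judge would render the perturbed prompt and video frames and call a model.
    The 1-5 scale and the 0.5 noise level are assumptions for illustration.
    """
    noisy = scenario["true_score"] + rng.gauss(0.0, 0.5)
    return min(5.0, max(1.0, noisy))


def collect_observations(scenario, judge, perturbations, repeats, seed=0):
    """Return a list of (perturbation, score) pairs, `repeats` per perturbation."""
    rng = random.Random(seed)
    return [(p, judge(scenario, p, rng)) for p in perturbations for _ in range(repeats)]


obs = collect_observations(
    {"id": "clip-001", "true_score": 4.0}, stub_judge, PERTURBATIONS, repeats=3
)
scores = [s for _, s in obs]
print(f"{len(obs)} observations, mean={mean(scores):.2f}, spread={pstdev(scores):.2f}")
```

The spread across perturbations is the quantity a denoiser can exploit: a judge whose score moves a lot under harmless rewordings is a noisier sensor than one whose score stays put.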

In a benchmark of 28,400 judgments on Wayve’s LingoQA dataset, the open‑source Qwen2.5‑VL‑7B outperformed larger closed models, reaching Pearson r = 0.857, Spearman ρ = 0.856, quadratic‑weighted κ = 0.837, MAE = 0.57, and fail‑detection F1 = 0.712. The results suggest that calibration and uncertainty estimation can matter more than raw model scale.
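For readers unfamiliar with the agreement metrics listed above, the following sketch shows how they are commonly computed between human labels and judge scores. It uses only the Python standard library. The 1 to 5 ordinal scale, the rule that a score at or below 2 counts as a "fail", and the toy data are assumptions, not details from the benchmark.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def quadratic_weighted_kappa(y_true, y_pred, min_label, max_label):
    """Cohen's kappa with quadratic disagreement weights on integer labels."""
    k = max_label - min_label + 1
    n = len(y_true)
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_label][p - min_label] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_true[i] * hist_pred[j] / n  # expected count under independence
    return 1.0 - num / den


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def fail_f1(y_true, y_pred, fail_max=2):
    """F1 for the 'fail' class, assuming (hypothetically) fail means score <= fail_max."""
    tp = sum(t <= fail_max and p <= fail_max for t, p in zip(y_true, y_pred))
    fp = sum(t > fail_max and p <= fail_max for t, p in zip(y_true, y_pred))
    fn = sum(t <= fail_max and p > fail_max for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


human = [1, 2, 3, 4, 5, 2, 4]  # toy labels, not benchmark data
judge = [1, 3, 3, 4, 5, 2, 5]
print(pearson(human, judge), spearman(human, judge))
print(quadratic_weighted_kappa(human, judge, 1, 5), mae(human, judge), fail_f1(human, judge))
```

Correlation and κ measure agreement over the whole scale, while fail‑detection F1 isolates the safety‑relevant question of whether bad clips are caught.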

DiffuJudge‑AV’s denoising step relies on Tweedie’s posterior mean, computed from a Gaussian KDE over sampled scores and combined with a precision‑weighted mean across perturbations. The resulting posterior uncertainty feeds a split‑conformal interval that snaps to ordinal boundaries, turning a single numeric score into a calibrated three‑way decision: pass, fail, or human review. Clear cases can then be handled automatically, with ambiguous ones routed to a human.
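The pipeline described above can be sketched roughly as follows. This is one plausible reading, not the framework's implementation. Tweedie's formula for Gaussian noise gives the posterior mean as E[x | y] = y + σ² · d/dy log p(y), and here the density p is estimated with a Gaussian KDE. The function names, the bandwidth, the noise variance, the 1 to 5 scale, the pass/fail thresholds, and the choice to scale conformal residuals by the posterior standard deviation are all assumptions.

```python
import math
from statistics import mean, pvariance


def kde_score(y, samples, bandwidth):
    """d/dy log p(y) for a Gaussian KDE over `samples` (normalising constants cancel)."""
    weights = [math.exp(-0.5 * ((y - s) / bandwidth) ** 2) for s in samples]
    pull = sum(w * (s - y) for w, s in zip(weights, samples))
    return pull / (bandwidth ** 2 * sum(weights))


def tweedie_denoise(y, samples, noise_var, bandwidth):
    """One-step Tweedie estimate: y + noise_var * score(y)."""
    return y + noise_var * kde_score(y, samples, bandwidth)


def precision_weighted_mean(groups, var_floor=1e-3):
    """Combine per-perturbation means, weighting each by the precision of its mean.

    `groups` maps perturbation name -> list of scores. Returns (mean, variance).
    """
    num = den = 0.0
    for scores in groups.values():
        precision = len(scores) / max(pvariance(scores), var_floor)
        num += precision * mean(scores)
        den += precision
    return num / den, 1.0 / den


def conformal_quantile(residuals, alpha):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest residual."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else sorted(residuals)[rank - 1]


def decide(estimate, sigma, q, lo=1, hi=5, pass_min=4, fail_max=2):
    """Snap the interval outward to integer grades, then make a three-way call."""
    lower = max(lo, math.floor(estimate - q * sigma))
    upper = min(hi, math.ceil(estimate + q * sigma))
    if lower >= pass_min:
        return "pass"
    if upper <= fail_max:
        return "fail"
    return "human-review"


# Toy scores grouped by perturbation (hypothetical values).
groups = {"order_swap": [4.0, 4.5, 4.5], "frame_shuffle": [4.0, 5.0]}
pooled = [s for scores in groups.values() for s in scores]
raw, post_var = precision_weighted_mean(groups)
estimate = tweedie_denoise(raw, pooled, noise_var=0.25, bandwidth=0.5)

# Calibration residuals |human - estimate| / sigma from a held-out split (made up here).
q = conformal_quantile([0.2, 0.5, 0.7, 0.9, 1.1, 1.3, 1.4, 1.6, 1.8, 2.0], alpha=0.1)
print(decide(estimate, math.sqrt(post_var), q))
```

The design choice worth noting is the outward snap: widening the interval to whole grades makes the decision conservative, so a clip passes or fails automatically only when every grade in its interval agrees.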

For teams that must triage thousands of AV clips daily, DiffuJudge‑AV delivers a scalable evaluation pipeline that flags unsafe predictions with quantified risk. By exposing judges to controlled noise and correcting for it, the framework turns opaque model judgments into actionable metrics, tightening the safety net that governs autonomous‑driving releases for manufacturers and regulators alike.