HeadlinesBriefing.com

GBDTs Dominate Payment Fraud Detection Over LLMs in New Benchmark

Towards Data Science

A new benchmark takes on a persistent question in payments engineering: can LLM agents handle real-time transaction scoring? On the evidence it presents, the answer is no. Sandeep Muthu built a reproducible test comparing gradient-boosted decision trees (GBDTs) against simulated LLM inference for payment authorization, using synthetic ISO 8583-shaped data with 20 features per transaction.

The latency gap is stark. On a single CPU core, the GBDT scorer achieves a p99 latency of 0.15 milliseconds, while the calibrated LLM simulator has a p99 of around 1,212 milliseconds. This matters because the end-to-end budget for an ISO 8583 authorization is roughly 100 milliseconds, and inference is only one stage alongside network transit and feature lookup, so the LLM's p99 alone is about 12 times the entire budget. The cost gap is just as wide: processing 50,000 transactions per second for one hour costs about $54 with a GBDT, versus $16,200 with GPT-4o-mini and $351,000 with frontier models.
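The cost and latency figures above reduce to simple arithmetic. The following Python sketch is not taken from the benchmark; it only restates the quoted numbers as cost per million transactions and as a share of the 100 ms budget. All constant and function names are hypothetical.

```python
# Figures quoted in the summary; names and structure are illustrative assumptions.
TPS = 50_000                      # transactions per second
SECONDS_PER_HOUR = 3_600
AUTH_BUDGET_MS = 100.0            # approximate end-to-end authorization budget

HOURLY_COST_USD = {
    "gbdt": 54.0,
    "gpt-4o-mini": 16_200.0,
    "frontier": 351_000.0,
}
P99_MS = {"gbdt": 0.15, "llm_simulator": 1_212.0}


def transactions_per_hour(tps: int = TPS) -> int:
    """Number of transactions processed in one hour at a steady rate."""
    return tps * SECONDS_PER_HOUR


def cost_per_million(hourly_cost_usd: float, tps: int = TPS) -> float:
    """Convert an hourly bill into USD per one million transactions."""
    return hourly_cost_usd / transactions_per_hour(tps) * 1_000_000


def budget_share(p99_ms: float, budget_ms: float = AUTH_BUDGET_MS) -> float:
    """Fraction of the authorization budget consumed by inference at p99."""
    return p99_ms / budget_ms


if __name__ == "__main__":
    for name, cost in HOURLY_COST_USD.items():
        print(f"{name}: ${cost_per_million(cost):,.2f} per million transactions")
    for name, p99 in P99_MS.items():
        print(f"{name}: {budget_share(p99):.2%} of the budget at p99")
```

At the quoted rate, one hour is 180 million transactions, which works out to roughly $0.30, $90, and $1,950 per million transactions for the three options.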

Beyond raw performance, non-determinism is a problem for hot-path scoring. Five hundred calls with identical inputs produced one distinct GBDT score but 498 different LLM outputs, even with temperature set to zero. Hosted inference remains non-deterministic, which makes model validation difficult for regulated financial decisions. The benchmark doesn't dismiss LLMs entirely: agents excel at asynchronous cold-path tasks such as drafting suspicious activity reports (SARs) and gathering evidence through MCP-typed tools.
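A repeatability check of this kind can be sketched in a few lines of Python. This is an illustration, not the benchmark's harness: `deterministic_scorer` is a hypothetical stand-in for a GBDT (a pure function of its input), and `make_noisy_scorer` is a hypothetical stand-in for a hosted model whose output varies between calls.

```python
import hashlib
import json
import random
from typing import Any, Callable, Dict

Payload = Dict[str, Any]


def distinct_outputs(score_fn: Callable[[Payload], Any],
                     payload: Payload, calls: int = 500) -> int:
    """Call score_fn repeatedly on the same payload and count distinct results."""
    seen = set()
    for _ in range(calls):
        # Serialize so that structured outputs (dicts, lists) are hashable.
        seen.add(json.dumps(score_fn(payload), sort_keys=True))
    return len(seen)


def deterministic_scorer(payload: Payload) -> float:
    """Hypothetical GBDT stand-in: the score depends only on the input."""
    blob = json.dumps(payload, sort_keys=True).encode()
    digest = hashlib.sha256(blob).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def make_noisy_scorer(seed: int) -> Callable[[Payload], float]:
    """Hypothetical hosted-model stand-in: adds per-call jitter to the score."""
    rng = random.Random(seed)

    def noisy(payload: Payload) -> float:
        return deterministic_scorer(payload) + rng.gauss(0.0, 1e-3)

    return noisy


if __name__ == "__main__":
    txn = {"amount": 125.40, "mcc": "5411", "country": "US"}
    print("deterministic:", distinct_outputs(deterministic_scorer, txn))
    print("noisy:", distinct_outputs(make_noisy_scorer(7), txn))
```

A scorer that passes model validation in a regulated setting would be expected to report exactly one distinct output here; anything larger means the same transaction can receive different decisions on replay.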

Instead, it recommends a hybrid architecture: classical ML for the synchronous hot path where milliseconds matter, LLM agents for the investigative cold path where reasoning and context justify the latency and cost. This separation respects both technical constraints and regulatory requirements while leveraging each approach's strengths.
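One way to picture the split is a synchronous scoring call that decides immediately and hands suspicious cases to a queue for later investigation. The Python sketch below is a minimal illustration under assumed names and thresholds (`authorize`, `stub_score`, `decline_at`, `review_at` are all hypothetical); it is not the architecture from the benchmark, and the cold-path consumer (for example, an LLM agent) is left out.

```python
from dataclasses import dataclass
from queue import Queue
from typing import Any, Callable, Dict

Txn = Dict[str, Any]


@dataclass(frozen=True)
class Decision:
    txn_id: str
    score: float
    approved: bool


def stub_score(txn: Txn) -> float:
    """Hypothetical stand-in for a trained GBDT; reads a precomputed risk value."""
    return float(txn["risk"])


def authorize(txn: Txn,
              score_fn: Callable[[Txn], float],
              review_queue: "Queue[Decision]",
              decline_at: float = 0.9,
              review_at: float = 0.5) -> Decision:
    """Hot path: score synchronously, decide, and enqueue risky cases for review."""
    score = score_fn(txn)
    decision = Decision(txn_id=txn["id"], score=score, approved=score < decline_at)
    if score >= review_at:
        # Cold path: an unbounded Queue.put returns immediately, so the
        # authorization response never waits on the investigation.
        review_queue.put(decision)
    return decision


if __name__ == "__main__":
    cold_path: "Queue[Decision]" = Queue()
    for txn in ({"id": "t1", "risk": 0.10},
                {"id": "t2", "risk": 0.60},
                {"id": "t3", "risk": 0.95}):
        print(authorize(txn, stub_score, cold_path))
    print("queued for investigation:", cold_path.qsize())
```

The design point is that the decision returned to the caller depends only on the deterministic scorer; whatever the cold-path worker produces later can inform case handling but never sits inside the latency budget.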