HeadlinesBriefing.com

PA Bench Tests AI Agents on Multi-App Workflows

Hacker News
Vibrant Labs has introduced PA Bench, a new benchmark designed to evaluate how well AI agents handle real-world personal assistant tasks across multiple web applications. The benchmark tests frontier models on their ability to complete multi-step workflows involving email and calendar applications, addressing a critical gap in existing AI agent evaluations. Current benchmarks typically focus on isolated tasks like adding items to shopping carts or creating single calendar events.

Traditional AI agent benchmarks miss the mark because they don't reflect how humans actually use personal assistants in practice. Real-world tasks require agents to understand context, switch between applications, reason over distributed information, and take coordinated actions to achieve meaningful goals. PA Bench tackles this by creating high-fidelity simulations of Gmail and Calendar that agents must navigate to complete tasks like finding flight confirmations and scheduling meetings.

The benchmark tested major models including Claude Opus 4.6, Gemini 3 Pro, Gemini 3 Flash, and OpenAI Computer Use. Results showed significant performance differences, with Claude Opus achieving a 68.8% success rate compared to OpenAI's 12.5%. The benchmark's simulation-based approach enables reproducible testing by providing deterministic environments where success can be programmatically verified.
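The verification approach described above can be illustrated with a minimal sketch: a deterministic in-memory simulation of a calendar and inbox, plus a checker that programmatically confirms the agent produced the required end state. All class and function names here are hypothetical, invented for illustration; they are not the actual PA Bench API.

```python
# Hypothetical sketch of simulation-based, programmatically verified tasks,
# in the spirit of PA Bench as described. Not the real PA Bench code.
from dataclasses import dataclass, field


@dataclass
class SimCalendar:
    """Deterministic in-memory stand-in for a calendar app."""
    events: list = field(default_factory=list)

    def create_event(self, title: str, start: str) -> None:
        self.events.append({"title": title, "start": start})


@dataclass
class SimInbox:
    """Deterministic in-memory stand-in for an email app."""
    messages: list = field(default_factory=list)


def verify_meeting_scheduled(cal: SimCalendar, title: str, start: str) -> bool:
    """Programmatic success check: the task passes only if the expected
    event exists in the simulated calendar's final state."""
    return any(e["title"] == title and e["start"] == start for e in cal.events)


# Example run: a trivially scripted "agent" reads a confirmation email in one
# app and creates the corresponding event in another, mimicking the kind of
# cross-application task the benchmark evaluates.
inbox = SimInbox(messages=[
    {"subject": "Flight AA123 confirmed", "body": "Lands 2025-06-01 14:00"},
])
cal = SimCalendar()
cal.create_event("Airport pickup", "2025-06-01 14:00")  # the agent's action

assert verify_meeting_scheduled(cal, "Airport pickup", "2025-06-01 14:00")
```

Because the environment state is fixed and fully observable, every run of the same task is reproducible and success is a deterministic check rather than a human judgment, which is the property the article attributes to PA Bench's design.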