The UI interactions that still break computer-use agents
97 canonical component types. 2,910 verified tasks. Four observation spaces. One benchmark to diagnose where computer-use agents fail on modern web UIs.
Screenshot + accessibility tree text
Same task. Different interface. Different result.
Why this benchmark
Long-horizon benchmarks hide root causes
When an agent fails a multi-step workflow, which step broke? ComponentBench isolates the answer at the component level.
Measured against human trajectories
Every task ships with cold and warm human recordings. Agents are scored against real human time and step counts, not just pass/fail.
One brittle interaction sinks a workflow
Five steps at 80% per-step reliability compound to only about 33% end-to-end success (0.8^5 ≈ 0.33). ComponentBench measures the individual interactions that actually matter.
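To make that compounding concrete, here is a minimal back-of-the-envelope sketch in plain Python. It is purely illustrative and not part of the ComponentBench harness.

```python
# Per-step reliability compounds multiplicatively across a sequential workflow,
# so modest per-step error rates dominate end-to-end success.
def end_to_end_success(per_step_reliability: float, steps: int) -> float:
    """Probability that every step in a sequential workflow succeeds."""
    return per_step_reliability ** steps

print(end_to_end_success(0.80, 5))  # ~0.328 -> roughly 33% of workflows finish
print(end_to_end_success(0.95, 5))  # ~0.774
print(end_to_end_success(0.99, 5))  # ~0.951
```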
14 Interaction Families
From buttons to rich text editors — every component type a computer-use agent might encounter
How Models Compare
Pass rates across observation spaces on ComponentBench-Full
| Model | Browser-Use | AX-tree | SoM | Pixel |
|---|---|---|---|---|
| Gemini 3 Flash | 95.2% | 89.6% | 87.1% | 85.4% |
| GPT-5.4 | 90.4% | 81.5% | 77.0% | 83.8% |
| Gemini 3.1 Flash-Lite | 87.4% | 77.7% | 73.5% | 63.3% |
| GPT-5 mini | 87.0% | 83.1% | 78.5% | 49.0% |
| GPT-5.4 mini | 85.8% | 79.1% | 74.7% | 77.1% |
| Qwen3-VL-235B | — | 77.0% | 54.4% | 50.5% |
| UI-TARS-1.5-7B | — | — | — | 12.6% |
How It Works
Ontology
97 canonical types organized into 14 interaction families from WAI-ARIA patterns
Implementation
Each type implemented across Ant Design, MUI, and Mantine as live Next.js pages
Verification
Every task executed twice by a human annotator (cold and warm runs), then cleaned into reference traces
Evaluation
Agents tested under AX-tree, SoM, Pixel, and Browser-Use observation spaces
Diagnosis
Three-layer diagnostic pipeline: task packets → component reports → family analysis
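A minimal sketch of how that roll-up could look, assuming each task packet records its component type, interaction family, observation space, and outcome. The field names, family names, and scoring below are illustrative assumptions, not ComponentBench's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One task packet; field names here are illustrative assumptions."""
    component: str          # e.g. "date-picker" (one of the 97 canonical types)
    family: str             # e.g. "selection" (one of the 14 interaction families)
    observation_space: str  # "ax-tree" | "som" | "pixel" | "browser-use"
    passed: bool


def pass_rate(results: list[TaskResult]) -> float:
    return sum(r.passed for r in results) / len(results) if results else 0.0


def roll_up(results: list[TaskResult]) -> tuple[dict, dict]:
    """Aggregate task packets into per-component and per-family pass rates."""
    by_component: dict[str, list[TaskResult]] = defaultdict(list)
    by_family: dict[str, list[TaskResult]] = defaultdict(list)
    for r in results:
        by_component[r.component].append(r)
        by_family[r.family].append(r)
    component_report = {c: pass_rate(rs) for c, rs in by_component.items()}
    family_analysis = {f: pass_rate(rs) for f, rs in by_family.items()}
    return component_report, family_analysis


results = [
    TaskResult("date-picker", "selection", "pixel", False),
    TaskResult("date-picker", "selection", "ax-tree", True),
    TaskResult("button", "activation", "pixel", True),
]
components, families = roll_up(results)
print(components)  # {'date-picker': 0.5, 'button': 1.0}
print(families)    # {'selection': 0.5, 'activation': 1.0}
```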