The UI interactions that still break computer-use agents

97 canonical component types. 2,910 verified tasks. Four observation spaces. One benchmark to diagnose where computer-use agents fail on modern web UIs.

97 types · 14 families · 2,910 tasks · 912 Core tasks

Example: under the AX-tree observation space (screenshot plus accessibility tree text), Gemini 3 Flash passes 89.6% of tasks, GPT-5.4 81.5%, and Gemini 3.1 Flash-Lite 77.7%.
Same task. Different interface. Different result.

Why this benchmark

Long-horizon benchmarks hide root causes

When an agent fails a multi-step workflow, which step broke? ComponentBench isolates the answer at the component level.

Measured against human trajectories

Every task ships with cold and warm human recordings. Agents are scored against real human time and step counts — not just pass / fail.

One brittle interaction sinks a workflow

Five steps at 80% reliability = 33% end-to-end success. ComponentBench measures the interactions that actually matter.
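
The arithmetic behind that figure, as a quick sketch assuming each step fails independently:

```python
# End-to-end success compounds multiplicatively across independent steps.
per_step_reliability = 0.80
steps = 5
print(f"{per_step_reliability ** steps:.0%}")  # -> 33%
```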

>30pp
pass rate shift within one model
Changing observation space shifts GPT-5 mini from 87% to 49%
3.7×
slower than humans
Even the fastest agent configuration vs. human reference traces
<60%
on spatial manipulation
Sliders, splitters, and drag-and-drop remain unsolved

How Models Compare

Pass rates across observation spaces on ComponentBench-Full

Model                    Browser-Use   AX-tree   SoM      Pixel
Gemini 3 Flash           95.2%         89.6%     87.1%    85.4%
GPT-5.4                  90.4%         81.5%     77.0%    83.8%
Gemini 3.1 Flash-Lite    87.4%         77.7%     73.5%    63.3%
GPT-5 mini               87.0%         83.1%     78.5%    49.0%
GPT-5.4 mini             85.8%         79.1%     74.7%    77.1%
Qwen3-VL-235B            77.0%         54.4%     50.5%    —
UI-TARS-1.5-7B           —             —         —        12.6%

How It Works

1

Ontology

97 canonical component types, organized into 14 interaction families derived from WAI-ARIA patterns

2

Implementation

Each type implemented across Ant Design, MUI, and Mantine as live Next.js pages

3

Verification

Every task executed twice by a human annotator (cold and warm runs), then cleaned into reference traces

4

Evaluation

Agents tested under AX-tree, SoM, Pixel, and Browser-Use observation spaces

5

Diagnosis

Three-layer diagnostic pipeline: task packets → component reports → family analysis (sketched below)
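
A minimal sketch of how those three layers could compose, assuming a Python harness. Every name below (ObservationSpace, TaskResult, component_report, family_report, and the example field values) is illustrative, not ComponentBench's actual API; the point is only the shape of the aggregation from task packets up to families.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from statistics import mean


class ObservationSpace(Enum):
    AX_TREE = "ax-tree"          # screenshot + accessibility tree text
    SOM = "som"                  # set-of-marks annotated screenshot
    PIXEL = "pixel"              # raw screenshot only
    BROWSER_USE = "browser-use"


@dataclass
class TaskResult:
    """Layer 1: one task packet -- a single task run under one observation space."""
    component: str               # canonical component type, e.g. "slider"
    family: str                  # interaction family, e.g. "spatial-manipulation"
    library: str                 # Ant Design, MUI, or Mantine implementation
    space: ObservationSpace
    passed: bool
    agent_seconds: float
    agent_steps: int
    human_seconds: float         # from the cleaned human reference trace
    human_steps: int


def component_report(results: list[TaskResult]) -> dict[str, dict]:
    """Layer 2: aggregate task packets into per-component reports."""
    by_component: dict[str, list[TaskResult]] = defaultdict(list)
    for r in results:
        by_component[r.component].append(r)
    return {
        comp: {
            "pass_rate": mean(r.passed for r in rs),
            "time_ratio": mean(r.agent_seconds / r.human_seconds for r in rs),
            "step_ratio": mean(r.agent_steps / r.human_steps for r in rs),
        }
        for comp, rs in by_component.items()
    }


def family_report(results: list[TaskResult]) -> dict[str, float]:
    """Layer 3: roll task-level outcomes up into per-family pass rates."""
    by_family: dict[str, list[bool]] = defaultdict(list)
    for r in results:
        by_family[r.family].append(r.passed)
    return {fam: mean(passes) for fam, passes in by_family.items()}
```

Because each task packet carries the human reference trace's time and step counts alongside the agent's, the component reports can express the human-relative metrics quoted above (such as time ratios) rather than bare pass rates.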