The UI interactions that still break computer-use agents

97 canonical component types. 2,910 verified tasks. Four observation spaces. One benchmark to diagnose where computer-use agents fail on modern web UIs.

97 types · 14 families · 2,910 tasks · 912 Core tasks

Example: under the AX-tree observation space (screenshot plus accessibility tree text), Gemini 3 Flash passes 89.6% of tasks, GPT-5.4 81.5%, and Gemini 3.1 Flash-Lite 77.7%.
Same task. Different interface. Different result.

Why this benchmark

Long-horizon benchmarks hide root causes

When an agent fails a multi-step workflow, which step broke? ComponentBench isolates the answer at the component level.

Measured against human trajectories

Every task ships with cold and warm human recordings. Agents are scored against real human time and step counts — not just pass / fail.

One brittle interaction sinks a workflow

Five steps at 80% reliability = 33% end-to-end success. ComponentBench measures the interactions that actually matter.
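
The arithmetic behind that figure, as a quick sketch assuming each step fails independently:

```python
# End-to-end success compounds multiplicatively across independent steps.
per_step_reliability = 0.80
steps = 5
print(f"{per_step_reliability ** steps:.0%}")  # -> 33%
```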

>30pp
pass rate shift within one model
Changing observation space shifts GPT-5 mini from 87% to 49%
3.7×
slower than humans
Even the fastest agent configuration vs. human reference traces
<60%
on spatial manipulation
Sliders, splitters, and drag-and-drop remain unsolved

How Models Compare

Pass rates across observation spaces on ComponentBench-Full

Model                    Browser-Use   AX-tree   SoM      Pixel
Gemini 3 Flash           95.2%         89.6%     87.1%    85.4%
GPT-5.4                  90.4%         81.5%     77.0%    83.8%
Gemini 3.1 Flash-Lite    87.4%         77.7%     73.5%    63.3%
GPT-5 mini               87.0%         83.1%     78.5%    49.0%
GPT-5.4 mini             85.8%         79.1%     74.7%    77.1%
Qwen3-VL-235B            77.0%         54.4%     50.5%    —
UI-TARS-1.5-7B           —             —         —        12.6%

How It Works

1

Ontology

97 canonical component types, organized into 14 interaction families derived from WAI-ARIA patterns

2

Implementation

Each type implemented across Ant Design, MUI, and Mantine as live Next.js pages

3

Verification

Every task executed twice by a human annotator (cold and warm runs), then cleaned into reference traces

4

Evaluation

Agents tested under AX-tree, SoM, Pixel, and Browser-Use observation spaces

5

Diagnosis

Three-layer diagnostic pipeline: task packets → component reports → family analysis (sketched below)
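
A minimal sketch of how those three layers could compose, assuming a Python harness. Every name below (ObservationSpace, TaskResult, component_report, family_report, and the example field values) is illustrative, not ComponentBench's actual API; the point is only the shape of the aggregation from task packets up to families.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from statistics import mean


class ObservationSpace(Enum):
    AX_TREE = "ax-tree"          # screenshot + accessibility tree text
    SOM = "som"                  # set-of-marks annotated screenshot
    PIXEL = "pixel"              # raw screenshot only
    BROWSER_USE = "browser-use"


@dataclass
class TaskResult:
    """Layer 1: one task packet -- a single task run under one observation space."""
    component: str               # canonical component type, e.g. "slider"
    family: str                  # interaction family, e.g. "spatial-manipulation"
    library: str                 # Ant Design, MUI, or Mantine implementation
    space: ObservationSpace
    passed: bool
    agent_seconds: float
    agent_steps: int
    human_seconds: float         # from the cleaned human reference trace
    human_steps: int


def component_report(results: list[TaskResult]) -> dict[str, dict]:
    """Layer 2: aggregate task packets into per-component reports."""
    by_component: dict[str, list[TaskResult]] = defaultdict(list)
    for r in results:
        by_component[r.component].append(r)
    return {
        comp: {
            "pass_rate": mean(r.passed for r in rs),
            "time_ratio": mean(r.agent_seconds / r.human_seconds for r in rs),
            "step_ratio": mean(r.agent_steps / r.human_steps for r in rs),
        }
        for comp, rs in by_component.items()
    }


def family_report(results: list[TaskResult]) -> dict[str, float]:
    """Layer 3: roll task-level outcomes up into per-family pass rates."""
    by_family: dict[str, list[bool]] = defaultdict(list)
    for r in results:
        by_family[r.family].append(r.passed)
    return {fam: mean(passes) for fam, passes in by_family.items()}
```

Because each task packet carries the human reference trace's time and step counts alongside the agent's, the component reports can express the human-relative metrics quoted above (such as time ratios) rather than bare pass rates.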