TL;DR
FineState-Bench is a benchmark for evaluating fine-grained, state-conditioned GUI interaction across desktop, web, and mobile platforms. Its results show that current models often fail to ground instructions to the right control and reach the exact target state.
Contribution
The paper introduces a new benchmark, diagnostic pipeline, and visual assistant to evaluate and analyze fine-grained GUI state-setting tasks in vision-language models.
Findings
Exact goal-state success rates are low, with a maximum of 32.8% on Web.
VDA localization hints improve success rates by approximately 15 percentage points.
Current models still struggle with reliable fine-grained state-conditioned interactions.
Abstract
Despite the rapid progress of large vision-language models (LVLMs), fine-grained, state-conditioned GUI interaction remains challenging. Current evaluations offer limited coverage, imprecise target-state definitions, and an overreliance on final-task success, obscuring where and why agents fail. To address this gap, we introduce FineState-Bench, a benchmark that evaluates whether an agent can correctly ground an instruction to the intended UI control and reach the exact target state. FineState-Bench comprises 2,209 instances across desktop, web, and mobile platforms, spanning four interaction families and 23 UI component types, with each instance explicitly specifying an exact target state for fine-grained state setting. We further propose FineState-Metrics, a four-stage diagnostic pipeline with stage-wise success rates: Localization Success Rate (SR@Loc), Interaction…
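To make the idea of stage-wise success rates concrete, here is a minimal sketch of how a localization rate and an exact goal-state rate could be computed from per-instance results. It is not the paper's implementation: the `EpisodeResult` record, its field names, and the choice to count goal-state success only when the control was correctly located are all assumptions made for illustration, and the remaining pipeline stages are not modeled.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    # Hypothetical per-instance record; field names are illustrative,
    # not taken from the paper.
    platform: str         # e.g. "desktop", "web", "mobile"
    located: bool         # agent grounded the instruction to the intended control
    reached_target: bool  # final control state equals the exact target state


def success_rates(results):
    """Return (localization rate, exact goal-state rate) as percentages."""
    if not results:
        raise ValueError("no results to score")
    n = len(results)
    loc = sum(r.located for r in results)
    # Assumption: goal-state success only counts when the correct control
    # was located, so the later stage can never exceed the earlier one.
    goal = sum(r.located and r.reached_target for r in results)
    return 100.0 * loc / n, 100.0 * goal / n


# Toy data, invented purely to exercise the function.
results = [
    EpisodeResult("web", True, True),
    EpisodeResult("web", True, False),
    EpisodeResult("mobile", False, False),
    EpisodeResult("desktop", True, False),
]
sr_loc, sr_goal = success_rates(results)
print(f"SR@Loc = {sr_loc:.1f}%, exact goal-state = {sr_goal:.1f}%")
```

Reporting the two rates side by side shows whether failures come from picking the wrong control or from setting the wrong state on the right one, which is the kind of distinction a final-task success number alone would hide.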
