NaturalGAIA: A Verifiable Benchmark and Hierarchical Framework for Long-Horizon GUI Tasks

Zihan Zheng; Tianle Cui; Taoran Wang; Fengtao Wang; Jiahui Pan; Lewei He; Qianglong Chen

arXiv:2508.01330 · cs.AI · April 21, 2026

TL;DR

NaturalGAIA introduces a verifiable evaluation dataset for GUI interaction, together with a hierarchical framework that improves the success rate and efficiency of LLM-driven agents on long-horizon GUI tasks.

Contribution

The paper presents NaturalGAIA, a verifiable evaluation dataset grounded in real-world human GUI interaction intents, and LightManus-Jarvis, a hierarchical framework in which LightManus handles planning and context evolution while Jarvis handles execution.

Findings

01. Achieved a Weighted Pathway Success Rate of 45.6%, compared with 21.1% for the state-of-the-art baseline.

02. Reduced token consumption by 75% and execution time by 76%.

03. Validated the macro-planning and micro-execution paradigm for complex tasks.
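To put the headline numbers in proportion, the short calculation below derives the absolute and relative gain in success rate and the speed-up factors implied by the reported reductions. It is a sketch that assumes the 75% and 76% reductions are measured against the same baseline, which this summary does not state explicitly; the variable names are illustrative.

```python
# Figures quoted in the findings above.
ours, baseline = 45.6, 21.1          # Weighted Pathway Success Rate, in percent
token_cut, time_cut = 0.75, 0.76     # reported fractional reductions

# Absolute gain in percentage points and relative gain in percent.
abs_gain = round(ours - baseline, 1)
rel_gain = round((ours - baseline) / baseline * 100, 1)

# A reduction by fraction r leaves (1 - r) of the original cost,
# i.e. a 1 / (1 - r) improvement factor (assumes a shared baseline).
token_factor = round(1 / (1 - token_cut), 2)
time_factor = round(1 / (1 - time_cut), 2)

print(abs_gain, rel_gain, token_factor, time_factor)
```

Under that assumption, the success rate slightly more than doubles while tokens and wall-clock time each drop to roughly a quarter of the baseline cost.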

Abstract

Despite significant advances in LLM-driven GUI agents, the field remains constrained by the challenge of reconciling high-fidelity realism with verifiable evaluation accuracy. To address this, we introduce NaturalGAIA, a verifiable evaluation dataset grounded in real-world human GUI interaction intents. By decoupling logical causal pathways from linguistic narratives, it rigorously simulates natural human intent, characterized by cognitive non-linearity and contextual dependencies. Furthermore, we propose LightManus-Jarvis, a hierarchical collaborative framework where LightManus manages dynamic topological planning and context evolution, while Jarvis ensures execution precision via hybrid visual-structural perception. Experiments demonstrate that our approach achieves a Weighted Pathway Success Rate of 45.6%, significantly outperforming the state-of-the-art baseline (21.1%), while…
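The division of labor described in the abstract (a planner that orders sub-tasks and carries context forward, and an executor that performs each step) can be illustrated with a toy loop. This is a minimal sketch and not the authors' implementation: `SubTask`, `plan_order`, `run_pipeline`, and `fake_executor` are hypothetical names, the plan is a static topological order rather than the dynamic planning the paper describes, and the executor is a stub with no GUI perception.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List


@dataclass
class SubTask:
    """Hypothetical unit of work handed from planner to executor."""
    name: str
    depends_on: List[str] = field(default_factory=list)


def plan_order(subtasks: List[SubTask]) -> List[str]:
    """Planner stand-in: order sub-tasks so that dependencies run first."""
    graph = {t.name: set(t.depends_on) for t in subtasks}
    return list(TopologicalSorter(graph).static_order())


def run_pipeline(
    subtasks: List[SubTask],
    execute: Callable[[SubTask, Dict[str, str]], str],
) -> Dict[str, str]:
    """Run each sub-task in planned order, passing earlier results as context."""
    by_name = {t.name: t for t in subtasks}
    context: Dict[str, str] = {}
    for name in plan_order(subtasks):
        # The executor sees a copy of everything produced so far.
        context[name] = execute(by_name[name], dict(context))
    return context


def fake_executor(task: SubTask, context: Dict[str, str]) -> str:
    """Executor stub: reports which earlier results were available."""
    seen = ",".join(sorted(context)) or "nothing"
    return f"{task.name} done after {seen}"


# Sub-tasks are listed out of order on purpose; the planner reorders them.
tasks = [
    SubTask("share_result", ["find_song", "lookup_artist"]),
    SubTask("lookup_artist", ["find_song"]),
    SubTask("find_song"),
]
result = run_pipeline(tasks, fake_executor)
for name, outcome in result.items():
    print(name, "->", outcome)
```

Keeping the planner and executor behind separate functions means the executor only ever receives one sub-task plus accumulated context, which is the general idea behind separating macro-planning from micro-execution.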

Code & Models

Repositories

KeLes-Coding/NatureGAIA (GitHub)
