TL;DR
NaturalGAIA introduces a verifiable GUI interaction dataset and a hierarchical framework that improves task success rates and efficiency in long-horizon GUI tasks driven by language models.
Contribution
The paper presents NaturalGAIA, a new dataset grounded in real human GUI interactions, and LightManus-Jarvis, a hierarchical framework combining planning and execution for better performance.
Findings
Achieved a Weighted Pathway Success Rate of 45.6%, versus 21.1% for the state-of-the-art baseline.
Reduced token consumption by 75% and execution time by 76%.
Validated the macro-planning and micro-execution paradigm for complex tasks.
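To put the reported figures in perspective, the short sketch below converts them into an absolute gain, a relative factor, and equivalent reduction factors. It uses only the percentages quoted above. The helper names are illustrative and not from the paper, and the token and time reductions are assumed to be measured against the same baseline, which this summary does not state.

```python
# Illustrative arithmetic on the reported figures; helper names are hypothetical.

def absolute_gain(ours_pct: float, baseline_pct: float) -> float:
    """Difference in percentage points between two success rates."""
    return ours_pct - baseline_pct


def relative_factor(ours_pct: float, baseline_pct: float) -> float:
    """How many times larger the first rate is than the second."""
    return ours_pct / baseline_pct


def reduction_factor(reduction_pct: float) -> float:
    """Convert an 'X% reduction' into an 'N times less' factor."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


gain_points = absolute_gain(45.6, 21.1)   # success-rate gain in points
rate_ratio = relative_factor(45.6, 21.1)  # success-rate ratio
token_factor = reduction_factor(75.0)     # tokens: 75% reduction
time_factor = reduction_factor(76.0)      # execution time: 76% reduction

print(f"Success rate gain: {gain_points:.1f} points ({rate_ratio:.2f}x the baseline)")
print(f"Token use: about {token_factor:.1f}x lower")
print(f"Execution time: about {time_factor:.1f}x lower")
```

Read this way, the success rate is a little more than double the baseline, and both cost measures fall to roughly a quarter of their original value.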
Abstract
Despite significant advances in LLM-driven GUI agents, the field remains constrained by the challenge of reconciling high-fidelity realism with verifiable evaluation accuracy. To address this, we introduce NaturalGAIA, a verifiable evaluation dataset grounded in real-world human GUI interaction intents. By decoupling logical causal pathways from linguistic narratives, it rigorously simulates natural human intent, characterized by cognitive non-linearity and contextual dependencies. Furthermore, we propose LightManus-Jarvis, a hierarchical collaborative framework where LightManus manages dynamic topological planning and context evolution, while Jarvis ensures execution precision via hybrid visual-structural perception. Experiments demonstrate that our approach achieves a Weighted Pathway Success Rate of 45.6%, significantly outperforming the state-of-the-art baseline (21.1%), while…
