Learning, Reasoning, Refinement: A Framework for Kahneman's Dual-System Intelligence in GUI Agents
Jinjie Wei, Jiyao Liu, Lihao Liu, Ming Hu, Junzhi Ning, Mingcheng Li, Weijie Yin, Junjun He, Xiao Liang, Chao Feng, Dingkang Yang

TL;DR
This paper introduces CogniGUI, a dual-system framework for GUI agents inspired by Kahneman's Dual Process Theory, which enables adaptive learning and improves performance in complex, real-world GUI interactions.
Contribution
The paper presents CogniGUI, combining hierarchical visual parsing and a relative policy optimization agent, along with ScreenSeek, a new benchmark for evaluating GUI agent adaptability.
Findings
CogniGUI outperforms existing methods on GUI benchmarks.
The dual-system approach enhances learning and decision-making.
ScreenSeek benchmark reveals strengths and limitations of current GUI agents.
Abstract
Graphical User Interface (GUI) agents have made significant progress in automating digital tasks by using computer vision and language models. Nevertheless, existing agent systems have notable limitations. First, they depend predominantly on trial-and-error decision-making rather than progressive reasoning, and so lack the capability to learn and adapt from interactive encounters. Second, these systems are assessed with overly simplistic single-step accuracy metrics, which do not adequately reflect the intricate nature of real-world GUI interactions. In this paper, we present CogniGUI, a cognitive framework developed to overcome these limitations by enabling adaptive, human-like learning for GUI automation. Inspired by Kahneman's Dual Process Theory, our approach combines two main components: (1) an omni parser engine that conducts…
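To illustrate the abstract's point that single-step accuracy can overstate how well an agent handles multi-step GUI tasks, here is a minimal sketch. It is not the paper's evaluation protocol or the ScreenSeek metric; the function names and the episode data are hypothetical, and a task is assumed to succeed only if every step in it is correct.

```python
from typing import List


def step_accuracy(episodes: List[List[bool]]) -> float:
    """Fraction of individual steps that were correct, pooled over all episodes."""
    steps = [ok for episode in episodes for ok in episode]
    return sum(steps) / len(steps)


def task_success_rate(episodes: List[List[bool]]) -> float:
    """Fraction of episodes in which every step was correct (assumed success rule)."""
    return sum(all(episode) for episode in episodes) / len(episodes)


# Hypothetical data: 4 tasks of 5 steps each; True means the step was correct.
episodes = [
    [True, True, True, True, True],
    [True, True, False, True, True],
    [True, True, True, True, False],
    [True, False, True, True, True],
]

print(step_accuracy(episodes))      # 17 of 20 steps correct -> 0.85
print(task_success_rate(episodes))  # 1 of 4 tasks fully correct -> 0.25
```

In this toy data a single wrong step per task is enough to fail three of the four tasks, so 85% step accuracy coexists with 25% task success.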
