Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding
Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang

TL;DR
This paper introduces a Tree-of-Lens agent that constructs a hierarchical layout tree from GUI screenshots and user points, enabling more accurate and layout-aware screen reading and understanding across various platforms.
Contribution
The paper presents the Tree-of-Lens grounding mechanism, which builds a Hierarchical Layout Tree from a GUI screenshot and a user-indicated point, as a layout-aware alternative to rigid accessibility screen reading tools.
Findings
Outperforms baseline models on the ScreenPR benchmark across mobile, web, and OS GUIs.
Effectively interprets layout and spatial relationships in GUI understanding.
Assists in mobile GUI navigation by identifying incorrect actions.
Abstract
Graphical User Interfaces (GUIs) are central to our interaction with digital devices and growing efforts have been made to build models for various GUI understanding tasks. However, these efforts largely overlook an important GUI-referring task: screen reading based on user-indicated points, which we name the Screen Point-and-Read (ScreenPR) task. Currently, this task is predominantly handled by rigid accessible screen reading tools, in great need of new models driven by advancements in Multimodal Large Language Models (MLLMs). In this paper, we propose a Tree-of-Lens (ToL) agent, utilizing a novel ToL grounding mechanism, to address the ScreenPR task. Based on the input point coordinate and the corresponding GUI screenshot, our ToL agent constructs a Hierarchical Layout Tree. Based on the tree, our ToL agent not only comprehends the content of the indicated area but also articulates…
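As an illustration of the idea of a layout tree queried by a point, the sketch below walks a toy tree from the full screen down to the smallest region containing the indicated coordinate. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `LayoutNode` and `path_to_point`, the axis-aligned pixel boxes, and the toy screen are all hypothetical, and the tree is written by hand here rather than constructed from a screenshot.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical box format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class LayoutNode:
    """One region of the screen; children are assumed to lie inside it."""
    name: str
    box: Box
    children: List["LayoutNode"] = field(default_factory=list)

    def contains(self, x: int, y: int) -> bool:
        left, top, right, bottom = self.box
        return left <= x < right and top <= y < bottom


def path_to_point(root: LayoutNode, x: int, y: int) -> List[LayoutNode]:
    """Return the nodes from the root down to the deepest node containing (x, y).

    An empty list means the point lies outside the root region.
    """
    if not root.contains(x, y):
        return []
    path = [root]
    node = root
    while True:
        # Descend into the first child that contains the point, if any.
        hit = next((c for c in node.children if c.contains(x, y)), None)
        if hit is None:
            return path
        path.append(hit)
        node = hit


# Toy layout, written by hand for illustration only.
screen = LayoutNode("screen", (0, 0, 1080, 1920), [
    LayoutNode("toolbar", (0, 0, 1080, 200), [
        LayoutNode("back_button", (20, 40, 140, 160)),
        LayoutNode("title", (160, 40, 900, 160)),
    ]),
    LayoutNode("content", (0, 200, 1080, 1920)),
])

# Region names from coarse to fine for a point on the back button.
lenses = [node.name for node in path_to_point(screen, 60, 100)]
```

Each entry on the returned path covers a progressively smaller region around the point, which is one way to give a model both the indicated element and its surrounding layout context.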
Taxonomy
Topics: Multimedia Communication and Technology · Interactive and Immersive Displays · Gaze Tracking and Assistive Technology
