Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding

Yue Fan; Lei Ding; Ching-Chen Kuo; Shan Jiang; Yang Zhao; Xinze Guan; Jie Yang; Yi Zhang; Xin Eric Wang

arXiv:2406.19263·cs.CL·October 29, 2024·1 cites

Open Access · 1 Repo · 5 Datasets

TL;DR

This paper introduces a Tree-of-Lens agent that constructs a hierarchical layout tree from GUI screenshots and user points, enabling more accurate and layout-aware screen reading and understanding across various platforms.

Contribution

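The paper presents the Tree-of-Lens (ToL) grounding mechanism, which builds a Hierarchical Layout Tree from a GUI screenshot and a user-indicated point. It offers a layout-aware, MLLM-driven alternative to the rigid accessibility screen reading tools that currently handle the Screen Point-and-Read (ScreenPR) task.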

Findings

01. Outperforms baseline models on the ScreenPR benchmark across mobile, web, and OS GUIs.

02. Effectively interprets layout and spatial relationships in GUI understanding.

03. Assists in mobile GUI navigation by identifying incorrect actions.

Abstract

Graphical User Interfaces (GUIs) are central to our interaction with digital devices and growing efforts have been made to build models for various GUI understanding tasks. However, these efforts largely overlook an important GUI-referring task: screen reading based on user-indicated points, which we name the Screen Point-and-Read (ScreenPR) task. Currently, this task is predominantly handled by rigid accessible screen reading tools, in great need of new models driven by advancements in Multimodal Large Language Models (MLLMs). In this paper, we propose a Tree-of-Lens (ToL) agent, utilizing a novel ToL grounding mechanism, to address the ScreenPR task. Based on the input point coordinate and the corresponding GUI screenshot, our ToL agent constructs a Hierarchical Layout Tree. Based on the tree, our ToL agent not only comprehends the content of the indicated area but also articulates…
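
To make the idea of a layout tree concrete, here is a minimal sketch of how a point could be resolved against a hierarchy of nested screen regions. It is not the authors' implementation: the abstract says the ToL agent constructs the tree from the screenshot and the point, whereas this example assumes a tree is already available. The `LayoutNode` class, the `path_to_point` function, the region labels, and the pixel coordinates are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Bounding box as (left, top, right, bottom) in pixels. Hypothetical convention.
Box = Tuple[int, int, int, int]


@dataclass
class LayoutNode:
    """Hypothetical tree node: one screen region with nested sub-regions."""
    label: str
    box: Box
    children: List["LayoutNode"] = field(default_factory=list)


def contains(box: Box, x: int, y: int) -> bool:
    """True if (x, y) lies inside the box (right and bottom edges exclusive)."""
    left, top, right, bottom = box
    return left <= x < right and top <= y < bottom


def path_to_point(node: LayoutNode, x: int, y: int) -> List[LayoutNode]:
    """Return the chain of regions from `node` down to the deepest region
    that contains (x, y), or an empty list if the point is outside `node`."""
    if not contains(node.box, x, y):
        return []
    for child in node.children:
        sub_path = path_to_point(child, x, y)
        if sub_path:
            return [node] + sub_path
    return [node]


# Illustrative, made-up layout for a 1080x1920 mobile screen.
screen = LayoutNode("screen", (0, 0, 1080, 1920), [
    LayoutNode("toolbar", (0, 0, 1080, 200), [
        LayoutNode("back button", (20, 40, 140, 160)),
        LayoutNode("search field", (160, 40, 1060, 160)),
    ]),
    LayoutNode("content list", (0, 200, 1080, 1920)),
])

path = path_to_point(screen, 60, 100)
print(" > ".join(n.label for n in path))  # screen > toolbar > back button
```

The returned path carries both the pointed element and its enclosing regions, which is the kind of context a layout-aware description needs: the content at the point plus where it sits on the screen.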

Code & Models

Repositories

eric-ai-lab/Screen-Point-and-Read
Official

Datasets

Taxonomy

Topics: Multimedia Communication and Technology · Interactive and Immersive Displays · Gaze Tracking and Assistive Technology