TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents

Kunal Singh; Shreyas Singh; Mukund Khanna

arXiv:2502.08226·cs.CV·February 17, 2025

TL;DR

TRISHUL is a training-free framework that improves generalist large vision-language models' holistic understanding of GUIs. It covers both action grounding and GUI referring, using spatial and semantic enhancements.

Contribution

The paper introduces a training-free approach that uses hierarchical screen parsing and spatially enriched element descriptions to equip generalist LVLMs for comprehensive GUI understanding.

Findings

01. Outperforms existing methods in action grounding on multiple datasets.

02. Surpasses ToL in GUI referring benchmarks.

03. Demonstrates robustness and adaptability across diverse GUI tasks.

Abstract

Recent advancements in Large Vision Language Models (LVLMs) have enabled the development of LVLM-based Graphical User Interface (GUI) agents under various paradigms. Training-based approaches, such as CogAgent and SeeClick, struggle with cross-dataset and cross-platform generalization due to their reliance on dataset-specific training. Generalist LVLMs, such as GPT-4V, employ Set-of-Marks (SoM) for action grounding, but obtaining SoM labels requires metadata like HTML source, which is not consistently available across platforms. Moreover, existing methods often specialize in singular GUI tasks rather than achieving comprehensive GUI understanding. To address these limitations, we introduce TRISHUL, a novel, training-free agentic framework that enhances generalist LVLMs for holistic GUI comprehension. Unlike prior works that focus on either action grounding (mapping instructions to GUI…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: Set-of-Marks (SoM) · Focus