TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents
Kunal Singh, Shreyas Singh, Mukund Khanna

TL;DR
TRISHUL is a training-free framework that significantly improves large vision-language models' ability to understand and interpret GUIs holistically, combining action grounding and GUI referring with spatial and semantic enhancements.
Contribution
It introduces a novel, training-free approach with hierarchical parsing and spatially enriched descriptions to enhance generalist LVLMs for comprehensive GUI understanding.
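As a rough illustration of what hierarchical parsing with spatially enriched descriptions could look like, the sketch below nests detected element boxes by containment and attaches a coarse position label to each. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Box` type, `build_hierarchy`, `coarse_position`, and the 3x3 grid vocabulary are all hypothetical names and choices introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """Hypothetical GUI element: a name plus pixel corners (x1, y1, x2, y2)."""
    name: str
    x1: float
    y1: float
    x2: float
    y2: float
    children: list = field(default_factory=list)

    def area(self):
        return (self.x2 - self.x1) * (self.y2 - self.y1)

    def contains(self, other):
        # True when `other` lies entirely inside this box.
        return (self.x1 <= other.x1 and self.y1 <= other.y1
                and self.x2 >= other.x2 and self.y2 >= other.y2)


def build_hierarchy(boxes):
    """Attach each box to the smallest strictly larger box that contains it.

    Boxes with no container are returned as top-level regions.
    """
    roots = []
    for box in boxes:
        parents = [b for b in boxes
                   if b is not box and b.contains(box) and b.area() > box.area()]
        if parents:
            min(parents, key=Box.area).children.append(box)
        else:
            roots.append(box)
    return roots


def coarse_position(box, width, height):
    """Describe the box centre on a 3x3 grid, e.g. 'top-right' (assumed vocabulary)."""
    cx = (box.x1 + box.x2) / 2
    cy = (box.y1 + box.y2) / 2
    col = ("left", "center", "right")[min(int(3 * cx / width), 2)]
    row = ("top", "middle", "bottom")[min(int(3 * cy / height), 2)]
    return f"{row}-{col}"
```

A text description built from such a tree (region, then child elements, each with its position label) is one plausible way to give a generalist LVLM spatial context without any training.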
Findings
Outperforms existing methods in action grounding on multiple datasets.
Surpasses ToL in GUI referring benchmarks.
Demonstrates robustness and adaptability across diverse GUI tasks.
Abstract
Recent advancements in Large Vision Language Models (LVLMs) have enabled the development of LVLM-based Graphical User Interface (GUI) agents under various paradigms. Training-based approaches, such as CogAgent and SeeClick, struggle with cross-dataset and cross-platform generalization due to their reliance on dataset-specific training. Generalist LVLMs, such as GPT-4V, employ Set-of-Marks (SoM) for action grounding, but obtaining SoM labels requires metadata like HTML source, which is not consistently available across platforms. Moreover, existing methods often specialize in singular GUI tasks rather than achieving comprehensive GUI understanding. To address these limitations, we introduce TRISHUL, a novel, training-free agentic framework that enhances generalist LVLMs for holistic GUI comprehension. Unlike prior works that focus on either action grounding (mapping instructions to GUI…
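The Set-of-Marks style of action grounding mentioned in the abstract can be sketched as follows: each candidate element gets a numeric mark, the model is asked to answer with a mark, and the chosen mark is resolved back to a click point. This is a minimal sketch, assuming boxes are `(x1, y1, x2, y2)` pixel tuples obtained from some upstream source; the function names `assign_marks` and `resolve_mark` are hypothetical and do not come from the paper.

```python
def assign_marks(boxes):
    """Give every candidate box a 1-based numeric mark in reading order.

    Reading order here means top-to-bottom, then left-to-right,
    based on each box's top-left corner (an assumption of this sketch).
    """
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {mark: box for mark, box in enumerate(ordered, start=1)}


def resolve_mark(marks, mark_id):
    """Map the mark chosen by the model to a click point (the box centre)."""
    x1, y1, x2, y2 = marks[mark_id]
    return ((x1 + x2) / 2, (y1 + y2) / 2)
```

The dependency the abstract points out sits in the input to `assign_marks`: if the candidate boxes can only be derived from metadata such as HTML source, the approach does not transfer to platforms where that metadata is unavailable.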
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Self-Organizing Map · Focus
