Trifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion

Longhui Ma; Di Zhao; Siwei Wang; Zhao Lv; Miao Wang

arXiv:2602.06351·cs.AI·February 9, 2026

TL;DR

Trifuse is a novel attention-based GUI grounding framework that explicitly integrates spatial anchors and multimodal cues, achieving strong performance without fine-tuning and reducing dependence on large annotated datasets.

Contribution

The paper introduces Trifuse, a new multimodal fusion strategy that enhances attention-based GUI grounding by explicitly incorporating spatial anchors, OCR cues, and caption semantics.

Findings

01. Achieves strong performance on four benchmarks without fine-tuning.

02. Incorporating OCR and caption cues improves grounding accuracy.

03. Reduces reliance on large annotated GUI datasets.

Abstract

GUI grounding maps natural language instructions to the correct interface elements, serving as the perception foundation for GUI agents. Existing approaches predominantly rely on fine-tuning multimodal large language models (MLLMs) on large-scale GUI datasets to predict target element coordinates, which is data-intensive and generalizes poorly to unseen interfaces. Recent attention-based alternatives exploit localization signals in MLLM attention mechanisms without task-specific fine-tuning, but suffer from low reliability due to the lack of explicit and complementary spatial anchors in GUI images. To address this limitation, we propose Trifuse, an attention-based grounding framework that explicitly integrates complementary spatial anchors. Trifuse integrates attention, OCR-derived textual cues, and icon-level caption semantics via a Consensus-SinglePeak (CS) fusion strategy that…
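
The general idea of combining several localization signals can be illustrated with a minimal sketch. It is not the Consensus-SinglePeak strategy, whose details are cut off in the abstract above; it substitutes a plain weighted average of three normalized heatmaps and picks the highest-scoring pixel. The function names (`normalize`, `fuse_heatmaps`, `predict_point`), the equal default weights, and the assumption that each cue is already available as a 2D score map of the same shape are all hypothetical.

```python
import numpy as np


def normalize(heatmap):
    """Min-max scale a heatmap to [0, 1]; a constant map becomes all zeros."""
    h = np.asarray(heatmap, dtype=float)
    span = h.max() - h.min()
    if span == 0:
        return np.zeros_like(h)
    return (h - h.min()) / span


def fuse_heatmaps(attention, ocr, caption, weights=(1.0, 1.0, 1.0)):
    """Weighted average of three same-shaped score maps.

    Assumption: simple averaging stands in for the paper's fusion rule,
    which the abstract does not spell out.
    """
    maps = [normalize(m) for m in (attention, ocr, caption)]
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * m for wi, m in zip(w, maps))


def predict_point(fused):
    """Return the (x, y) pixel with the highest fused score."""
    row, col = np.unravel_index(np.argmax(fused), fused.shape)
    return int(col), int(row)


# Toy example: attention alone prefers a distractor at (x=3, y=2),
# while the OCR and caption maps both support the element at (x=1, y=1).
attention = np.zeros((4, 4))
attention[2, 3] = 1.0
attention[1, 1] = 0.8
ocr = np.zeros((4, 4))
ocr[1, 1] = 1.0
caption = np.zeros((4, 4))
caption[1, 1] = 1.0

fused = fuse_heatmaps(attention, ocr, caption)
print(predict_point(normalize(attention)))
print(predict_point(fused))
```

In this toy case the attention map on its own selects the distractor, while the fused map selects the location that all three cues agree on, which is the kind of complementary anchoring the abstract motivates.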

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques