Multi-modal user interface control detection using cross-attention

Milad Moradi; Ke Yan; David Colwell; Matthias Samwald; Rhona Asgari

arXiv:2604.06934·cs.CV·April 9, 2026

TL;DR

This paper presents a multi-modal extension of YOLOv5 that incorporates GPT-generated text descriptions via cross-attention to improve UI control detection in screenshots, especially in ambiguous cases.

Contribution

The paper introduces a multi-modal detection framework that combines visual and textual data, and reports significant performance improvements over baseline models.

Findings

01  Of the three fusion strategies compared, convolutional fusion achieved the best detection performance.

02  The multi-modal approach improved detection of semantically complex UI controls.

03  Gains were significant in edge cases with ambiguous visual features.

Abstract

Detecting user interface (UI) controls from software screenshots is a critical task for automated testing, accessibility, and software analytics, yet it remains challenging due to visual ambiguities, design variability, and the lack of contextual cues in pixel-only approaches. In this paper, we introduce a novel multi-modal extension of YOLOv5 that integrates GPT-generated textual descriptions of UI images into the detection pipeline through cross-attention modules. By aligning visual features with semantic information derived from text embeddings, our model enables more robust and context-aware UI control detection. We evaluate the proposed framework on a large dataset of over 16,000 annotated UI screenshots spanning 23 control classes. Extensive experiments compare three fusion strategies, i.e. element-wise addition, weighted sum, and convolutional fusion, demonstrating consistent…
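
The abstract names cross-attention between visual features and text embeddings, followed by one of three fusion strategies. The sketch below illustrates those ideas in plain Python on tiny lists of vectors. It is not the authors' implementation: the attention direction (visual locations as queries, text tokens as keys and values), the identity projections, the `alpha` parameterization of the weighted sum, and the reading of convolutional fusion as a 1x1 convolution over concatenated channels are all assumptions made for illustration, as are the function names.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(visual, text):
    """Each visual location (query) attends over text tokens (keys/values).

    Assumption: identity projections. A trained model would apply learned
    query, key, and value projections before this step.
    """
    d = len(visual[0])
    attended = []
    for q in visual:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in text])
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, text)) for i in range(d)]
        )
    return attended

def fuse_add(visual, attended):
    # Element-wise addition of visual and text-attended features.
    return [[v + a for v, a in zip(vr, ar)] for vr, ar in zip(visual, attended)]

def fuse_weighted(visual, attended, alpha=0.5):
    # Weighted sum; alpha is a hypothetical mixing coefficient
    # (it could be learned in a real model).
    return [
        [alpha * v + (1 - alpha) * a for v, a in zip(vr, ar)]
        for vr, ar in zip(visual, attended)
    ]

def fuse_conv1x1(visual, attended, weight):
    # Assumed form of convolutional fusion: a 1x1 convolution over the
    # concatenated channels, i.e. a per-location linear map.
    # weight has d rows of length 2d.
    return [[dot(row, vr + ar) for row in weight] for vr, ar in zip(visual, attended)]

# Toy inputs: 2 visual locations and 3 text tokens, feature dimension 2.
visual = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = cross_attention(visual, text)
fused = fuse_conv1x1(visual, attended, [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]])
```

With the weight matrix shown, the 1x1 convolution reduces to element-wise addition, which makes the relationship between the strategies visible: addition and weighted sum are fixed special cases of the more general learned linear map.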
