Multi-modal user interface control detection using cross-attention
Milad Moradi, Ke Yan, David Colwell, Matthias Samwald, Rhona Asgari

TL;DR
This paper presents a multi-modal extension of YOLOv5 that incorporates GPT-generated text descriptions via cross-attention to improve UI control detection in screenshots, especially in ambiguous cases.
Contribution
The paper introduces a multi-modal detection framework that combines visual features with GPT-generated textual descriptions, and reports significant performance improvements over baseline models.
Findings
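Convolutional fusion achieved the best detection performance of the three fusion strategies compared (element-wise addition, weighted sum, and convolutional fusion).
The multi-modal approach improved detection of semantically complex UI controls.
Gains were significant in edge cases where visual features are ambiguous.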
Abstract
Detecting user interface (UI) controls from software screenshots is a critical task for automated testing, accessibility, and software analytics, yet it remains challenging due to visual ambiguities, design variability, and the lack of contextual cues in pixel-only approaches. In this paper, we introduce a novel multi-modal extension of YOLOv5 that integrates GPT-generated textual descriptions of UI images into the detection pipeline through cross-attention modules. By aligning visual features with semantic information derived from text embeddings, our model enables more robust and context-aware UI control detection. We evaluate the proposed framework on a large dataset of over 16,000 annotated UI screenshots spanning 23 control classes. Extensive experiments compare three fusion strategies, i.e. element-wise addition, weighted sum, and convolutional fusion, demonstrating consistent…
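The abstract names cross-attention and three fusion strategies without giving their exact form. The following is a minimal NumPy sketch of one way such a pipeline could look, not the authors' implementation. It assumes the visual feature map is flattened to one row per spatial position, that queries come from visual features and keys/values from text embeddings, and that "convolutional fusion" means a 1x1 convolution over channel-concatenated features (equivalent to a per-position linear map). All function names, shapes, and weights here are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(visual, text, w_q, w_k, w_v):
    """Attend from visual positions to text tokens.

    visual: (N, Dv) flattened feature-map positions (assumed layout)
    text:   (T, Dt) text token embeddings
    Returns an (N, Dv) text-derived context vector per visual position.
    """
    q = visual @ w_q                           # (N, d) queries from vision
    k = text @ w_k                             # (T, d) keys from text
    v = text @ w_v                             # (T, Dv) values from text
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, T) scaled dot products
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v


def fuse_add(visual, attended):
    # Element-wise addition.
    return visual + attended


def fuse_weighted(visual, attended, alpha=0.5):
    # Weighted sum; alpha is a hypothetical mixing coefficient.
    return alpha * visual + (1.0 - alpha) * attended


def fuse_conv(visual, attended, w_conv, b_conv):
    # Assumed form: 1x1 convolution over concatenated channels,
    # written as a per-position linear map.
    stacked = np.concatenate([visual, attended], axis=-1)  # (N, 2*Dv)
    return stacked @ w_conv + b_conv                        # (N, Dv)


# Toy dimensions chosen only for illustration.
rng = np.random.default_rng(0)
H, W, Dv, T, Dt, d = 4, 4, 8, 5, 16, 8
visual = rng.normal(size=(H * W, Dv))
text = rng.normal(size=(T, Dt))
w_q = rng.normal(size=(Dv, d))
w_k = rng.normal(size=(Dt, d))
w_v = rng.normal(size=(Dt, Dv))
w_conv = rng.normal(size=(2 * Dv, Dv))
b_conv = np.zeros(Dv)

attended = cross_attention(visual, text, w_q, w_k, w_v)
fused = {
    "add": fuse_add(visual, attended),
    "weighted": fuse_weighted(visual, attended, alpha=0.7),
    "conv": fuse_conv(visual, attended, w_conv, b_conv),
}
```

In this sketch every strategy returns a tensor with the same shape as the visual features, so the fused output could be passed to a detection head unchanged. Only the convolutional variant has its own weights to learn, which lets it mix channels across the two modalities rather than combining them position by position with fixed coefficients.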
