Proactive Detection of GUI Defects in Multi-Window Scenarios via Multimodal Reasoning
Xinyao Zhang, Rui Wang, Jinhao Cui, Haotian Huang, Wei Xue, Wenhua Hu, Jianwen Xiang, and Rui Hao

TL;DR
This paper introduces an end-to-end multimodal reasoning framework for proactive GUI defect detection in multi-window mobile scenarios, outperforming existing passive, full-screen-focused methods.
Contribution
It presents a novel proactive detection framework utilizing multimodal large language models and a new benchmark for multi-window GUI defect analysis.
Findings
Multi-window scenarios increase layout-related defects by 184%.
The method detects 40 defect-prone apps with 10% false positives.
The method achieves an 87.2% F1 score in widget occlusion detection.
Abstract
Multi-window mobile scenarios, such as split-screen and foldable modes, make GUI display defects more likely by forcing applications to adapt to changing window sizes and dynamic layout reflow. Existing detection techniques are limited in two ways: they are largely passive, analyzing screenshots only after problematic states have been reached, and they are mainly designed for conventional full-screen interfaces, making them less effective in multi-window settings. We propose an end-to-end framework for GUI display defect detection in multi-window mobile scenarios. The framework proactively triggers split-screen, foldable, and window-transition states during app exploration, uses Set-of-Mark (SoM) to align screenshots with widget-level interface elements, and leverages multimodal large language models with chain-of-thought prompting to detect, localize, and explain display defects. We…
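To make the Set-of-Mark step more concrete, here is a minimal Python sketch of how numbered marks might be attached to widget bounding boxes and turned into a step-by-step prompt. It is an illustration only and not the authors' implementation: the `Widget` record, the helper names, the prompt wording, and the bounding-box overlap pre-check are all assumptions introduced here, and no model is called.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Widget:
    """Hypothetical widget record, e.g. parsed from a UI hierarchy dump."""
    name: str
    left: int
    top: int
    right: int
    bottom: int


def assign_marks(widgets):
    """Give each widget a numeric mark (1, 2, ...), in the spirit of SoM."""
    return {mark: widget for mark, widget in enumerate(widgets, start=1)}


def overlap_area(a, b):
    """Intersection area of two axis-aligned boxes (0 if they are disjoint)."""
    width = min(a.right, b.right) - max(a.left, b.left)
    height = min(a.bottom, b.bottom) - max(a.top, b.top)
    return max(width, 0) * max(height, 0)


def overlapping_pairs(marks):
    """Mark pairs whose boxes intersect. These are only candidates:
    overlap alone does not prove a defect (a container holds its children)."""
    return [
        (m1, m2)
        for (m1, w1), (m2, w2) in combinations(marks.items(), 2)
        if overlap_area(w1, w2) > 0
    ]


def build_prompt(marks):
    """Assumed prompt text that refers to widgets by their marks and asks
    for step-by-step reasoning before a verdict."""
    lines = ["The screenshot is annotated with numbered marks:"]
    for mark, w in marks.items():
        lines.append(
            f"[{mark}] {w.name} at ({w.left}, {w.top}, {w.right}, {w.bottom})"
        )
    lines.append(
        "Think step by step: check each marked widget for occlusion, "
        "truncation, or misalignment, then report the affected marks "
        "with a short explanation."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    widgets = [
        Widget("title", 0, 0, 100, 40),
        Widget("button", 80, 20, 160, 60),
        Widget("footer", 0, 200, 160, 240),
    ]
    marks = assign_marks(widgets)
    print(overlapping_pairs(marks))
    print(build_prompt(marks))
```

In a full pipeline the marks would also be drawn onto the screenshot, so that the model's answer can be mapped back from a mark number to a concrete widget.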
