GhostUI: Unveiling Hidden Interactions in Mobile UI
Minkyu Kweon, Seokhyeon Park, Soohyun Lee, You Been Lee, Jeongmin Rhee, Jinwook Seo

TL;DR
GhostUI is a dataset for detecting hidden, gesture-based interactions in mobile apps, intended to help vision language models automate mobile tasks and better understand the user experience.
Contribution
The paper presents GhostUI, a novel dataset with rich annotations that enables vision language models to better detect concealed interactions in mobile UIs, advancing mobile automation research.
Findings
Models fine-tuned on GhostUI outperform baselines in hidden interaction detection.
GhostUI improves post-interaction screen prediction accuracy.
Fine-tuning on GhostUI also improves recognition of implicit gestures in mobile interfaces.
Abstract
Modern mobile applications rely on hidden interactions--gestures without visual cues like long presses and swipes--to provide functionality without cluttering interfaces. While experienced users may discover these interactions through prior use or onboarding tutorials, their implicit nature makes them difficult for most users to uncover. Similarly, mobile agents--systems designed to automate tasks on mobile user interfaces, powered by vision language models (VLMs)--struggle to detect veiled interactions or determine actions for completing tasks. To address this challenge, we present GhostUI, a new dataset designed to enable the detection of hidden interactions in mobile applications. GhostUI provides before-and-after screenshots, simplified view hierarchies, gesture metadata, and task descriptions, allowing VLMs to better recognize concealed gestures and anticipate post-interaction…
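To make the four kinds of annotation named in the abstract concrete, here is a minimal Python sketch of what a single sample might look like. The class names, field names, and the `find_target` helper are all hypothetical and chosen only for illustration; the actual GhostUI schema is not given in this summary and may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Bounds = Tuple[int, int, int, int]  # left, top, right, bottom in pixels


@dataclass
class ViewNode:
    """One node of a simplified view hierarchy (hypothetical shape)."""
    class_name: str
    bounds: Bounds
    text: Optional[str] = None
    children: List["ViewNode"] = field(default_factory=list)


@dataclass
class HiddenInteractionSample:
    """One record: the fields mirror the abstract, the names are assumed."""
    before_screenshot: str          # path to the screenshot before the gesture
    after_screenshot: str           # path to the screenshot after the gesture
    view_hierarchy: ViewNode        # simplified view hierarchy of the "before" screen
    gesture_type: str               # e.g. "long_press" or "swipe"
    gesture_point: Tuple[int, int]  # where the gesture starts, in pixels
    task_description: str           # natural-language description of the task


def find_target(node: ViewNode, point: Tuple[int, int]) -> Optional[ViewNode]:
    """Return the deepest node whose bounds contain the point, or None."""
    x, y = point
    left, top, right, bottom = node.bounds
    if not (left <= x < right and top <= y < bottom):
        return None
    for child in node.children:
        hit = find_target(child, point)
        if hit is not None:
            return hit
    return node


# Illustrative data only; not taken from the dataset.
row = ViewNode("ListItem", (0, 200, 1080, 400), text="Inbox message")
root = ViewNode("Screen", (0, 0, 1080, 1920), children=[row])
sample = HiddenInteractionSample(
    before_screenshot="before.png",
    after_screenshot="after.png",
    view_hierarchy=root,
    gesture_type="long_press",
    gesture_point=(540, 300),
    task_description="Open the context menu for the first message.",
)
target = find_target(sample.view_hierarchy, sample.gesture_point)
```

Pairing the gesture location with the view hierarchy in this way is one plausible route to identifying which UI element a hidden gesture acts on; the paper may use a different mapping.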
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Hand Gesture Recognition Systems
