Multimodal Icon Annotation For Mobile Applications
Xiaoxue Zang, Ying Xu, Jindong Chen

TL;DR
This paper introduces a multi-modal deep learning approach that combines pixel and view hierarchy features to improve icon annotation in mobile UIs, outperforming existing methods.
Contribution
The study presents a novel multi-modal method that leverages both pixel data and view hierarchy information for UI element annotation, along with a new annotated dataset.
Findings
Outperforms baseline object classification models
Effective combination of view hierarchy and pixel features
Provides a new high-quality annotated UI dataset
Abstract
Annotating user interfaces (UIs), which involves localizing and classifying meaningful UI elements on a screen, is a critical step for many mobile applications such as screen readers and voice control of devices. Annotating object icons, such as menu, search, and arrow backward, is especially challenging due to the lack of explicit labels on screens, their similarity to pictures, and their diverse shapes. Existing studies use either view hierarchy or pixel-based methods to tackle the task. Pixel-based approaches are more popular because view hierarchy features on mobile platforms are often incomplete or inaccurate; however, they leave out instructional information in the view hierarchy such as resource-ids or content descriptions. We propose a novel deep learning based multi-modal approach that combines the benefits of both pixel and view hierarchy features as well as leverages the…
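To make the idea of combining pixel and view hierarchy features concrete, here is a minimal sketch of one simple fusion scheme: concatenating a pixel feature vector with a hashed bag-of-words vector built from an element's resource-id and content description. This is an illustrative assumption, not the authors' architecture, which the truncated abstract does not describe. The function names, the 16-dimensional text vector, and the example resource-id are all hypothetical.

```python
import re
import zlib


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    # Break camelCase ("menuButton" -> "menu_Button") before lowercasing.
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", text)
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]


def tokenize_resource_id(resource_id):
    """Tokenize only the name part of an Android-style resource-id."""
    # "com.example.app:id/ic_search_btn" -> "ic_search_btn"
    name = resource_id.rsplit("/", 1)[-1]
    return tokenize(name)


def hash_bag_of_words(tokens, dim=16):
    """Map tokens to a fixed-size count vector with the hashing trick."""
    vec = [0.0] * dim
    for tok in tokens:
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec


def fuse_features(pixel_features, resource_id, content_desc="", text_dim=16):
    """Concatenate pixel features with hashed view hierarchy text features.

    Hypothetical fusion step: a real model would feed the fused vector
    (or learned embeddings of each modality) into a classifier.
    """
    tokens = tokenize_resource_id(resource_id) + tokenize(content_desc)
    return list(pixel_features) + hash_bag_of_words(tokens, text_dim)


# Usage: a 3-value stand-in for an image embedding plus view hierarchy text.
fused = fuse_features([0.1, 0.2, 0.3], "com.example.app:id/ic_search_btn", "Search")
print(len(fused))
```

When the view hierarchy fields are empty, the text part of the vector is all zeros and the fused vector carries only the pixel features. This mirrors the abstract's point that view hierarchy information is often incomplete or inaccurate.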
Taxonomy
Topics: Web Data Mining and Analysis · Gaze Tracking and Assistive Technology · Advanced Image and Video Retrieval Techniques
