Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, Yang Li

TL;DR
Screen2Words is a multimodal learning approach that automatically generates concise language descriptions of mobile UI screens. It combines deep models with a human-annotated dataset of more than 112k summaries across 22k unique UI screens.
Contribution
The paper introduces a novel multimodal learning method for mobile UI summarization, supported by a large-scale annotated dataset and comprehensive model evaluations.
Findings
High-quality summaries generated by the models
Large-scale dataset with over 112k annotations
Effective multimodal learning approach for UI understanding
Abstract
Mobile User Interface Summarization generates succinct language descriptions of mobile screens to convey the important contents and functionalities of the screen, which can be useful for many language-based application scenarios. We present Screen2Words, a novel screen summarization approach that automatically encapsulates essential information of a UI screen into a coherent language phrase. Summarizing mobile screens requires a holistic understanding of the multi-modal data of mobile UIs, including text, images, and structures, as well as UI semantics, which motivates our multi-modal learning approach. We collected and analyzed a large-scale screen summarization dataset annotated by human workers. Our dataset contains more than 112k language summaries across 22k unique UI screens. We then experimented with a set of deep models with different configurations. Our evaluation of these…
Code & Models
- 🤗 google/paligemma-3b-pt-224 · model · 86k dl · ♡ 426
- 🤗 google/paligemma-3b-mix-448 · model · 2.9k dl · ♡ 116
- 🤗 google/paligemma-3b-pt-224-jax · model · 205 dl · ♡ 3
- 🤗 google/paligemma-3b-pt-448-jax · model · 2 dl · ♡ 2
- 🤗 google/paligemma-3b-pt-896-jax · model · ♡ 2
- 🤗 google/paligemma-3b-ft-aokvqa-mc-448-jax · model
- 🤗 google/paligemma-3b-ft-textcaps-224-jax · model
- 🤗 google/paligemma-3b-ft-widgetcap-448-jax · model · ♡ 2
- 🤗 google/paligemma-3b-ft-vqav2-448-jax · model · 1 dl · ♡ 2
- 🤗 google/paligemma-3b-ft-refcoco-seg-448-jax · model · ♡ 1
Taxonomy
Topics: Multimodal Machine Learning Applications · AI in Service Interactions · Topic Modeling
