ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots
Yu-Chung Hsiao, Fedir Zubach, Gilles Baechler, Srinivas Sunkara, Victor Carbune, Jason Lin, Maria Wang, Yun Zhu, Jindong Chen

TL;DR
ScreenQA introduces a large-scale dataset of 86,000 question-answer pairs over mobile app screenshots to improve screen content understanding and automate tasks through vision-based comprehension.
Contribution
The paper presents a new benchmark dataset, ScreenQA, bridging the gap between low-level UI understanding and high-level task comprehension for screen content.
Findings
Effective benchmarking of screen reading comprehension.
Positive transfer observed to web applications.
Dataset enables multiple subtasks for diverse scenarios.
Abstract
We introduce ScreenQA, a novel benchmarking dataset designed to advance screen content understanding through question answering. The existing screen datasets are focused either on low-level structural and component understanding, or on a much higher-level composite task such as navigation and task completion for autonomous agents. ScreenQA attempts to bridge this gap. By annotating 86k question-answer pairs over the RICO dataset, we aim to benchmark the screen reading comprehension capacity, thereby laying the foundation for vision-based automation over screenshots. Our annotations encompass full answers, short answer phrases, and corresponding UI contents with bounding boxes, enabling four subtasks to address various application scenarios. We evaluate the dataset's efficacy using both open-weight and proprietary models in zero-shot, fine-tuned, and transfer learning settings. We…
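To make the annotation layers named in the abstract concrete (full answers, short answer phrases, and UI contents with bounding boxes), here is a minimal sketch of how one record might be represented. It is not the dataset's actual schema: the class names, field names, the pixel-coordinate bounding-box convention, and the example values are all assumptions invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed convention (not from the paper): (left, top, right, bottom) in pixels.
BBox = Tuple[int, int, int, int]


@dataclass
class UIElement:
    """A piece of on-screen UI content that supports an answer (hypothetical)."""
    text: str
    bbox: BBox


@dataclass
class QAAnnotation:
    """One question-answer record over a screenshot (hypothetical layout)."""
    screen_id: str
    question: str
    full_answer: str
    short_answers: List[str] = field(default_factory=list)
    ui_elements: List[UIElement] = field(default_factory=list)


def bbox_in_screen(bbox: BBox, width: int, height: int) -> bool:
    """Return True if the box is non-empty and lies fully inside the screen."""
    left, top, right, bottom = bbox
    return 0 <= left < right <= width and 0 <= top < bottom <= height


# Invented example values, for illustration only.
example = QAAnnotation(
    screen_id="hypothetical-0001",
    question="What is the departure time?",
    full_answer="The departure time is 9:30 AM.",
    short_answers=["9:30 AM"],
    ui_elements=[UIElement(text="9:30 AM", bbox=(120, 400, 260, 440))],
)
```

Keeping the full answer, the short phrases, and the grounded UI elements as separate fields would let each layer be evaluated on its own, which is one plausible way to support several subtasks from a single annotation.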
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Expert finding and Q&A systems
