ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots

Yu-Chung Hsiao; Fedir Zubach; Gilles Baechler; Srinivas Sunkara; Victor Carbune; Jason Lin; Maria Wang; Yun Zhu; Jindong Chen

arXiv:2209.08199 · cs.CL · February 11, 2025 · 5 cites

TL;DR

ScreenQA is a benchmark dataset of 86k question-answer pairs annotated over RICO mobile app screenshots. It is designed to measure screen reading comprehension and to lay the foundation for vision-based automation over screenshots.

Contribution

The paper presents ScreenQA, a new benchmark dataset that bridges the gap between existing screen datasets focused on low-level structural and component understanding and those focused on higher-level composite tasks such as navigation and task completion.

Findings

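01

ScreenQA benchmarks screen reading comprehension in zero-shot, fine-tuned, and transfer learning settings.

02

Positive transfer to web applications is observed.

03

The annotations (full answers, short answer phrases, and UI contents with bounding boxes) enable four subtasks that address various application scenarios.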

Abstract

We introduce ScreenQA, a novel benchmarking dataset designed to advance screen content understanding through question answering. The existing screen datasets are focused either on low-level structural and component understanding, or on a much higher-level composite task such as navigation and task completion for autonomous agents. ScreenQA attempts to bridge this gap. By annotating 86k question-answer pairs over the RICO dataset, we aim to benchmark the screen reading comprehension capacity, thereby laying the foundation for vision-based automation over screenshots. Our annotations encompass full answers, short answer phrases, and corresponding UI contents with bounding boxes, enabling four subtasks to address various application scenarios. We evaluate the dataset's efficacy using both open-weight and proprietary models in zero-shot, fine-tuned, and transfer learning settings. We…

Code & Models

Repositories

google-research-datasets/screen_qa
Official

Datasets

rootsautomation/RICO-ScreenQA
dataset · 242 dl

Videos

ScreenQA: Large-Scale Question-Answer Pairs Over Mobile App Screenshots

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Expert finding and Q&A systems