EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering

Sheng Zhou; Junbin Xiao; Qingyun Li; Yicong Li; Xun Yang; Dan Guo; Meng Wang; Tat-Seng Chua; Angela Yao

arXiv:2502.07411·cs.CV·March 24, 2025

TL;DR

EgoTextVQA is a new benchmark for egocentric scene-text video question answering. It exposes the limitations of current multimodal models and points to precise temporal grounding, multi-frame reasoning, and high-resolution inputs as the main levers for improvement.

Contribution

The paper presents EgoTextVQA, a comprehensive dataset and evaluation framework for egocentric scene-text QA, highlighting the challenges and proposing directions for future research.

Findings

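01. Even the best of the 10 evaluated models (Gemini 1.5 Pro) reaches only around 33% accuracy.

02. Precise temporal grounding improves performance.

03. High-resolution inputs, multi-frame reasoning, and auxiliary scene-text inputs are key to better performance.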

Abstract

We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification and reasoning on scene text in an egocentric and dynamic environment. With EgoTextVQA, we comprehensively evaluate 10 prominent multimodal large language models. Currently, all models struggle, and the best results (Gemini 1.5 Pro) are around 33% accuracy, highlighting the severe deficiency of these techniques in egocentric QA assistance. Our further investigations suggest that precise temporal grounding and multi-frame reasoning, along with high resolution and auxiliary scene-text inputs, are key for better performance. With thorough analyses and heuristic…

Code & Models

Repositories

zhousheng97/egotextvqa (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques

Methods: Attentive Walk-Aggregating Graph Neural Network