Scene-Text Grounding for Text-Based Video Question Answering

Sheng Zhou; Junbin Xiao; Xun Yang; Peipei Song; Dan Guo; Angela Yao; Meng Wang; Tat-Seng Chua

arXiv:2409.14319 · cs.CV · May 20, 2025

TL;DR

This paper introduces a new task called Grounded TextVideoQA that emphasizes interpretability by localizing scene-text regions relevant to questions, and proposes a model and dataset to advance research in this area.

Contribution

It defines Grounded TextVideoQA, proposes the T2S-QA model with contrastive learning, and provides the ViTXT-GQA dataset for evaluation and analysis.

Findings

01 Existing methods have severe limitations in Grounded TextVideoQA.

02 T2S-QA outperforms previous techniques but still lags behind human performance.

03 Scene-text recognition is identified as the major challenge.

Abstract

Existing efforts in text-based video question answering (TextVideoQA) are criticized for their opaque decision-making and heavy reliance on scene-text recognition. In this paper, we propose to study Grounded TextVideoQA by forcing models to answer questions and spatio-temporally localize the relevant scene-text regions, thus decoupling QA from scene-text recognition and promoting research towards interpretable QA. The task has three-fold significance. First, it encourages the use of scene-text evidence rather than other shortcuts for answer prediction. Second, it directly accepts scene-text regions as visual answers, thus circumventing the problem of ineffective answer evaluation by stringent string matching. Third, it isolates the challenges inherent in VideoQA and scene-text recognition. This enables the diagnosis of the root causes of failed predictions, e.g., wrong QA or wrong scene-text…
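
As a rough illustration of the second point in the abstract, the sketch below contrasts strict string matching with accepting a localized region as the answer. The box format `(x1, y1, x2, y2)`, the helper names `box_iou`, `grounded_hit`, and `exact_match`, and the 0.5 IoU threshold are all assumptions made for this example; they are not taken from the paper or its repository.

```python
# Hypothetical helpers for illustration only; not the paper's evaluation code.

def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounded_hit(pred, gt, iou_threshold=0.5):
    """pred and gt are (frame_index, box) pairs.

    Counts as a hit when both refer to the same frame and the boxes
    overlap by at least iou_threshold (an assumed value).
    """
    pred_frame, pred_box = pred
    gt_frame, gt_box = gt
    return pred_frame == gt_frame and box_iou(pred_box, gt_box) >= iou_threshold


def exact_match(pred_text, gt_text):
    """Strict string matching after trimming and lowercasing."""
    return pred_text.strip().lower() == gt_text.strip().lower()


# A single OCR slip ("5" instead of "s") fails string matching,
# while the localized region still overlaps the annotated one.
text_ok = exact_match("STARBUCK5", "Starbucks")
region_ok = grounded_hit((12, (10, 10, 110, 50)), (12, (12, 12, 108, 52)))
print(text_ok, region_ok)
```

In this toy case the model has found the right region but the recognized string is off by one character, which is the kind of failure a region-based answer is meant to separate from a genuinely wrong answer.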

Code & Models

Repositories

zhousheng97/vitxt-gqa (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Video Analysis and Summarization

Methods: Contrastive Learning