Understanding Video Scenes through Text: Insights from Text-based Video Question Answering
Soumya Jahagirdar, Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar

TL;DR
This paper analyzes datasets for text-based video question answering, evaluates models like BERT-QA, and explores domain adaptation challenges in understanding video scenes through textual content.
Contribution
It provides a detailed analysis of NewsVideoQA and M4-ViteVQA datasets, evaluates text-only models, and investigates cross-dataset domain adaptation.
Findings
BERT-QA performs comparably to specialized methods on these datasets.
The datasets may require less visual understanding and multi-frame reasoning than their formulation suggests.
Cross-domain training shows potential but also challenges.
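To make the text-only setting concrete: a model such as BERT-QA sees only a question and a text context, so the OCR tokens read from video frames must first be flattened into a single string. The sketch below illustrates one simple way to do that. It is not the authors' pipeline. The function name `build_text_context`, the frame-index-to-tokens input format, the case-insensitive deduplication, and the `max_tokens` cap are all assumptions made for illustration.

```python
from typing import Dict, List


def build_text_context(frame_ocr: Dict[int, List[str]], max_tokens: int = 384) -> str:
    """Join OCR tokens from sampled frames into one context string.

    frame_ocr maps a frame index to the OCR tokens read from that frame
    (a hypothetical input format). Tokens repeated across frames are kept
    only once, in order of first appearance, compared case-insensitively.
    """
    seen = set()
    context: List[str] = []
    # Visit frames in temporal order so the context follows the video.
    for frame_idx in sorted(frame_ocr):
        for token in frame_ocr[frame_idx]:
            key = token.lower()
            if key in seen:
                continue  # same text already seen in an earlier frame
            seen.add(key)
            context.append(token)
            # Stop once the (assumed) context budget is reached.
            if len(context) >= max_tokens:
                return " ".join(context)
    return " ".join(context)


if __name__ == "__main__":
    frames = {
        0: ["BREAKING", "NEWS"],
        1: ["breaking", "Markets"],
        2: ["Markets", "rally"],
    }
    print(build_text_context(frames))
```

The resulting string can be paired with a question and passed to any extractive QA model. Dropping every repeated token is a simplification: it also removes words that legitimately recur in different on-screen phrases, so a real system might deduplicate at the line or text-block level instead.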
Abstract
Researchers have extensively studied the field of vision and language, discovering that both visual and textual content are crucial for understanding scenes effectively. In particular, comprehending text in videos holds great significance, requiring both scene text understanding and temporal reasoning. This paper focuses on exploring two recently introduced datasets, NewsVideoQA and M4-ViteVQA, which aim to address video question answering based on textual content. The NewsVideoQA dataset contains question-answer pairs related to the text in news videos, while M4-ViteVQA comprises question-answer pairs from diverse categories like vlogging, traveling, and shopping. We provide an analysis of the formulation of these datasets on various levels, exploring the degree of visual understanding and multi-frame comprehension required for answering the questions. Additionally, the study includes…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
