Reading Between the Lanes: Text VideoQA on the Road
George Tom, Minesh Mathew, Sergi Garcia, Dimosthenis Karatzas, C.V. Jawahar

TL;DR
This paper introduces RoadTextVQA, a new dataset of driving videos with questions about road signs and text, aiming to improve video question answering for driver assistance systems.
Contribution
The paper presents RoadTextVQA, a novel dataset for VideoQA focused on road sign recognition in driving videos, facilitating research in in-vehicle support and multimodal reasoning.
Findings
State-of-the-art models perform poorly on the dataset
The dataset contains 3,222 videos and 10,500 questions
These results highlight the need for improved VideoQA methods for driving scenarios
Abstract
Text and signs around roads provide crucial information for drivers, vital for safe navigation and situational awareness. Scene text recognition in motion is a challenging problem, while textual cues typically appear for a short time span, and early detection at a distance is necessary. Systems that exploit such information to assist the driver should not only extract and incorporate visual and textual cues from the video stream but also reason over time. To address this issue, we introduce RoadTextVQA, a new dataset for the task of video question answering (VideoQA) in the context of driver assistance. RoadTextVQA consists of driving videos collected from multiple countries, annotated with questions, all based on text or road signs present in the driving videos. We assess the performance of state-of-the-art video question answering models on our RoadTextVQA dataset,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
