TL;DR
This paper introduces SoSVQA, a benchmark and training approach that uses the pedagogical structure of children's educational video to improve visual reasoning in vision-language models, reporting significant performance gains from a comparatively small training set.
Contribution
It presents SoSVQA, a structured video question-answering benchmark built from children's educational content, and argues that the pedagogical cues in that content improve the reasoning capabilities of VLMs.
Findings
Training on 10K QA pairs yields substantial performance improvements.
Structured pedagogical content can compensate for a smaller training set.
Models achieve competitive results on multiple reasoning benchmarks.
Abstract
State-of-the-art vision-language models (VLMs) score impressively on video benchmarks yet stumble on basic visual reasoning tasks involving spatial relations, navigation, and object selection that a preschooler solves easily. We hypothesize that the explicit pedagogical structure, specifically the context-question-pause-answer cycles embedded in children's educational video, provides naturally co-aligned reasoning traces: temporally synchronized visual cues, questions, and answers that emerge only from deliberate pedagogical authoring and cannot be practically reconstructed through manual annotation at scale. To test this, we introduce SoSVQA (Structure over Scale Visual Question Answering), a unified benchmark of 10K question-answer pairs automatically extracted from Dora the Explorer (DoraVQA) and Mickey Mouse Clubhouse (ClubHVQA) with precise timestamp alignment, and fine-tune…
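The context-question-pause-answer cycle described in the abstract can be pictured as a timestamped record. The sketch below is illustrative only: the class name, field names, and example values are assumptions made here, not the SoSVQA schema or data, which this summary does not specify.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PedagogicalCycle:
    """One context-question-pause-answer cycle (hypothetical schema)."""
    context_start: float   # seconds: visual context begins
    question_start: float  # seconds: the character poses the question
    pause_start: float     # seconds: pause left for the viewer to respond
    answer_start: float    # seconds: the answer is revealed
    answer_end: float      # seconds: end of the answer segment
    question: str
    answer: str

    def is_ordered(self) -> bool:
        """Check that the phases occur in context, question, pause, answer order."""
        ts = [self.context_start, self.question_start,
              self.pause_start, self.answer_start, self.answer_end]
        return all(a <= b for a, b in zip(ts, ts[1:]))

    def question_clip(self) -> Tuple[float, float]:
        """Span a model could be shown before answering: context through pause,
        stopping where the answer begins so the answer itself is excluded."""
        return (self.context_start, self.answer_start)


# Made-up example values, for illustration only.
example = PedagogicalCycle(
    context_start=12.0, question_start=18.5, pause_start=21.0,
    answer_start=25.0, answer_end=27.5,
    question="Which path leads to the bridge?",
    answer="The left path",
)
```

Under these assumptions, `example.question_clip()` gives the video span that precedes the answer, which is the kind of temporally aligned cue-question-answer triple the abstract refers to.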
