Structure Over Scale: Learning Visual Reasoning from Pedagogical Video

Bishoy Galoaa; Xiangyu Bai; Sarah Ostadabbas

arXiv:2601.23251 · cs.CV · May 11, 2026

TL;DR

This paper introduces a benchmark and a training approach that exploit the structure of pedagogical video to improve visual reasoning in vision-language models (VLMs), reporting substantial gains from only 10K training QA pairs.

Contribution

The paper presents SoSVQA, a structured video question-answering benchmark built from children's educational content, and demonstrates that pedagogical cues improve the reasoning capabilities of VLMs.

Findings

01. Training on 10K QA pairs yields substantial performance improvements.
02. Structured pedagogical content compensates for smaller training data.
03. Models achieve competitive results on multiple reasoning benchmarks.

Abstract

State-of-the-art vision-language models (VLMs) score impressively on video benchmarks yet stumble on basic visual reasoning tasks involving spatial relations, navigation, and object selection that a preschooler solves easily. We hypothesize that the explicit pedagogical structure, specifically the context-question-pause-answer cycles embedded in children's educational video, provides naturally co-aligned reasoning traces: temporally synchronized visual cues, questions, and answers that emerge only from deliberate pedagogical authoring and cannot be practically reconstructed through manual annotation at scale. To test this, we introduce SoSVQA (Structure over Scale Visual Question Answering), a unified benchmark of 10K question-answer pairs automatically extracted from Dora the Explorer (DoraVQA) and Mickey Mouse Clubhouse (ClubHVQA) with precise timestamp alignment, and fine-tune…
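To make the context-question-pause-answer cycle concrete, the sketch below shows one way a single timestamp-aligned QA pair could be represented. It is an illustration only, not the authors' data format: the class name `QACycle`, all field names, the helper methods, and the example question, answer, and timestamps are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QACycle:
    """One context-question-pause-answer cycle (hypothetical schema)."""

    source: str            # subset name, e.g. "DoraVQA" or "ClubHVQA"
    question: str
    answer: str
    context_start: float   # seconds: visual context begins
    question_start: float  # seconds: the question is posed
    pause_start: float     # seconds: the deliberate pause begins
    answer_start: float    # seconds: the answer is revealed
    answer_end: float      # seconds: the cycle ends

    def is_ordered(self) -> bool:
        """Check that the five timestamps never go backwards."""
        stamps = [
            self.context_start,
            self.question_start,
            self.pause_start,
            self.answer_start,
            self.answer_end,
        ]
        return all(a <= b for a, b in zip(stamps, stamps[1:]))

    def input_span(self) -> tuple[float, float]:
        """Clip a model could be shown: everything before the answer.

        Assumption: the answer segment is held out as the target.
        """
        return (self.context_start, self.answer_start)


# Made-up example values, not taken from the benchmark.
example = QACycle(
    source="DoraVQA",
    question="Which path leads to the bridge?",
    answer="The left path",
    context_start=10.0,
    question_start=14.0,
    pause_start=16.0,
    answer_start=18.5,
    answer_end=20.0,
)
```

Keeping the answer segment out of `input_span` reflects one plausible reading of the cycle, in which the pause marks the point where a viewer (or model) is expected to respond before the answer is shown.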

Code & Models

Model (Hugging Face): bishoygaloaa/Qween
