TL;DR
This paper introduces STEER, a structured event-based video reasoning framework with a new dataset and a multi-objective RL training method, achieving competitive performance with fewer frames.
Contribution
It proposes a structured event evidence representation for video, a new dataset (STEER-60K), and a Pareto-based multi-objective RL training approach for video reasoning.
Findings
STEER-4B outperforms 7B baselines on video understanding tasks.
The dataset enables effective evidence-grounded reasoning.
The Pareto-Frontier guided Advantage Balancing improves training stability.
Abstract
Human understanding of video dynamics relies on forming structured representations of entities, actions, and temporal relations before engaging in abstract reasoning. In contrast, existing Video-LLMs apply unstructured chain-of-thought directly to raw visual tokens, where critical temporal cues are buried in verbose narration and event-level structure is largely overlooked. We propose Structured Event Evidence, which represents a video as a compact, time-ordered event schema capturing salient events with key attributes and inter-event temporal dependencies, enabling evidence-grounded reasoning through a constrained verification process. This design promotes concise, interpretable reasoning while reducing the drift typical of unconstrained chain-of-thought. To train models under this paradigm, we introduce STEER-60K, a dataset with a four-stage progressive pipeline: evidence training,…
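To make the idea of a compact, time-ordered event schema more concrete, here is a minimal sketch of how such a structure might look in code. Everything in it is an assumption made for illustration: the class names (`Event`, `TemporalRelation`, `EventEvidence`), the field names, and the relation label `"before"` are hypothetical and are not taken from the paper. The `is_consistent` check is only a simple timestamp sanity test, not the paper's constrained verification process.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    """One salient event with a few key attributes (hypothetical fields)."""
    event_id: str
    start_s: float               # start time in seconds
    end_s: float                 # end time in seconds
    entities: Tuple[str, ...]    # entities involved in the event
    action: str                  # short action description


@dataclass(frozen=True)
class TemporalRelation:
    """A dependency between two events, e.g. e1 'before' e2 (hypothetical)."""
    source_id: str
    relation: str
    target_id: str


@dataclass
class EventEvidence:
    """A video represented as events plus inter-event temporal relations."""
    events: List[Event] = field(default_factory=list)
    relations: List[TemporalRelation] = field(default_factory=list)

    def ordered_events(self) -> List[Event]:
        # Time-ordered view of the events, sorted by start then end time.
        return sorted(self.events, key=lambda e: (e.start_s, e.end_s))

    def is_consistent(self) -> bool:
        # Every relation must reference known events, and every 'before'
        # relation must agree with the recorded timestamps.
        by_id = {e.event_id: e for e in self.events}
        for r in self.relations:
            if r.source_id not in by_id or r.target_id not in by_id:
                return False
            if r.relation == "before":
                if by_id[r.source_id].end_s > by_id[r.target_id].start_s:
                    return False
        return True


# Toy example: two events listed out of order, linked by one relation.
evidence = EventEvidence(
    events=[
        Event("e2", 4.0, 6.5, ("person", "cup"), "drinks from"),
        Event("e1", 0.0, 3.0, ("person", "cup"), "picks up"),
    ],
    relations=[TemporalRelation("e1", "before", "e2")],
)
ordered_ids = [e.event_id for e in evidence.ordered_events()]
```

A representation along these lines keeps timing and ordering explicit, so an answer can be checked against specific events rather than against free-form narration.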
