STEER: Structured Event Evidence for Video Reasoning via Multi-Objective Reinforcement Learning

Zinuo Li; Yongxin Guo; Jun Liu; Jiawei Zhan; Xi Jiang; Chengjie Wang; Mohammed Bennamoun; Farid Boussaid; Feng Zheng; Qiuhong Ke

arXiv:2604.04415 · cs.CL · May 8, 2026

TL;DR

This paper introduces STEER, a structured event-based video reasoning framework with a new dataset and a multi-objective RL training method, achieving competitive performance with fewer frames.

Contribution

It proposes a novel structured event evidence representation, a new dataset STEER-60K, and a Pareto-based multi-objective RL training approach for video reasoning.

Findings

01. STEER-4B outperforms 7B baselines on video understanding tasks.

02. The STEER-60K dataset enables effective evidence-grounded reasoning.

03. Pareto-Frontier guided Advantage Balancing improves training stability.
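This listing does not spell out how Pareto-Frontier guided Advantage Balancing is computed, so the following is only a generic illustration of the underlying idea of a Pareto frontier over multi-objective rewards, not the paper's algorithm. The two objectives (answer accuracy and evidence quality), the function names, and the sample values are all assumptions made for the example.

```python
# Generic Pareto-frontier filter over per-rollout reward vectors.
# NOT the paper's method: objectives and numbers below are hypothetical.

def dominates(a, b):
    """True if reward vector a is at least as good as b on every
    objective and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(rewards):
    """Return indices of reward vectors that no other vector dominates."""
    return [
        i
        for i, r in enumerate(rewards)
        if not any(dominates(o, r) for j, o in enumerate(rewards) if j != i)
    ]


# Hypothetical (answer_accuracy, evidence_quality) scores for four rollouts.
rollouts = [(0.9, 0.2), (0.6, 0.6), (0.5, 0.5), (0.1, 0.95)]
front = pareto_front(rollouts)
```

Here the third rollout is dominated by the second (worse on both objectives), while the other three each represent a different trade-off and stay on the frontier. A multi-objective RL method could use such a frontier to decide which rollouts deserve credit, but how STEER does so is not stated here.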

Abstract

Human understanding of video dynamics relies on forming structured representations of entities, actions, and temporal relations before engaging in abstract reasoning. In contrast, existing Video-LLMs apply unstructured chain-of-thought directly to raw visual tokens, where critical temporal cues are buried in verbose narration and event-level structure is largely overlooked. We propose Structured Event Evidence, which represents a video as a compact, time-ordered event schema capturing salient events with key attributes and inter-event temporal dependencies, enabling evidence-grounded reasoning through a constrained verification process. This design promotes concise, interpretable reasoning while reducing the drift typical of unconstrained chain-of-thought. To train models under this paradigm, we introduce STEER-60K, a dataset with a four-stage progressive pipeline: evidence training,…
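To make the idea of a compact, time-ordered event schema concrete, here is a minimal sketch of one way such a structure could be represented. The field names (`event_id`, `start_s`, `end_s`, `action`, `entities`, `after`) and the helper functions are hypothetical; the abstract does not give the actual schema format used by STEER.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One salient event. All field names are illustrative assumptions."""
    event_id: str
    start_s: float                    # start time in seconds
    end_s: float                      # end time in seconds
    action: str                       # key attribute: what happens
    entities: tuple[str, ...] = ()    # key attribute: who or what is involved
    after: tuple[str, ...] = ()       # ids of events that must finish first


def order_events(events):
    """Return events sorted into a time-ordered schema."""
    return sorted(events, key=lambda e: (e.start_s, e.end_s))


def violated_dependencies(events):
    """List (prior_id, event_id) pairs whose temporal dependency fails,
    i.e. the prior event is missing or ends after the dependent one starts."""
    by_id = {e.event_id: e for e in events}
    bad = []
    for e in events:
        for dep in e.after:
            prior = by_id.get(dep)
            if prior is None or prior.end_s > e.start_s:
                bad.append((dep, e.event_id))
    return bad


# Hypothetical two-event video, given out of order on purpose.
events = [
    Event("e2", 4.0, 6.5, "pours water", ("person", "kettle", "cup"), after=("e1",)),
    Event("e1", 0.0, 3.5, "picks up kettle", ("person", "kettle")),
]
schema = order_events(events)
problems = violated_dependencies(schema)
```

A verification step in this style checks a candidate answer against explicit timestamps and dependencies rather than against free-form narration, which is the kind of constraint the abstract contrasts with unstructured chain-of-thought.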
