Reinforcing Consistency in Video MLLMs with Structured Rewards
Yihao Quan, Zeru Shi, Jinman Zhao, Ruixiang Tang

TL;DR
This paper introduces a structured reward framework for training multimodal large language models (MLLMs) on video understanding, and reports better factual accuracy and temporal consistency than standard sentence-level rewards.
Contribution
It proposes a novel structured reward approach that decomposes video understanding into factual and temporal units, enhancing model faithfulness and grounding accuracy.
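To make the contrast with sentence-level rewards concrete, here is a minimal sketch of a unit-level reward. It is an illustration only and not the paper's formulation: the `Unit` type, the "factual"/"temporal" labels, the `verified` flag, and the equal default weights are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """One decomposed unit of a model output (hypothetical structure)."""
    kind: str       # assumed labels: "factual" or "temporal"
    text: str
    verified: bool  # assumed to come from some external checker


def sentence_level_reward(answer_correct: bool) -> float:
    # A single scalar for the whole output: 1 if judged correct, else 0.
    return 1.0 if answer_correct else 0.0


def structured_reward(units, factual_weight=0.5, temporal_weight=0.5) -> float:
    """Weighted mean of the per-kind fraction of verified units.

    Kinds with no units are skipped and the remaining weights renormalized.
    """
    def fraction(kind):
        group = [u for u in units if u.kind == kind]
        if not group:
            return None
        return sum(u.verified for u in group) / len(group)

    parts = [(factual_weight, fraction("factual")),
             (temporal_weight, fraction("temporal"))]
    present = [(w, s) for w, s in parts if s is not None]
    total_weight = sum(w for w, _ in present)
    if total_weight == 0:
        return 0.0
    return sum(w * s for w, s in present) / total_weight


units = [
    Unit("factual", "a cup is on the table", True),
    Unit("factual", "the cup is red", False),
    Unit("temporal", "the man picks up the cup before leaving", True),
]
reward = structured_reward(units)
```

In this toy case an output judged correct as a whole would receive a sentence-level reward of 1.0, while the unit-level score is lower because one factual unit fails verification.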
Findings
Structured rewards improve factual and temporal grounding.
Models trained with structured rewards outperform the baseline on multiple benchmarks.
The approach enhances faithfulness in video captioning and question answering.
Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, seemingly plausible outputs often suffer from poor visual and temporal grounding: a model may fabricate object existence, assign incorrect attributes, or collapse repeated events while still producing a globally reasonable caption or answer. We study this failure mode through a compositional consistency audit that decomposes a caption into supporting factual and temporal claims, investigating whether a correct high-level prediction is actually backed by valid lower-level evidence. Our top-down audit reveals that even correct root relational claims often lack reliable attribute and existence support. This indicates that standard sentence-level supervision is a weak proxy for faithful video understanding. Furthermore, when turning to reinforcement learning (RL) for better alignment,…
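The top-down audit described in the abstract can be pictured as a small claim tree, in which a root relational claim is checked against its supporting attribute and existence claims. The sketch below is a hedged illustration under assumptions: the `Claim` type, its `holds` flag, and the example caption are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """A node in a hypothetical claim tree for one caption."""
    kind: str    # assumed labels: "relation", "attribute", "existence"
    text: str
    holds: bool  # assumed to be judged against the video by some checker
    supports: list = field(default_factory=list)


def failed_support(claim):
    """Collect supporting claims, at any depth, that do not hold."""
    failed = []
    for child in claim.supports:
        if not child.holds:
            failed.append(child)
        failed.extend(failed_support(child))
    return failed


def audit(root):
    """Report whether a root claim is correct and whether its support is."""
    failed = failed_support(root)
    return {
        "root_correct": root.holds,
        "fully_supported": root.holds and not failed,
        "failed_support": [c.text for c in failed],
    }


# Hypothetical example: the relational claim is right, one attribute is wrong.
root = Claim(
    "relation", "a man hands a red cup to a woman", True,
    supports=[
        Claim("existence", "a cup is present", True),
        Claim("attribute", "the cup is red", False),
    ],
)
report = audit(root)
```

Here the root claim counts as correct, yet the audit flags it as not fully supported, which is the kind of gap the abstract attributes to sentence-level supervision.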
