Watch Before You Answer: Learning from Visually Grounded Post-Training

Yuxuan Zhang; EunJeong Hwang; Huaisong Zhang; Penghui Du; Yiming Jia; Dongfu Jiang; Xuan He; Shenhui Zhang; Ping Nie; Peter West; and Kelsey R. Allen

arXiv:2604.05117·cs.CV·April 8, 2026

TL;DR

This paper finds that 40-60% of the questions in commonly reported long video understanding benchmarks can be answered from text cues alone, and introduces VidGround, a data curation method that post-trains vision-language models only on visually grounded questions.

Contribution

The authors propose VidGround, a simple data curation technique for video understanding post-training: keep only the visually grounded questions and drop those answerable from text cues alone. It is reported to outperform more complex methods.

Findings

01 VidGround improves performance by up to 6.2 points.

02 Using only 69.1% of data yields comparable or better results.

03 Data quality and proper curation are key to advancing VLMs.

Abstract

It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this…
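The curation idea described above (keep only questions that cannot be answered from text alone) can be illustrated with a minimal sketch. The details here are assumptions for illustration, not the authors' pipeline: the page does not say how VidGround decides that a question is visually grounded, so the sketch uses hypothetical text-only "answerers" (any callable that sees the question and options but no video) and drops an example if any of them picks the correct option. The names `QAExample`, `is_text_answerable`, `filter_visually_grounded`, and `longest_option_guesser` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class QAExample:
    """A multiple-choice video QA item (video omitted: the filter never sees it)."""
    question: str
    options: Sequence[str]
    answer_index: int


# Hypothetical interface: a text-only answerer receives the question and the
# options (no video) and returns the index of the option it would choose.
TextOnlyAnswerer = Callable[[str, Sequence[str]], int]


def is_text_answerable(example: QAExample,
                       answerers: Sequence[TextOnlyAnswerer]) -> bool:
    """Return True if any text-only answerer picks the correct option."""
    return any(
        answer(example.question, example.options) == example.answer_index
        for answer in answerers
    )


def filter_visually_grounded(examples: Sequence[QAExample],
                             answerers: Sequence[TextOnlyAnswerer]) -> List[QAExample]:
    """Keep only examples that no text-only answerer gets right."""
    return [ex for ex in examples if not is_text_answerable(ex, answerers)]


def longest_option_guesser(question: str, options: Sequence[str]) -> int:
    """Toy stand-in for a text-only model: always pick the longest option."""
    return max(range(len(options)), key=lambda i: len(options[i]))


examples = [
    # The longest option is correct, so the toy guesser solves it without video.
    QAExample("What color is the sky on a clear day?",
              ["green", "blue like usual", "red"], 1),
    # The toy guesser picks "right", which is wrong, so this item is kept.
    QAExample("Which hand does the person raise first?",
              ["left", "right"], 0),
]

kept = filter_visually_grounded(examples, [longest_option_guesser])
kept_fraction = len(kept) / len(examples)
```

In practice the toy guesser would be replaced by one or more language models queried without the video, and a single lucky guess on a multiple-choice item is weak evidence, so a stricter rule (for example, requiring repeated correct answers under shuffled options) may be a more reasonable criterion. That choice is also an assumption rather than something stated on this page.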
