TL;DR
InternVideo-Next is a two-stage pretraining framework that improves semantic understanding in video models without relying on video-text supervision, and it reports state-of-the-art results.
Contribution
The paper proposes an Encoder-Predictor-Decoder (EPD) architecture, in which the predictor acts as a latent world model, together with a two-stage training scheme that uses semantic priors to improve video representation learning.
Findings
State-of-the-art performance on multiple benchmarks.
Video representations that preserve both semantics and fine detail.
Scalable approach using unlabeled videos.
Abstract
Large-scale video-text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structures but trails text-supervised methods on general tasks. We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles with convergence and its low-level requirement often conflicts with semantics, while latent prediction often encourages shortcut learning. To address these, we disentangle the traditional encoder-decoder design into an Encoder-Predictor-Decoder (EPD) framework, where the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet…
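To make the Encoder-Predictor-Decoder data flow concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the masking ratio, the layer types, or the loss, so the 0.75 mask ratio, the toy linear `encoder` and `decoder`, the mean-pooling `predictor`, and the `masked_mse` loss are all assumptions chosen only to show which component sees which tokens. The two-stage training scheme and semantic priors are not modeled here.

```python
import random

def split_tokens(num_tokens, mask_ratio, rng):
    """Randomly partition token indices into visible and masked sets."""
    indices = list(range(num_tokens))
    rng.shuffle(indices)
    num_masked = int(num_tokens * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked

def encoder(tokens, visible):
    """Toy encoder (assumption): embeds only the visible tokens by scaling by 2."""
    return {i: [2.0 * x for x in tokens[i]] for i in visible}

def predictor(latents, masked, dim):
    """Toy predictor (assumption): predicts each masked latent as the mean visible latent."""
    n = len(latents)
    mean = [sum(z[d] for z in latents.values()) / n for d in range(dim)]
    return {i: list(mean) for i in masked}

def decoder(latents):
    """Toy decoder (assumption): maps latents back to token space by scaling by 0.5."""
    return {i: [0.5 * z for z in latent] for i, latent in latents.items()}

def masked_mse(pred, tokens):
    """Mean squared error computed over masked positions only."""
    total, count = 0.0, 0
    for i, p in pred.items():
        for a, b in zip(p, tokens[i]):
            total += (a - b) ** 2
            count += 1
    return total / count

rng = random.Random(0)
dim, num_tokens = 4, 16
# Stand-in for flattened spatiotemporal video patches.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]

visible, masked = split_tokens(num_tokens, mask_ratio=0.75, rng=rng)
visible_latents = encoder(tokens, visible)                    # encoder sees visible tokens only
predicted_latents = predictor(visible_latents, masked, dim)   # predictor fills in masked latents
reconstruction = decoder(predicted_latents)                   # decoder maps latents to targets
loss = masked_mse(reconstruction, tokens)
```

The point of the separation is that the encoder never touches masked positions, the predictor works purely in latent space, and only the decoder is tied to the reconstruction target.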
