InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision

Chenting Wang; Yuhan Zhu; Yicheng Xu; Jiange Yang; Lang Lin; Ziang Yan; Yali Wang; Yi Wang; Limin Wang

arXiv:2512.01342 · cs.CV · March 31, 2026

TL;DR

InternVideo-Next is a two-stage pretraining framework that improves semantic understanding in video models without video-text supervision and reports state-of-the-art results on multiple benchmarks.

Contribution

The paper proposes an Encoder-Predictor-Decoder (EPD) architecture, in which the predictor acts as a latent world model, together with a two-stage training scheme that uses semantic priors to improve video representation learning.

Findings

01 State-of-the-art performance on multiple benchmarks.

02 Effective semantic and detail preservation in video representations.

03 Scalable approach using unlabeled videos.

Abstract

Large-scale video-text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structures but trails text-supervised methods on general tasks. We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles with convergence and its low-level requirement often conflicts with semantics, while latent prediction often encourages shortcut learning. To address these, we disentangle the traditional encoder-decoder design into an Encoder-Predictor-Decoder (EPD) framework, where the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet…
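The Encoder-Predictor-Decoder split described in the abstract can be illustrated with a toy data-flow sketch. This is not the authors' implementation: the function names (`random_mask`, `epd_forward`), the 0.75 mask ratio, the random linear maps, and the mean-pooled predictor are all assumptions chosen only to show how the three roles separate in masked video modeling. The encoder sees visible tokens only, the predictor fills in latents at masked positions, and the decoder maps those latents back to pixel space.

```python
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Split token indices into sorted (visible, masked) lists."""
    indices = list(range(num_tokens))
    rng.shuffle(indices)
    # Keep at least one visible token so the encoder has input.
    num_masked = min(int(num_tokens * mask_ratio), num_tokens - 1)
    return sorted(indices[num_masked:]), sorted(indices[:num_masked])


def make_weights(out_dim, in_dim, rng):
    """Small random matrix standing in for a learned layer (hypothetical)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def linear(weights, vec):
    """Matrix-vector product: one row of weights per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def epd_forward(tokens, mask_ratio=0.75, latent_dim=4, seed=0):
    """One toy forward pass through encoder, predictor, and decoder."""
    rng = random.Random(seed)
    pixel_dim = len(tokens[0])
    w_enc = make_weights(latent_dim, pixel_dim, rng)
    w_pred = make_weights(latent_dim, latent_dim, rng)
    w_dec = make_weights(pixel_dim, latent_dim, rng)

    visible, masked = random_mask(len(tokens), mask_ratio, rng)

    # Encoder: embeds visible tokens only.
    latents = [linear(w_enc, tokens[i]) for i in visible]

    # Predictor (toy assumption): pool visible latents into one context
    # vector and map it to a latent for every masked position.
    context = [sum(col) / len(latents) for col in zip(*latents)]
    predicted = {i: linear(w_pred, context) for i in masked}

    # Decoder: maps predicted latents back to pixel space.
    recon = {i: linear(w_dec, z) for i, z in predicted.items()}

    # Mean squared reconstruction error over masked tokens.
    errors = [(r - t) ** 2
              for i in masked
              for r, t in zip(recon[i], tokens[i])]
    loss = sum(errors) / len(errors) if errors else 0.0
    return {"visible": visible, "masked": masked,
            "predicted": predicted, "recon": recon, "loss": loss}


if __name__ == "__main__":
    # Eight fake video tokens with three "pixel" values each.
    toy_tokens = [[float(i + j) for j in range(3)] for i in range(8)]
    out = epd_forward(toy_tokens)
    print(len(out["visible"]), len(out["masked"]))
```

Keeping the predictor separate from both encoder and decoder means a loss can be attached in latent space (on `predicted`) as well as in pixel space (on `recon`). The sketch computes only the pixel-space error and makes no claim about which losses the paper uses at each stage.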
