OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence

Feilong Tang; Xiang An; Yunyao Yan; Yin Xie; Bin Qin; Kaicheng Yang; Yifei Shen; Yuanhan Zhang; Chunyuan Li; Shikun Feng; Changrui Chen; Huajie Tan; Ming Hu; Manyuan Zhang; Bo Li; Ziyong Feng; Ziwei Liu; Zongyuan Ge; and Jiankang Deng

arXiv:2602.08683 · cs.CV · February 27, 2026

TL;DR

OneVision-Encoder introduces a video encoding approach that concentrates computation on signal-rich regions, in line with information-theoretic principles, and reports improved efficiency and accuracy across diverse visual understanding tasks.

Contribution

The paper proposes Codec Patchification and a unified 3D reasoning framework, which it reports improve visual processing efficiency and performance over existing models.

Findings

01 Outperforms strong vision backbones on 16 benchmarks

02 Achieves a 4.1% improvement on video understanding tasks

03 Uses substantially fewer visual tokens and pretraining data

Abstract

Hypothesis. Artificial general intelligence is, at its core, a compression problem. Effective compression demands resonance: deep learning scales best when its architecture aligns with the fundamental structure of the data. These are the fundamental principles. Yet, modern vision architectures have strayed from these truths: visual signals are highly redundant, while discriminative information, the surprise, is sparse. Current models process dense pixel grids uniformly, wasting vast compute on static background rather than focusing on the predictive residuals that define motion and meaning. We argue that to solve visual understanding, we must align our architectures with the information-theoretic principles of video, i.e., Codecs.

Method. OneVision-Encoder encodes video by compressing predictive visual structure into semantic meaning. By adopting Codec Patchification, OV-Encoder…
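The idea of spending compute on predictive residuals rather than on static background can be illustrated with a small sketch. This is not the paper's Codec Patchification, which the truncated abstract does not specify. It is a toy stand-in that assumes plain frame differencing as the "residual" (a real codec would use motion-compensated prediction) and keeps only the highest-energy patches. The function names, the `patch` size, and the `keep_ratio` parameter are all hypothetical.

```python
from typing import List, Tuple

Frame = List[List[float]]  # grayscale frame: rows of pixel intensities


def patch_residual_energy(prev: Frame, curr: Frame, patch: int) -> List[Tuple[float, int, int]]:
    """Score every non-overlapping patch of `curr` by how much it changed.

    The score is the mean absolute difference against `prev`. This is an
    assumed, crude proxy for a codec residual, not the authors' method.
    Returns (energy, patch_row, patch_col) tuples in raster order.
    """
    height, width = len(curr), len(curr[0])
    scores = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            total = 0.0
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    total += abs(curr[y][x] - prev[y][x])
            scores.append((total / (patch * patch), top // patch, left // patch))
    return scores


def select_sparse_patches(prev: Frame, curr: Frame, patch: int, keep_ratio: float) -> List[Tuple[int, int]]:
    """Keep the `keep_ratio` fraction of patches with the largest residual energy.

    Returns the (patch_row, patch_col) grid positions of the kept patches,
    sorted in raster order. At least one patch is always kept.
    """
    scores = patch_residual_energy(prev, curr, patch)
    keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(scores, key=lambda s: s[0], reverse=True)
    return sorted((row, col) for _, row, col in ranked[:keep])


if __name__ == "__main__":
    # 8x8 static background; only the bottom-right 4x4 block changes.
    prev = [[0.0] * 8 for _ in range(8)]
    curr = [[1.0 if (y >= 4 and x >= 4) else 0.0 for x in range(8)] for y in range(8)]
    print(select_sparse_patches(prev, curr, patch=4, keep_ratio=0.25))  # [(1, 1)]
```

In this toy setting three of the four patches are static and are dropped, so a downstream encoder would see one token instead of four. How the actual model chooses, orders, and positions its tokens is described in the paper itself, not here.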

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Pose and Action Recognition