DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs

Jiahe Zhao; Rongkun Zheng; Yi Wang; Helin Wang; Hengshuang Zhao

arXiv:2507.10302·cs.CV·July 15, 2025

Open Access

TL;DR

DisCo introduces a novel visual encapsulation method for video MLLMs that enhances semantic distinctness and temporal coherence of visual tokens, leading to improved performance on video understanding benchmarks.

Contribution

DisCo presents a new approach combining a Visual Concept Discriminator and a Temporal Focus Calibrator to improve visual token quality in video MLLMs.

Findings

01. Outperforms previous state-of-the-art methods on multiple benchmarks.

02. Achieves higher token efficiency by reducing semantic indistinctness.

03. Enhances temporal coherence of visual tokens across video frames.

Abstract

In video Multimodal Large Language Models (video MLLMs), visual encapsulation plays a pivotal role in converting video content into representative tokens for LLM input. Linear projectors are widely used for encapsulation, but they introduce semantic indistinctness and temporal incoherence when applied to videos. Resampler structures show promise in tackling these challenges, but an effective solution remains unexplored. Drawing inspiration from resampler structures, we introduce DisCo, a novel visual encapsulation method designed to yield semantically distinct and temporally coherent visual tokens for video MLLMs. DisCo integrates two key components: (1) A Visual Concept Discriminator (VCD) module, which assigns unique semantics to visual tokens by pairing them with discriminative concepts in the video. (2) A Temporal Focus Calibrator…
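To make the contrast between the two encapsulation styles named in the abstract concrete, here is a minimal sketch in plain Python. It is not DisCo's implementation and does not model the VCD or the Temporal Focus Calibrator. It only illustrates the generic idea that a linear projector emits one token per input patch, while a resampler uses a fixed set of queries that attend over all patches. The function names, dimensions, and random "learned" parameters are assumptions made for illustration, and the resampler omits the key/value projections a real one would have.

```python
import math
import random


def linear_projector(features, weight):
    """Project each patch feature independently: one output token per input patch."""
    return [[sum(f * w for f, w in zip(feat, col)) for col in weight]
            for feat in features]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resampler(features, queries):
    """Single-head cross-attention: each query attends over all patch features.

    Simplification (assumption): no learned key/value projections, so each
    output token is a convex combination of the raw patch features.
    """
    dim = len(queries[0])
    tokens = []
    for query in queries:
        scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
                  for feat in features]
        attn = softmax(scores)
        tokens.append([sum(a * feat[d] for a, feat in zip(attn, features))
                       for d in range(dim)])
    return tokens


random.seed(0)
num_patches, dim, num_queries = 16, 8, 4  # hypothetical sizes

# Stand-ins for visual encoder outputs and learned parameters.
features = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
weight = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

projected = linear_projector(features, weight)
resampled = resampler(features, queries)

print(len(projected), len(resampled))  # prints: 16 4
```

In this sketch the projector's token count grows with the number of patches, while the resampler's token count is fixed by the number of queries. That fixed query set is the structural property the abstract refers to when it describes resamplers as a promising starting point.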

Taxonomy

Topics: Computer Graphics and Visualization Techniques · Advanced Vision and Imaging · Advanced Optical Imaging Technologies