MMOne: Representing Multiple Modalities in One Scene

Zhifeng Gu; Bing Wang

arXiv:2507.11129·cs.CV·July 18, 2025

TL;DR

MMOne is a unified framework for representing multiple modalities in one scene. It handles conflicts between modalities and yields a compact representation that extends to additional modalities.

Contribution

The paper proposes a novel multimodal scene representation framework with a modality indicator and decomposition mechanism, addressing property and granularity disparities among modalities.

Findings

01 Improves multimodal scene representation capability

02 Scales to additional modalities

03 Improves the disentanglement of modality-specific and shared information

Abstract

Humans perceive the world through multimodal cues to understand and interact with the environment. Learning a scene representation for multiple modalities enhances comprehension of the physical world. However, modality conflicts, arising from inherent distinctions among different modalities, present two critical challenges: property disparity and granularity disparity. To address these challenges, we propose a general framework, MMOne, to represent multiple modalities in one scene, which can be readily extended to additional modalities. Specifically, a modality modeling module with a novel modality indicator is proposed to capture the unique properties of each modality. Additionally, we design a multimodal decomposition mechanism to separate multi-modal Gaussians into single-modal Gaussians based on modality differences. We address the essential distinctions among modalities by…
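
The decomposition idea in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper or its repository: the `Gaussian` class, the per-modality `indicator` weights in [0, 1], the modality names, and the threshold rule for deciding when modalities disagree are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Gaussian:
    """Toy scene primitive (hypothetical, not the paper's data structure)."""
    position: Tuple[float, float, float]
    # Hypothetical modality indicator: one weight in [0, 1] per modality,
    # describing how strongly this Gaussian contributes to that modality.
    indicator: Dict[str, float]


def decompose(g: Gaussian, threshold: float = 0.5) -> List[Gaussian]:
    """Split a multi-modal Gaussian into single-modal copies.

    Assumed rule: if the indicator weights of the active modalities differ
    by more than `threshold`, the modalities are treated as conflicting and
    each one receives its own copy. Otherwise the Gaussian stays shared.
    """
    active = [m for m, w in g.indicator.items() if w > 0.0]
    if len(active) <= 1:
        return [g]  # already single-modal (or inactive)

    weights = [g.indicator[m] for m in active]
    if max(weights) - min(weights) <= threshold:
        return [g]  # modalities agree closely enough to share one Gaussian

    # One copy per active modality; all other indicator weights are zeroed.
    return [
        Gaussian(
            position=g.position,
            indicator={m: (w if m == keep else 0.0) for m, w in g.indicator.items()},
        )
        for keep in active
    ]


conflicting = Gaussian((0.0, 0.0, 0.0), {"rgb": 0.9, "thermal": 0.2})
shared = Gaussian((1.0, 0.0, 0.0), {"rgb": 0.8, "thermal": 0.7})

split_parts = decompose(conflicting)
kept_parts = decompose(shared)
```

In this sketch, `conflicting` is split into two single-modal Gaussians while `shared` is returned unchanged. The actual criterion MMOne uses to measure modality differences is not given in the truncated abstract above.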

Code & Models

Repositories

neal2020github/mmone (official)

Taxonomy

Topics: Natural Language Processing Techniques