MMOne: Representing Multiple Modalities in One Scene
Zhifeng Gu, Bing Wang

TL;DR
MMOne is a unified framework for representing multiple modalities in one scene. It resolves conflicts between modalities, yields a more compact representation, and can be extended to additional modalities.
Contribution
The paper proposes a multimodal scene representation framework with two components: a modality modeling module built around a modality indicator, and a multimodal decomposition mechanism. Together they address the property and granularity disparities among modalities.
Findings
Improves multimodal scene representation capability
Scalable to additional modalities
Improves disentanglement of modality-specific and shared information
Abstract
Humans perceive the world through multimodal cues to understand and interact with the environment. Learning a scene representation for multiple modalities enhances comprehension of the physical world. However, modality conflicts, arising from inherent distinctions among different modalities, present two critical challenges: property disparity and granularity disparity. To address these challenges, we propose a general framework, MMOne, to represent multiple modalities in one scene, which can be readily extended to additional modalities. Specifically, a modality modeling module with a novel modality indicator is proposed to capture the unique properties of each modality. Additionally, we design a multimodal decomposition mechanism to separate multi-modal Gaussians into single-modal Gaussians based on modality differences. We address the essential distinctions among modalities by…
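The decomposition idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Gaussian` class, the `indicator` dictionary, the scalar `difference` score, the thresholds, and the modality names are all hypothetical placeholders chosen only to show the shape of the idea, namely that a Gaussian shared by several modalities is split into single-modal copies when the modalities disagree strongly enough.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Gaussian:
    """Toy scene primitive (hypothetical; not the paper's data structure)."""
    position: tuple[float, float, float]
    # Assumed per-modality indicator: values near 1.0 mean the Gaussian
    # contributes to that modality, values near 0.0 mean it does not.
    indicator: dict[str, float]


def active_modalities(g: Gaussian, eps: float = 0.5) -> list[str]:
    """Return the modalities whose indicator exceeds eps (assumed cutoff)."""
    return [m for m, w in g.indicator.items() if w > eps]


def decompose(g: Gaussian, difference: float, threshold: float = 0.5) -> list[Gaussian]:
    """Split a multi-modal Gaussian into single-modal Gaussians.

    `difference` stands in for whatever modality-difference measure is used;
    here it is just a number supplied by the caller (an assumption).
    """
    mods = active_modalities(g)
    if len(mods) < 2 or difference <= threshold:
        return [g]  # already single-modal, or modalities agree well enough
    # One copy per active modality, each with a one-hot indicator.
    return [
        replace(g, indicator={m: (1.0 if m == k else 0.0) for m in g.indicator})
        for k in mods
    ]


shared = Gaussian((0.0, 0.0, 0.0), {"modality_a": 1.0, "modality_b": 1.0, "modality_c": 0.0})
parts = decompose(shared, difference=0.9)
```

In this sketch each resulting Gaussian serves exactly one modality, while a Gaussian whose modalities agree is left shared, which is one plausible reading of how decomposition could keep the representation compact.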
Taxonomy
Topics: Natural Language Processing Techniques
