Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs
Ziyang Luo, Nian Liu, Junwei Han

TL;DR
This paper introduces Chain of Modality (CoM), an agentic framework that replaces static multimodal fusion in Omni-MLLMs with dynamic orchestration: it adaptively switches input topologies and splits processing into separate perception and reasoning pathways.
Contribution
The paper proposes CoM, an agentic framework that dynamically switches among parallel, sequential, and interleaved fusion topologies and separates perception from reasoning pathways, with the aim of improving robustness and generalization.
Findings
CoM neutralizes structural biases in multimodal fusion.
CoM achieves robust performance across diverse benchmarks.
CoM operates effectively in training-free and data-efficient settings.
Abstract
Omni-modal Large Language Models (Omni-MLLMs) promise a unified integration of diverse sensory streams. However, recent evaluations reveal a critical performance paradox: unimodal baselines frequently outperform joint multimodal inference. We trace this perceptual fragility to the static fusion topologies universally employed by current models, identifying two structural pathologies: positional bias in sequential inputs and alignment traps in interleaved formats, which systematically distort attention regardless of task semantics. To resolve this functional rigidity, we propose Chain of Modality (CoM), an agentic framework that transitions multimodal fusion from passive concatenation to dynamic orchestration. CoM adaptively orchestrates input topologies, switching among parallel, sequential, and interleaved pathways to neutralize structural biases. Furthermore, CoM bifurcates cognitive…
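The three input topologies named in the abstract can be pictured as different ways of arranging per-modality token streams before they reach the model. The sketch below is illustrative only and is not the paper's implementation: the function names, the string tokens, and in particular the reading of "parallel" as one separate sequence per modality are assumptions made for this example.

```python
from itertools import chain, zip_longest
from typing import Callable, Dict, List

Tokens = List[str]


def sequential(streams: Dict[str, Tokens]) -> Tokens:
    """Concatenate whole modalities one after another, in dict order."""
    return list(chain.from_iterable(streams.values()))


def interleaved(streams: Dict[str, Tokens]) -> Tokens:
    """Alternate tokens across modalities; shorter streams simply run out."""
    missing = object()  # sentinel marking exhausted streams
    out: Tokens = []
    for group in zip_longest(*streams.values(), fillvalue=missing):
        out.extend(tok for tok in group if tok is not missing)
    return out


def parallel(streams: Dict[str, Tokens]) -> List[Tokens]:
    """Assumed reading: keep each modality as its own separate sequence."""
    return [list(tokens) for tokens in streams.values()]


# Hypothetical dispatch table: a controller would pick the key per query.
TOPOLOGIES: Dict[str, Callable] = {
    "sequential": sequential,
    "interleaved": interleaved,
    "parallel": parallel,
}


def build_input(streams: Dict[str, Tokens], topology: str):
    """Arrange the modality streams according to the chosen topology."""
    if topology not in TOPOLOGIES:
        raise ValueError(f"unknown topology: {topology!r}")
    return TOPOLOGIES[topology](streams)


streams = {"audio": ["a1", "a2"], "video": ["v1", "v2", "v3"]}
print(build_input(streams, "sequential"))   # ['a1', 'a2', 'v1', 'v2', 'v3']
print(build_input(streams, "interleaved"))  # ['a1', 'v1', 'a2', 'v2', 'v3']
print(build_input(streams, "parallel"))     # [['a1', 'a2'], ['v1', 'v2', 'v3']]
```

In the sequential layout, one modality always comes first, which is where a positional bias could arise. In the interleaved layout, tokens from different modalities are forced into step-by-step adjacency whether or not they correspond. How CoM actually chooses among the topologies is not specified in the text above.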
