Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs

Ziyang Luo; Nian Liu; Junwei Han

arXiv:2604.14520 · cs.CV · April 17, 2026

TL;DR

This paper introduces Chain of Modality (CoM), a dynamic framework for multimodal fusion in Omni-MLLMs that overcomes static fusion limitations by adaptively orchestrating input topologies and bifurcating cognitive pathways.

Contribution

The paper proposes CoM, an agentic framework that dynamically switches among fusion topologies and separates perception from reasoning pathways, with the aim of improving robustness and generalization.

Findings

01 CoM neutralizes structural biases in multimodal fusion.

02 CoM achieves robust performance across diverse benchmarks.

03 CoM operates effectively in training-free and data-efficient settings.

Abstract

Omni-modal Large Language Models (Omni-MLLMs) promise a unified integration of diverse sensory streams. However, recent evaluations reveal a critical performance paradox: unimodal baselines frequently outperform joint multimodal inference. We trace this perceptual fragility to the static fusion topologies universally employed by current models, identifying two structural pathologies: positional bias in sequential inputs and alignment traps in interleaved formats, which systematically distort attention regardless of task semantics. To resolve this functional rigidity, we propose Chain of Modality (CoM), an agentic framework that transitions multimodal fusion from passive concatenation to dynamic orchestration. CoM adaptively orchestrates input topologies, switching among parallel, sequential, and interleaved pathways to neutralize structural biases. Furthermore, CoM bifurcates cognitive…
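
The three topologies named in the abstract can be pictured with a toy sketch. Everything below is an assumption made for illustration, not the paper's implementation: the function names, the use of string tokens, and in particular the reading of "parallel" as keeping each modality in its own sequence are hypothetical, since the abstract does not define them.

```python
from itertools import zip_longest
from typing import Dict, List

Tokens = List[str]


def sequential(streams: Dict[str, Tokens]) -> Tokens:
    """Concatenate whole modalities one after another, in dict order.

    Later modalities always sit further from the start of the sequence,
    which is the kind of layout where a positional bias could arise.
    """
    return [tok for toks in streams.values() for tok in toks]


def interleaved(streams: Dict[str, Tokens]) -> Tokens:
    """Alternate tokens across modalities step by step.

    Streams that run out early are skipped for the remaining steps.
    """
    out: Tokens = []
    for group in zip_longest(*streams.values()):
        out.extend(tok for tok in group if tok is not None)
    return out


def parallel(streams: Dict[str, Tokens]) -> List[Tokens]:
    """Keep each modality as its own sequence (assumed meaning of 'parallel')."""
    return [list(toks) for toks in streams.values()]


def arrange(streams: Dict[str, Tokens], topology: str):
    """Hypothetical dispatcher: pick an input layout by name."""
    builders = {
        "sequential": sequential,
        "interleaved": interleaved,
        "parallel": parallel,
    }
    return builders[topology](streams)


# Toy input: three video tokens and two audio tokens.
streams = {"video": ["v0", "v1", "v2"], "audio": ["a0", "a1"]}
```

With this toy input, `arrange(streams, "sequential")` places all video tokens before all audio tokens, while `arrange(streams, "interleaved")` alternates them. How CoM decides which layout to use for a given task is not described in the truncated abstract, so the sketch leaves that choice to the caller.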
