Training-Free Multimodal Large Language Model Orchestration
Tianyu Xie, Yuexiao Ma, Yuhang Wu, Wang Chen, Jiayi Ji, Tat-Seng Chua, Xiawu Zheng, Rongrong Ji

TL;DR
This paper introduces a training-free framework for integrating multimodal experts into large language models, enabling efficient, modular, and extensible omni-modal assistants without additional training.
Contribution
The paper proposes a training-free orchestration system that combines off-the-shelf multimodal experts with an LLM controller, a cross-modal memory, and a unified interaction layer.
Findings
Achieves strong performance on diverse multimodal benchmarks.
Maintains low orchestration overhead and modular upgradeability.
Eliminates the need for costly joint training of multimodal systems.
Abstract
Building interactive omni-modal assistants often relies on end-to-end multimodal alignment to fuse heterogeneous modalities, which incurs substantial data and compute costs and limits extensibility. We present Training-Free Large Language Model Orchestration (LLM Orchestration), a training-free orchestration framework that integrates off-the-shelf modality experts into a unified multimodal input-output system without additional gradient-based training for integration. LLM Orchestration comprises three components: (1) an LLM controller that infers user intent and emits explicit control tokens for expert selection and sequencing, enabling protocol-constrained and auditable routing; (2) a text-centric cross-modal memory that compresses multimodal evidence into structured records for lightweight retrieval and reuse, reducing redundant expert invocations across turns; and (3) a unified…
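The first two components described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `<call:name>` token syntax, the expert names, the `CrossModalMemory` class, and the helper functions are all hypothetical, chosen only to show how control tokens could be checked against a fixed registry of experts and how text records could be reused instead of invoking an expert again.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical control-token syntax; the abstract does not specify the real format.
CONTROL_TOKEN = re.compile(r"<call:([a-z_]+)>")

Expert = Callable[[str], str]  # assumption: each expert maps an input reference to text


@dataclass
class CrossModalMemory:
    """Text-centric store keyed by (expert name, input reference)."""
    records: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def get(self, expert: str, input_ref: str) -> Optional[str]:
        return self.records.get((expert, input_ref))

    def put(self, expert: str, input_ref: str, text: str) -> None:
        self.records[(expert, input_ref)] = text


def parse_plan(controller_output: str, registry: Dict[str, Expert]) -> List[str]:
    """Extract the ordered expert sequence and reject unregistered names."""
    plan = CONTROL_TOKEN.findall(controller_output)
    unknown = [name for name in plan if name not in registry]
    if unknown:
        raise ValueError(f"unregistered experts: {unknown}")
    return plan


def run_plan(plan: List[str], input_ref: str, registry: Dict[str, Expert],
             memory: CrossModalMemory, log: List[str]) -> List[str]:
    """Run experts in order, reusing stored text records when available."""
    outputs = []
    for name in plan:
        record = memory.get(name, input_ref)
        if record is None:
            record = registry[name](input_ref)  # expert actually invoked
            memory.put(name, input_ref, record)
            log.append(f"invoke {name}")
        else:
            log.append(f"reuse {name}")  # served from memory, no expert call
        outputs.append(record)
    return outputs


# Stub experts standing in for off-the-shelf models (hypothetical names).
registry: Dict[str, Expert] = {
    "image_caption": lambda ref: f"caption of {ref}",
    "speech_to_text": lambda ref: f"transcript of {ref}",
}
memory = CrossModalMemory()
log: List[str] = []

plan = parse_plan("<call:image_caption><call:speech_to_text>", registry)
first = run_plan(plan, "clip_01", registry, memory, log)
second = run_plan(["image_caption"], "clip_01", registry, memory, log)
```

In this sketch the log doubles as an audit trail: every routing decision is recorded as either an invocation or a reuse, and a controller output naming an expert outside the registry is rejected before anything runs.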
