Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything
Huawei Lin, Yunzhi Shi, Tong Geng, Weijie Zhao, Wei Wang, Ravender Pal Singh

TL;DR
Agent-Omni is a flexible, modular framework that coordinates existing foundation models for robust multimodal reasoning across text, images, audio, and video without retraining.
Contribution
It introduces a master-agent system that enables cross-modal reasoning by coordinating pre-existing models, avoiding costly fine-tuning and supporting diverse modalities.
Findings
Achieves state-of-the-art performance across text, image, audio, video, and omni benchmarks, particularly on tasks requiring complex cross-modal reasoning.
Demonstrates effective cross-modal reasoning without retraining.
Ensures transparency and interpretability in multi-modal tasks.
Abstract
Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical and lacks robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless…
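The coordination loop described in the abstract (interpret intent, delegate to modality-specific agents, integrate outputs) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Subtask`, `plan`, `delegate`, `integrate`, `master_agent`, and the stub agents) is hypothetical, and the stubs stand in for real foundation models, which the sketch does not call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface (an assumption, not from the paper):
# an agent takes (question, input reference) and returns answer text.
Agent = Callable[[str, str], str]


@dataclass
class Subtask:
    modality: str  # e.g. "text", "image", "audio", "video"
    question: str  # what the master agent asks the modality agent
    payload: str   # reference to the input (file path, raw text, ...)


def plan(query: str, inputs: Dict[str, str]) -> List[Subtask]:
    """Stand-in for intent interpretation: one subtask per supplied modality."""
    return [
        Subtask(modality, f"What in this {modality} is relevant to: {query}", ref)
        for modality, ref in sorted(inputs.items())
    ]


def delegate(subtasks: List[Subtask], agents: Dict[str, Agent]) -> Dict[str, str]:
    """Route each subtask to the agent registered for its modality."""
    results: Dict[str, str] = {}
    for task in subtasks:
        agent = agents.get(task.modality)
        if agent is None:
            results[task.modality] = "[no agent available]"
        else:
            results[task.modality] = agent(task.question, task.payload)
    return results


def integrate(query: str, results: Dict[str, str]) -> str:
    """Stand-in for answer synthesis: concatenate per-modality evidence."""
    evidence = "; ".join(f"{m}: {r}" for m, r in sorted(results.items()))
    return f"Q: {query} | Evidence: {evidence}"


def master_agent(query: str, inputs: Dict[str, str], agents: Dict[str, Agent]) -> str:
    """Plan, delegate, then integrate. No model is trained or fine-tuned here."""
    return integrate(query, delegate(plan(query, inputs), agents))


# Stub agents standing in for pre-existing models (assumptions for illustration).
def stub_image_agent(question: str, payload: str) -> str:
    return f"image agent saw {payload}"


def stub_audio_agent(question: str, payload: str) -> str:
    return f"audio agent heard {payload}"


AGENTS: Dict[str, Agent] = {"image": stub_image_agent, "audio": stub_audio_agent}

if __name__ == "__main__":
    print(master_agent("What is happening?",
                       {"image": "cat.png", "audio": "meow.wav"}, AGENTS))
```

Because agents are plain callables looked up by modality name, swapping in a different model in this sketch only means registering another function in `AGENTS`; whether the actual system is organized this way is not stated in the text above.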
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
