MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation

Shoubin Yu; Yue Zhang; Ziyang Wang; Jaehong Yoon; Mohit Bansal

arXiv:2506.17113·cs.CV·October 28, 2025

TL;DR

MEXA is a training-free framework that dynamically selects and aggregates specialized expert models for multimodal reasoning across diverse domains.

Contribution

Introduces MEXA, a modular, task- and modality-aware aggregation framework that enables flexible multimodal reasoning without additional training.

Findings

01. Outperforms strong multimodal baselines on various benchmarks.

02. Effectively handles diverse modalities like video, audio, and medical data.

03. Provides interpretable reasoning outputs from expert models.

Abstract

Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging due to the increasing diversity of input modalities and task complexity. For instance, medical diagnosis requires precise reasoning over structured clinical tables, while financial forecasting depends on interpreting plot-based data to make informed predictions. To tackle this challenge, we introduce MEXA, a training-free framework that performs modality- and task-aware aggregation of multiple expert models to enable effective multimodal reasoning across diverse and distinct domains. MEXA dynamically selects expert models based on the input modality and the task-specific reasoning demands (i.e., skills). Each expert model, specialized in a modality-task pair, generates interpretable textual reasoning outputs. MEXA then aggregates and…
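
The selection step described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: the `Query` type, the `REGISTRY` of stub experts, and the helper names are hypothetical. Because the abstract is cut off before it describes aggregation, `build_context` only joins the experts' textual outputs as a placeholder and does not represent MEXA's actual aggregation step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical types: an expert maps a raw input to a textual reasoning output.
Expert = Callable[[str], str]
ExpertKey = Tuple[str, str]  # (modality, skill)


@dataclass
class Query:
    question: str
    inputs: Dict[str, str]  # modality -> raw input (file names stand in for data)
    skills: List[str]       # skills the task is assumed to require


def select_experts(query: Query, registry: Dict[ExpertKey, Expert]) -> List[ExpertKey]:
    """Keep only experts whose (modality, skill) pair matches the query."""
    return [
        key for key in registry
        if key[0] in query.inputs and key[1] in query.skills
    ]


def run_experts(query: Query, registry: Dict[ExpertKey, Expert]) -> Dict[ExpertKey, str]:
    """Run each selected expert on the input of its own modality."""
    selected = select_experts(query, registry)
    return {key: registry[key](query.inputs[key[0]]) for key in selected}


def build_context(query: Query, outputs: Dict[ExpertKey, str]) -> str:
    """Placeholder aggregation: join expert outputs and the question as text."""
    lines = [f"[{m}/{s}] {text}" for (m, s), text in sorted(outputs.items())]
    lines.append(f"Question: {query.question}")
    return "\n".join(lines)


# Toy registry of stub experts (assumed names; real experts would be models).
REGISTRY: Dict[ExpertKey, Expert] = {
    ("video", "temporal"): lambda x: f"video timeline for {x}",
    ("audio", "transcription"): lambda x: f"transcript of {x}",
    ("table", "numeric"): lambda x: f"table statistics for {x}",
}

query = Query(
    question="What happens after the speaker stops talking?",
    inputs={"video": "clip.mp4", "audio": "clip.wav"},
    skills=["temporal", "transcription"],
)
outputs = run_experts(query, REGISTRY)
context = build_context(query, outputs)
print(context)
```

In this toy setup the table expert is skipped because the query carries no table input, which mirrors the idea of activating only the experts relevant to the given modalities and skills. No component is trained; the sketch only routes inputs and collects text.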
