Learning to Fuse: Modality-Aware Adaptive Scheduling for Robust Multimodal Foundation Models
Liam Bennett, Mason Clark, Lucas Anderson, Hana Satou, Olivia Martinez

TL;DR
This paper introduces MA-AFS, a dynamic fusion framework for multimodal models that adaptively emphasizes more reliable modalities per instance, improving robustness and generalization across vision-language tasks.
Contribution
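The paper proposes a lightweight neural scheduler for adaptive modality fusion. The scheduler predicts per-instance fusion weights from visual and textual entropy signals together with cross-modal agreement cues, improving robustness without significantly increasing model complexity.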
Findings
Achieves consistent performance improvements over strong baselines.
Enhances robustness under modality noise and corruption.
Improves generalization under domain shifts.
Abstract
Multimodal foundation models have achieved impressive progress across a wide range of vision-language tasks. However, existing approaches often adopt fixed or task-specific fusion strategies, neglecting the intrinsic variability of modality reliability and sample complexity. In this paper, we propose Modality-Aware Adaptive Fusion Scheduling (MA-AFS), a general framework that learns to dynamically modulate the contribution of each modality on a per-instance basis. MA-AFS introduces a lightweight neural scheduler that predicts modality fusion weights by integrating visual and textual entropy signals along with cross-modal agreement cues. This enables the model to adaptively emphasize more reliable modalities, especially under noisy, missing, or misaligned inputs. We formulate the fusion process as a differentiable scheduling mechanism, analyze its theoretical consistency and…
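The scheduling idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes two modalities, Shannon entropy of each modality's class probabilities as the uncertainty signal, cosine similarity of the two embeddings as the agreement cue, and a single linear layer followed by a softmax as the scheduler. The names `fusion_weights`, `fuse`, and `PARAMS` are hypothetical, and the parameters are hand-picked for illustration rather than learned.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cosine(u, v):
    """Cosine similarity, used here as an assumed cross-modal agreement cue."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def fusion_weights(vis_probs, txt_probs, vis_emb, txt_emb, params):
    """Hypothetical scheduler: one linear layer plus softmax over two modalities.

    Cue vector = [visual entropy, textual entropy, agreement].
    Returns [w_visual, w_text], which sum to 1.
    """
    cues = [entropy(vis_probs), entropy(txt_probs), cosine(vis_emb, txt_emb)]
    logits = [
        sum(w * c for w, c in zip(row, cues)) + b
        for row, b in zip(params["W"], params["b"])
    ]
    return softmax(logits)


def fuse(vis_emb, txt_emb, weights):
    """Convex combination of the two embeddings using the scheduled weights."""
    w_v, w_t = weights
    return [w_v * a + w_t * b for a, b in zip(vis_emb, txt_emb)]


# Hand-picked illustrative parameters (an assumption, not from the paper):
# each modality is penalized by its own entropy and favored by the other's.
# The agreement coefficients are left at 0.0; a trained scheduler would learn them.
PARAMS = {"W": [[-1.0, 1.0, 0.0], [1.0, -1.0, 0.0]], "b": [0.0, 0.0]}

if __name__ == "__main__":
    vis_probs = [0.25, 0.25, 0.25, 0.25]  # uncertain visual prediction
    txt_probs = [0.97, 0.01, 0.01, 0.01]  # confident textual prediction
    vis_emb, txt_emb = [1.0, 0.0, 0.5], [0.8, 0.1, 0.4]
    w = fusion_weights(vis_probs, txt_probs, vis_emb, txt_emb, PARAMS)
    print("weights (visual, text):", w)
    print("fused embedding:", fuse(vis_emb, txt_emb, w))
```

With these parameters the scheduler shifts weight toward whichever modality has the lower predictive entropy. Because every step is a smooth function of the cues, the same structure could in principle be trained end to end, which is the sense in which the abstract calls the mechanism differentiable.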
Taxonomy
Topics: Constraint Satisfaction and Optimization · Multi-Agent Systems and Negotiation
