Multi-Agent Cooperative Learning for Robust Vision-Language Alignment under OOD Concepts

Philip Xu

arXiv:2601.09746 · cs.MA · April 8, 2026

TL;DR

This paper presents MACL, a multi-agent framework that improves vision-language alignment under OOD concepts through cooperative learning and adaptive dynamic balancing, reporting 1-5% precision gains on the VISTA-Beyond dataset in few-shot and zero-shot settings.

Contribution

Introduces a multi-agent cooperative learning framework with structured message passing and adaptive balancing to enhance cross-modal alignment under OOD conditions.

Findings

01

Achieves 1-5% precision gains in both few-shot and zero-shot settings on the VISTA-Beyond dataset.

02

Mitigates modality imbalance in vision-language models through structured message passing among the agents.

03

Demonstrates robustness across diverse visual domains.

Abstract

This paper introduces a novel Multi-Agent Cooperative Learning (MACL) framework to address cross-modal alignment collapse in vision-language models when handling out-of-distribution (OOD) concepts. Four core agents (image, text, name, and coordination) collaboratively mitigate modality imbalance through structured message passing. The proposed framework enables multi-agent feature space name learning, incorporates a context exchange enhanced few-shot learning algorithm, and adopts an adaptive dynamic balancing mechanism to regulate inter-agent contributions. Experiments on the VISTA-Beyond dataset demonstrate that MACL improves performance in both few-shot and zero-shot settings, achieving 1-5% precision gains across diverse visual domains.
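The abstract does not spell out how the adaptive dynamic balancing mechanism weights the agents, so the following is only an illustrative sketch of one plausible scheme, not the paper's method. It assumes each modality agent (image, text, name) emits a list of class scores, and that a coordination step gives more weight to agents whose predictions are more confident (lower entropy). The function `balance_agents`, the entropy-based confidence, and the `temperature` parameter are all hypothetical choices made for this example.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def balance_agents(agent_scores, temperature=1.0):
    """Fuse per-agent class scores using confidence-based weights.

    agent_scores: dict mapping a (hypothetical) agent name to its class scores.
    Returns (weights, fused_probs).
    Assumption: a lower-entropy prediction indicates a more reliable agent.
    """
    names = list(agent_scores)
    probs = {name: softmax(agent_scores[name]) for name in names}

    # Confidence is negative entropy; a softmax turns it into weights summing to 1.
    confidences = [-entropy(probs[name]) for name in names]
    weights = dict(zip(names, softmax(confidences, temperature)))

    # Weighted average of the agents' class distributions.
    n_classes = len(probs[names[0]])
    fused = [
        sum(weights[name] * probs[name][c] for name in names)
        for c in range(n_classes)
    ]
    return weights, fused


# Toy scores for three candidate classes (made-up numbers, not from the paper).
scores = {
    "image": [2.0, 0.1, 0.1],  # peaked, so treated as confident
    "text": [0.5, 0.4, 0.3],   # nearly flat, so treated as uncertain
    "name": [1.0, 0.2, 0.1],
}
weights, fused = balance_agents(scores)
```

In this toy setting the image agent receives a larger weight than the text agent because its scores are more sharply peaked. A learned balancing module, or one driven by the messages the agents exchange, could replace the entropy heuristic without changing the fusion step.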
