MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces

Shaojun E; Yuchen Yang; Jiaheng Wu; Yan Zhang; Tiejun Zhao; Ziyan Chen

arXiv:2507.21741 · cs.CV · July 30, 2025

TL;DR

MAGE is a multimodal framework that improves alignment between the visual features produced by a vision encoder and the semantic space of a language model, enhancing generation capabilities and performance across multiple benchmarks.

Contribution

MAGE introduces an alignment mechanism, the Intelligent Alignment Network (IAN), together with a training strategy that bridges the semantic gap between vision and text, expanding the capabilities of multimodal models.

Findings

01  Significantly outperforms similar models on MME, MMBench, and SEED benchmarks.

02  Achieves better semantic and dimensional alignment between visual and textual data.

03  Enhances 'Any-to-Any' multimodal generation capabilities.

Abstract

In the latest advancements in multimodal learning, effectively addressing the spatial and semantic losses of visual data after encoding remains a critical challenge. This is because the performance of large multimodal models is positively correlated with the coupling between visual encoders and large language models. Existing approaches often face issues such as vector gaps or semantic disparities, resulting in information loss during the propagation process. To address these issues, we propose MAGE (Multimodal Alignment and Generation Enhancement), a novel framework that bridges the semantic spaces of vision and text through an innovative alignment mechanism. By introducing the Intelligent Alignment Network (IAN), MAGE achieves dimensional and semantic alignment. To reduce the gap between synonymous heterogeneous data, we employ a training strategy that combines cross-entropy and mean…
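
As a rough illustration of what "dimensional alignment" and a cross-entropy objective can look like in practice, the sketch below projects visual tokens into the width of a text embedding space with a single linear map and evaluates a cross-entropy loss on a toy set of logits. It is a minimal sketch under stated assumptions, not the IAN architecture or the authors' training code: the abstract does not describe IAN's internals, and the second loss term is cut off in the text above, so it is left out. All names (`project`, `cross_entropy`, `d_vis`, `d_txt`) and all sizes are hypothetical.

```python
import math
import random


def project(visual_tokens, weight, bias):
    """Map each visual token (length d_vis) to the text embedding width (d_txt).

    `weight` holds d_txt rows of length d_vis. A single linear layer is an
    assumption made for illustration; it is not taken from the paper.
    """
    return [
        [sum(v * w for v, w in zip(token, row)) + b for row, b in zip(weight, bias)]
        for token in visual_tokens
    ]


def cross_entropy(logits, target_index):
    """Negative log-likelihood of `target_index` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target_index]


# Toy sizes, chosen only for illustration.
rng = random.Random(0)
d_vis, d_txt, n_tokens = 8, 4, 3

visual_tokens = [[rng.gauss(0, 1) for _ in range(d_vis)] for _ in range(n_tokens)]
weight = [[rng.gauss(0, 0.1) for _ in range(d_vis)] for _ in range(d_txt)]
bias = [0.0] * d_txt

# After projection, every visual token has the same width as a text embedding.
aligned = project(visual_tokens, weight, bias)

# Cross-entropy of a toy next-token prediction over a 3-word vocabulary.
loss = cross_entropy([2.0, 0.5, -1.0], target_index=0)
```

The projection only guarantees that the shapes match; whether the projected tokens are also semantically close to their textual counterparts is what a training objective has to enforce.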
