SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs

Yuanyang Yin; Yaqi Zhao; Yajie Zhang; Yuanxing Zhang; Ke Lin; Jiahao Wang; Xin Tao; Pengfei Wan; Wentao Zhang; Feng Zhao

arXiv:2408.11813·cs.CV·September 8, 2025

TL;DR

This paper introduces SEA, a token-level supervised embedding alignment method that enhances visual-textual integration in multimodal large language models, especially benefiting smaller models by significantly improving their cross-modal understanding.

Contribution

SEA provides a novel token-level supervision approach for better modality alignment, outperforming traditional adapter-based methods with minimal additional computational cost.

Findings

01. SEA improves performance across various model sizes.

02. Smaller models see an average gain of 7.61%.

03. SEA enhances cross-modal understanding significantly.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities by integrating visual and textual inputs, yet modality alignment remains one of the most challenging aspects. Current MLLMs typically rely on simple adapter architectures and pretraining approaches to bridge vision encoders with large language models (LLMs), guided by image-level supervision. We find that this paradigm often leads to suboptimal alignment between modalities, significantly constraining the LLM's ability to properly interpret and reason with visual features. This limitation degrades overall performance, particularly for smaller language models, where capacity constraints are more pronounced and adaptation capabilities are limited. To address this fundamental limitation, we propose Supervised Embedding Alignment (SEA), a token-level supervision alignment…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Adapter · ALIGN · Contrastive Language-Image Pre-training