SAILViT: Towards Robust and Generalizable Visual Backbones for MLLMs via Gradual Feature Refinement

Weijie Yin; Dingkang Yang; Hongyuan Dong; Zijian Kang; Jiacong Wang; Xiao Liang; Chao Feng; Jiao Ran

arXiv:2507.01643·cs.CV·July 3, 2025

TL;DR

SAILViT applies gradual feature refinement to Vision Transformers so that they serve as more robust and generalizable visual backbones for multimodal large language models, improving performance across downstream tasks on the OpenCompass benchmark.

Contribution

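The paper proposes SAILViT, a ViT enhanced with gradual feature learning that achieves coarse-to-fine-grained feature alignment and world knowledge infusion. It targets the parameter initialization conflicts and modality semantic gaps that hinder connector-based co-training of ViTs with LLMs.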

Findings

01. SAILViT improves robustness across different model sizes and architectures.

02. SAILViT enhances performance on the OpenCompass benchmark across multiple downstream tasks.

03. Thorough empirical analysis confirms its generalizability and effectiveness.

Abstract

Vision Transformers (ViTs) are essential as foundation backbones in establishing the visual comprehension capabilities of Multimodal Large Language Models (MLLMs). Although most ViTs achieve impressive performance through image-text pair-based contrastive learning or self-supervised mechanisms, they struggle to engage in connector-based co-training directly with LLMs due to potential parameter initialization conflicts and modality semantic gaps. To address the above challenges, this paper proposes SAILViT, a gradual feature learning-enhanced ViT for facilitating MLLMs to break through performance bottlenecks in complex multimodal interactions. SAILViT achieves coarse-to-fine-grained feature alignment and world knowledge infusion with gradual feature refinement, which better serves target training demands. We perform thorough empirical analyses to confirm the powerful robustness and…
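To make "connector-based co-training" concrete, the following is a minimal sketch of how a connector can sit between a ViT and an LLM: patch features are projected to the LLM embedding width and then placed in front of the text embeddings. It is an illustration only, not the SAILViT implementation. The two-layer MLP design, the class name `MLPConnector`, the helper names, and all dimensions are assumptions chosen for readability.

```python
import math
import random


def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(tokens, weight, bias):
    # tokens: [n][d_in], weight: [d_in][d_out], bias: [d_out] -> [n][d_out]
    return [
        [sum(t[i] * weight[i][j] for i in range(len(t))) + bias[j] for j in range(len(bias))]
        for t in tokens
    ]


def init_weight(rng, d_in, d_out):
    # small random initialization (hypothetical choice)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_out)] for _ in range(d_in)]


class MLPConnector:
    """Hypothetical two-layer MLP mapping ViT patch features to the LLM embedding width."""

    def __init__(self, vit_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.w1 = init_weight(rng, vit_dim, llm_dim)
        self.b1 = [0.0] * llm_dim
        self.w2 = init_weight(rng, llm_dim, llm_dim)
        self.b2 = [0.0] * llm_dim

    def __call__(self, patch_features):
        hidden = [[gelu(v) for v in row] for row in linear(patch_features, self.w1, self.b1)]
        return linear(hidden, self.w2, self.b2)


# Illustrative sizes only: 4 image patches of width 8, 3 text tokens of width 16.
rng = random.Random(1)
patch_features = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
text_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(3)]

connector = MLPConnector(vit_dim=8, llm_dim=16)
visual_tokens = connector(patch_features)

# The LLM would consume visual tokens followed by text tokens as one sequence.
llm_input = visual_tokens + text_embeddings
print(len(llm_input), len(llm_input[0]))
```

In a real training setup the connector weights, and optionally the ViT and LLM weights, would be updated by gradient descent. Which of these components are frozen at each stage is a design decision that this sketch does not model.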

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications