On Domain-Adaptive Post-Training for Multimodal Large Language Models

Daixuan Cheng; Shaohan Huang; Ziyu Zhu; Xintong Zhang; Wayne Xin Zhao; Zhongzhi Luan; Bo Dai; Zhenliang Zhang

arXiv:2411.19930·cs.CL·August 28, 2025

Open Access · 10 Models · 5 Datasets

TL;DR

This paper presents a systematic approach for domain adaptation of multimodal large language models through post-training, emphasizing data synthesis, training strategies, and extensive domain-specific evaluations.

Contribution

It introduces a generate-then-filter data synthesis pipeline, demonstrates the effectiveness of single-stage training for domain adaptation, and provides comprehensive evaluations across multiple high-impact domains.

Findings

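01. Data synthesized by the generate-then-filter pipeline improves domain-specific performance more than data synthesized by manual rules or by strong closed-source models.

02. For domain adaptation, single-stage training is more effective than the two-stage paradigm typical of general MLLMs.

03. Extensive experiments validate improved performance in biomedicine, food, and remote sensing.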

Abstract

Adapting general multimodal large language models (MLLMs) to specific domains, such as scientific and industrial fields, is highly significant in promoting their practical applications. This paper systematically investigates domain adaptation of MLLMs via post-training, focusing on data synthesis, training pipeline, and task evaluation. (1) Data Synthesis: Using only open-source models, we develop a generate-then-filter pipeline that curates diverse visual instruction tasks based on domain-specific image-caption pairs. The resulting data surpass the data synthesized by manual rules or strong closed-source models in enhancing domain-specific performance. (2) Training Pipeline: Unlike general MLLMs that typically adopt a two-stage training paradigm, we find that a single-stage approach is more effective for domain adaptation. (3) Task Evaluation: We conduct extensive experiments in…
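The generate-then-filter idea in point (1) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the generator model or the filtering criterion, so `ImageCaptionPair`, `InstructionTask`, `generate_then_filter`, `toy_generate`, and `toy_keep` are all hypothetical names, and the toy generator and filter are placeholders for an open-source model and a real quality check.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ImageCaptionPair:
    """A domain-specific image (by path) with its caption. Hypothetical type."""
    image_path: str
    caption: str


@dataclass
class InstructionTask:
    """One synthesized visual instruction task. Hypothetical type."""
    image_path: str
    instruction: str
    response: str


def generate_then_filter(
    pairs: Iterable[ImageCaptionPair],
    generate: Callable[[ImageCaptionPair], List[InstructionTask]],
    keep: Callable[[ImageCaptionPair, InstructionTask], bool],
) -> List[InstructionTask]:
    """Generate candidate tasks for each pair, then keep only those passing the filter."""
    kept: List[InstructionTask] = []
    for pair in pairs:
        for task in generate(pair):
            if keep(pair, task):
                kept.append(task)
    return kept


# Toy stand-ins (assumptions): a real pipeline would call an open-source
# model in `generate` and apply a real quality criterion in `keep`.
def toy_generate(pair: ImageCaptionPair) -> List[InstructionTask]:
    return [
        InstructionTask(pair.image_path, "Describe the image.", pair.caption),
        InstructionTask(pair.image_path, "What is shown?", ""),  # empty answer
    ]


def toy_keep(pair: ImageCaptionPair, task: InstructionTask) -> bool:
    # Placeholder filter: drop tasks whose response is empty.
    return bool(task.response.strip())


pairs = [ImageCaptionPair("xray_001.png", "A frontal chest X-ray.")]
tasks = generate_then_filter(pairs, toy_generate, toy_keep)
```

Passing the generator and filter as callables keeps the loop independent of any particular model, so either stage can be swapped without touching the other.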

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Byte Pair Encoding · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Residual Connection · Adam · Attention Is All You Need · Softmax · Label Smoothing · Dropout · Linear Layer