MultiModal Fine-tuning with Synthetic Captions

Shohei Enomoto; Shin'ya Yamaguchi

arXiv:2601.21426 · cs.CV · January 30, 2026

Open Access

TL;DR

This paper introduces a method to enhance fine-tuning of neural networks by generating synthetic captions for unimodal datasets, enabling effective multimodal training and improving classification performance, especially in few-shot learning.

Contribution

The paper proposes fine-tuning with synthetic captions and a supervised contrastive loss to bridge the gap between multimodal pre-training and unimodal fine-tuning.

Findings

01. Outperforms baseline methods on 13 image classification benchmarks.

02. Significant improvements observed in few-shot learning scenarios.

03. Demonstrates the effectiveness of synthetic captions in dataset enhancement.

Abstract

In this paper, we address a fundamental gap between pre-training and fine-tuning of deep neural networks: while pre-training has shifted from unimodal to multimodal learning with enhanced visual understanding, fine-tuning predominantly remains unimodal, limiting the benefits of rich pre-trained representations. To bridge this gap, we propose a novel approach that transforms unimodal datasets into multimodal ones using Multimodal Large Language Models (MLLMs) to generate synthetic image captions for fine-tuning models with a multimodal objective. Our method employs carefully designed prompts incorporating class labels and domain context to produce high-quality captions tailored for classification tasks. Furthermore, we introduce a supervised contrastive loss function that explicitly encourages clustering of same-class representations during fine-tuning, along with a new inference…
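
The abstract mentions a supervised contrastive loss that pulls same-class representations together, but the truncated text above does not give its exact form. As an illustration only, the sketch below implements the generic supervised contrastive formulation (average log-probability of same-class pairs under a temperature-scaled softmax over cosine similarities). The function name, the temperature value, and the use of NumPy are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss (illustrative, not the paper's exact loss).

    features: (n, d) array of embeddings, one row per sample.
    labels:   (n,) array of integer class labels.
    Returns a float; anchors with no same-class partner are skipped.
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = labels.shape[0]

    self_mask = np.eye(n, dtype=bool)
    # Positives: other samples that share the anchor's class label.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0

    # L2-normalize so dot products are cosine similarities.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    # Exclude each anchor's similarity with itself from the softmax.
    sim = np.where(self_mask, -np.inf, sim)

    # Numerically stable log-softmax over all other samples in the batch.
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - row_max - log_denominator

    # Mean log-probability of the positives, per anchor.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[valid] / pos_count[valid]
    return float(per_anchor.mean())


if __name__ == "__main__":
    # Hypothetical toy embeddings: two tight clusters in 2-D.
    toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(supervised_contrastive_loss(toy_features, [0, 0, 1, 1]))
    print(supervised_contrastive_loss(toy_features, [0, 1, 0, 1]))
```

In a multimodal setting, the rows of `features` could plausibly be image embeddings and caption embeddings stacked together, with each caption inheriting its image's class label. That pairing is an assumption about how such a loss might be applied, since the abstract is cut off before it describes the details.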

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis