MultiModal Fine-tuning with Synthetic Captions
Shohei Enomoto, Shin'ya Yamaguchi

TL;DR
This paper introduces a method to enhance fine-tuning of neural networks by generating synthetic captions for unimodal datasets, enabling effective multimodal training and improving classification performance, especially in few-shot learning.
Contribution
It proposes a novel approach using synthetic captions and contrastive loss to bridge the gap between pre-training and fine-tuning in multimodal learning.
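As a rough illustration of the caption-generation step, the sketch below builds a text prompt from a class label and a domain description, the two ingredients the abstract says the prompts incorporate. The function name `build_caption_prompt`, its parameters, and the prompt wording are all hypothetical; the paper's actual prompts are not reproduced here, and the call to an MLLM is deliberately left out.

```python
def build_caption_prompt(class_name: str, domain: str) -> str:
    """Build a captioning prompt from a class label and domain context.

    Hypothetical wording for illustration only; not the authors' prompt.
    """
    return (
        f"This image comes from a dataset of {domain} and shows a {class_name}. "
        f"Describe the visual features that identify it as a {class_name} "
        "in one or two sentences."
    )


# Example: prompts for a small labeled (image path, class label) dataset.
dataset = [("img_001.jpg", "golden retriever"), ("img_002.jpg", "siamese cat")]
prompts = [(path, build_caption_prompt(label, "pet breeds")) for path, label in dataset]
```

Each prompt would then be sent to an MLLM together with its image, and the returned caption stored next to the original label to form an (image, caption, label) triple.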
Findings
Outperforms baseline methods on 13 image classification benchmarks.
Significant improvements observed in few-shot learning scenarios.
Demonstrates the effectiveness of synthetic captions in dataset enhancement.
Abstract
In this paper, we address a fundamental gap between pre-training and fine-tuning of deep neural networks: while pre-training has shifted from unimodal to multimodal learning with enhanced visual understanding, fine-tuning predominantly remains unimodal, limiting the benefits of rich pre-trained representations. To bridge this gap, we propose a novel approach that transforms unimodal datasets into multimodal ones using Multimodal Large Language Models (MLLMs) to generate synthetic image captions for fine-tuning models with a multimodal objective. Our method employs carefully designed prompts incorporating class labels and domain context to produce high-quality captions tailored for classification tasks. Furthermore, we introduce a supervised contrastive loss function that explicitly encourages clustering of same-class representations during fine-tuning, along with a new inference…
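To make the idea of "clustering same-class representations" concrete, here is a minimal NumPy sketch of a supervised contrastive loss. It assumes the widely used SupCon-style formulation (average log-probability of same-class samples under a temperature-scaled softmax over all other samples); the paper's exact loss, its temperature, and how image and caption embeddings are combined are not stated in this summary, so the function name and defaults below are assumptions.

```python
import numpy as np


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """SupCon-style loss over one batch (assumed formulation, not the paper's)."""
    z = np.asarray(embeddings, dtype=np.float64)
    y = np.asarray(labels)

    # L2-normalize so dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    # Exclude each sample's similarity with itself from the softmax.
    self_mask = np.eye(len(y), dtype=bool)
    logits = np.where(self_mask, -np.inf, logits)

    # Numerically stable log-softmax over all other samples in the batch.
    row_max = logits.max(axis=1, keepdims=True)
    log_denominator = row_max + np.log(
        np.exp(logits - row_max).sum(axis=1, keepdims=True)
    )
    log_prob = logits - log_denominator

    # Positives: other samples that share the anchor's class label.
    positive_mask = (y[:, None] == y[None, :]) & ~self_mask
    num_positives = positive_mask.sum(axis=1)
    has_positive = num_positives > 0
    if not has_positive.any():
        return 0.0

    # Mean log-probability of positives per anchor, averaged over anchors.
    positive_log_prob = np.where(positive_mask, log_prob, 0.0).sum(axis=1)
    per_anchor = -positive_log_prob[has_positive] / num_positives[has_positive]
    return float(per_anchor.mean())


# Two tight clusters: the loss is lower when labels agree with the clusters.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = supervised_contrastive_loss(points, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(points, [0, 1, 0, 1])
```

In a fine-tuning setup this term would typically be computed on batch embeddings and added to the main training objective; the weighting between the two is a design choice this sketch does not cover.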
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
