Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis

Davide Bucciarelli; Nicholas Moratelli; Marcella Cornia; Lorenzo Baraldi; Rita Cucchiara

arXiv:2412.03665·cs.CV·December 6, 2024·2 cites


Open Access

TL;DR

This paper evaluates the potential of Multimodal Large Language Models to replace traditional image captioning systems, analyzing their zero-shot and fine-tuned performance across various benchmarks.

Contribution

It provides an experimental analysis of Multimodal LLMs for image captioning, highlighting their strengths and challenges in domain adaptation and generalization.

Findings

01

Multimodal LLMs perform well in zero-shot image captioning.

02

Fine-tuning for specific domains is challenging without losing generalization.

03

Prompt learning and adaptation methods impact performance significantly.

Abstract

The task of image captioning demands an algorithm to generate natural language descriptions of visual inputs. Recent advancements have seen a convergence between image captioning research and the development of Large Language Models (LLMs) and Multimodal LLMs -- like GPT-4V and Gemini -- which extend the capabilities of text-only LLMs to multiple modalities. This paper investigates whether Multimodal LLMs can supplant traditional image captioning networks by evaluating their performance on various image description benchmarks. We explore both the zero-shot capabilities of these models and their adaptability to different semantic domains through fine-tuning methods, including prompt learning, prefix tuning, and low-rank adaptation. Our results demonstrate that while Multimodal LLMs achieve impressive zero-shot performance, fine-tuning for specific domains while maintaining their…
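Of the fine-tuning methods the abstract lists, low-rank adaptation is the easiest to show in a few lines. The sketch below is a generic, dependency-free illustration of the idea, not the paper's implementation: a frozen weight matrix W is augmented with a trainable low-rank product B·A, so only the two small factors would be updated during fine-tuning. The class name `LoRALinear`, the helper `matmul`, and the `rank`, `alpha`, and `seed` parameters are hypothetical names chosen for this example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen linear map W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weights, shape d_out x d_in
        self.scale = alpha / rank     # scaling applied to the low-rank update
        # A starts as small random values and B as zeros, so the update is
        # exactly zero before any training step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def effective_weight(self):
        """Return W + scale * (B @ A) as a new matrix."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to an input vector x of length d_in."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def num_trainable(self):
        """Count the parameters in A and B (the only ones that would be trained)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy example: a 4 x 6 frozen weight matrix adapted with rank 2.
W = [[float(i + j) for j in range(6)] for i in range(4)]
layer = LoRALinear(W, rank=2)
x = [1.0, 0.5, -1.0, 2.0, 0.0, 1.0]
base_output = [sum(w * xi for w, xi in zip(row, x)) for row in W]
adapted_output = layer.forward(x)
```

In this toy setting the adapter holds 2·6 + 4·2 = 20 trainable values against 24 in the full matrix; the saving grows quickly with layer size because the adapter scales with rank·(d_in + d_out) rather than d_in·d_out. Since B is initialized to zero, the adapted layer reproduces the frozen layer's output before any training.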


Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization