Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis
Davide Bucciarelli, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

TL;DR
This paper evaluates the potential of Multimodal Large Language Models to replace traditional image captioning systems, analyzing their zero-shot and fine-tuned performance across various benchmarks.
Contribution
It provides an experimental analysis of Multimodal LLMs for image captioning, highlighting their strengths and challenges in domain adaptation and generalization.
Findings
Multimodal LLMs perform well in zero-shot image captioning.
Fine-tuning for specific domains without losing generalization is challenging.
Prompt learning and adaptation methods impact performance significantly.
Abstract
The task of image captioning demands an algorithm to generate natural language descriptions of visual inputs. Recent advancements have seen a convergence between image captioning research and the development of Large Language Models (LLMs) and Multimodal LLMs -- like GPT-4V and Gemini -- which extend the capabilities of text-only LLMs to multiple modalities. This paper investigates whether Multimodal LLMs can supplant traditional image captioning networks by evaluating their performance on various image description benchmarks. We explore both the zero-shot capabilities of these models and their adaptability to different semantic domains through fine-tuning methods, including prompt learning, prefix tuning, and low-rank adaptation. Our results demonstrate that while Multimodal LLMs achieve impressive zero-shot performance, fine-tuning for specific domains while maintaining their…
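Of the fine-tuning methods the abstract names, low-rank adaptation is the easiest to illustrate in a few lines. The sketch below is a generic, dependency-free illustration of the idea (a frozen weight matrix plus a trainable low-rank update), not the configuration used in the paper. The class name `LoRALinear`, the `rank` and `alpha` parameters, and the initialization scheme are assumptions chosen for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical linear layer: frozen weight W plus a low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # pretrained weight, kept frozen
        # A starts small and random, B starts at zero, so the update is
        # initially zero and the layer reproduces the pretrained output.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        """Compute W x + scale * B (A x) for an input vector x."""
        col = [[v] for v in x]
        base = matmul(self.weight, col)
        update = matmul(self.B, matmul(self.A, col))
        return [b[0] + self.scale * u[0] for b, u in zip(base, update)]

    def trainable_params(self):
        """Count only the adapter parameters (A and B), not the frozen W."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
layer = LoRALinear(W, rank=1)
x = [1.0, 0.0, -1.0]
initial_output = layer.forward(x)  # equals W x, because B is all zeros

# Pretend training has moved the adapter; W itself is never modified.
layer.A = [[1.0, 0.0, 0.0]]
layer.B = [[1.0], [0.0]]
adapted_output = layer.forward(x)
```

Only `A` and `B` would be updated during domain adaptation, which is why such methods are attractive when the aim is to specialize a model while disturbing its pretrained behavior as little as possible.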
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
