Unveiling Effective In-Context Configurations for Image Captioning: An External & Internal Analysis

Li Li; Yongliang Wu; Jingze Zhu; Jiawei Peng; Jianfei Cai; Xu Yang

arXiv:2507.08021 · cs.CL · July 14, 2025

TL;DR

This paper studies how demonstration configuration affects multimodal in-context learning for image captioning. It pairs an external study of configuration strategies with an internal analysis of attention, aiming to improve both the understanding and the performance of large multimodal models.

Contribution

The paper presents a combined external and internal analysis framework for multimodal ICL, including new metrics and insights into attention behavior and demonstration configuration strategies.

Findings

01. Demonstration configuration significantly affects model performance.

02. Attention-based metrics reveal characteristic patterns in model behavior.

03. External and internal analyses provide complementary insights into multimodal ICL.

Abstract

The evolution of large models has witnessed the emergence of In-Context Learning (ICL) capabilities. In Natural Language Processing (NLP), numerous studies have demonstrated the effectiveness of ICL. Inspired by the success of Large Language Models (LLMs), researchers have developed Large Multimodal Models (LMMs) with ICL capabilities. However, explorations of demonstration configuration for multimodal ICL remain preliminary. Additionally, the controllability of In-Context Examples (ICEs) provides an efficient and cost-effective means to observe and analyze the inference characteristics of LMMs under varying inputs. This paper conducts a comprehensive external and internal investigation of multimodal in-context learning on the image captioning task. Externally, we explore demonstration configuration strategies through three dimensions: shot number, image retrieval, and caption…
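To make two of the external dimensions named in the abstract concrete (shot number and image retrieval), here is a minimal sketch of how in-context examples might be chosen and assembled into a captioning prompt. It is an illustration under stated assumptions, not the paper's implementation: the `Demo` structure, the cosine-similarity retrieval, the toy 2-D embeddings, and the `<image:...> Caption:` prompt layout are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Demo:
    """One in-context example (hypothetical structure): image embedding plus caption."""
    image_id: str
    embedding: list[float]
    caption: str


def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_demos(query_embedding, pool, shots):
    """Return the `shots` pool entries most similar to the query image, best match first.

    Similarity-based retrieval is an assumed strategy chosen for illustration.
    """
    ranked = sorted(
        pool,
        key=lambda d: cosine(query_embedding, d.embedding),
        reverse=True,
    )
    return ranked[:shots]


def build_prompt(demos, query_image_id):
    """Interleave image placeholders and captions, ending with an open caption slot.

    The placeholder syntax is made up; a real LMM defines its own image tokens.
    """
    parts = [f"<image:{d.image_id}> Caption: {d.caption}" for d in demos]
    parts.append(f"<image:{query_image_id}> Caption:")
    return "\n".join(parts)


# Toy pool with 2-D embeddings standing in for real image features.
pool = [
    Demo("a", [1.0, 0.0], "a dog on grass"),
    Demo("b", [0.0, 1.0], "a red bus"),
    Demo("c", [0.9, 0.1], "a puppy running"),
]

demos = select_demos([1.0, 0.05], pool, shots=2)  # shot number = 2
prompt = build_prompt(demos, "q")
print(prompt)
```

Changing `shots` varies the shot number, and swapping the `key` function in `select_demos` (for example, to a random score) would give a different retrieval strategy to compare against.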
