A Chain-of-Thought Subspace Meta-Learning for Few-shot Image Captioning with Large Vision and Language Models
Hao Huang, Shuaihang Yuan, Yu Hao, Congcong Wen, Yi Fang

TL;DR
This paper introduces a chain-of-thought subspace meta-learning approach for few-shot image captioning that improves the ability of large vision and language models to generate accurate descriptions with limited data.
Contribution
It proposes a multi-step chain-of-thought meta-learning scheme with subspace parameter learning to enhance few-shot image captioning performance.
Findings
Outperforms baseline methods on MSCOCO, Flickr8k, and Flickr30k datasets.
Demonstrates improved accuracy in few-shot image captioning tasks.
Validates the effectiveness of chain-of-thought meta-learning in multimodal settings.
Abstract
A large-scale vision and language model that has been pretrained on massive data encodes visual and linguistic priors, which makes it easier to generate images and language that are more natural and realistic. Despite this, there is still a significant domain gap between the modalities of vision and language, especially in few-shot settings, where only very limited data are available for training. To mitigate this issue, a multi-modal meta-learning framework has been proposed to bridge the gap between two frozen pretrained large vision and language models by introducing a tunable prompt connecting these two large models. For few-shot image captioning, the existing multi-modal meta-learning framework utilizes a one-step prompting scheme to accumulate the visual features of input images to guide the language model, which struggles to generate accurate…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
