CapText: Large Language Model-based Caption Generation From Image Context and Description

Shinjini Ghosh; Sagnik Anupam

arXiv:2306.00301 · cs.LG · June 7, 2023 · 1 cites

Open Access

TL;DR

This paper introduces CapText, a novel method that uses large language models to generate image captions solely from textual descriptions and context, bypassing direct image processing.

Contribution

The paper presents a new approach that uses existing large language models to generate captions from text-only input. After fine-tuning, it outperforms state-of-the-art image-text alignment models such as OSCAR-VinVL on the CIDEr metric.

Findings

01 After fine-tuning, outperforms state-of-the-art image-text alignment models such as OSCAR-VinVL on the CIDEr metric

02 Generates captions effectively from textual descriptions and context alone, without processing the image

03 Demonstrates the potential of large language models in image captioning

Abstract

While deep-learning models have been shown to perform well on image-to-text datasets, it is difficult to use them in practice for captioning images. This is because captions traditionally tend to be context-dependent and offer complementary information about an image, while models tend to produce descriptions that describe the visual features of the image. Prior research in caption generation has explored the use of models that generate captions when provided with the images alongside their respective descriptions or contexts. We propose and evaluate a new approach, which leverages existing large language models to generate captions from textual descriptions and context alone, without ever processing the image directly. We demonstrate that after fine-tuning, our approach outperforms current state-of-the-art image-text alignment models like OSCAR-VinVL on this task on the CIDEr metric.
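
To make the text-only setup concrete, here is a minimal sketch of how a description and its context might be combined into a single prompt for a language model. The function name `build_caption_prompt` and the template wording are assumptions made for illustration; the abstract does not specify the prompt format or the model used, and no model is called here.

```python
def build_caption_prompt(description: str, context: str) -> str:
    """Assemble a text-only prompt; the image itself is never read."""
    # Hypothetical template: the paper's actual prompt wording is not given
    # in the abstract, so this layout is an illustrative assumption.
    return (
        "Write a caption for an image using only the text below.\n"
        f"Image description: {description.strip()}\n"
        f"Surrounding context: {context.strip()}\n"
        "Caption:"
    )


if __name__ == "__main__":
    prompt = build_caption_prompt(
        description="A woman in a red jacket speaks at a podium.",
        context="Article section about a city council vote on transit funding.",
    )
    # The resulting string would be passed to a (fine-tuned) language model.
    print(prompt)
```

The point of the sketch is that the only inputs are strings: any caption produced downstream depends on the description and context, never on pixel data.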

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling