Large-Scale Bidirectional Training for Zero-Shot Image Captioning

Taehoon Kim; Mark Marsden; Pyunghwan Ahn; Sangyun Kim; Sihaeng Lee; Alessandra Sala; Seung Hwan Kim

arXiv:2211.06774 · cs.CV · October 3, 2023

TL;DR

This paper introduces BITTERS, a large-scale bidirectional training framework that enables zero-shot image captioning, along with a new benchmark for evaluation and an efficient finetuning method for keyword extraction.

Contribution

The paper presents a novel large-scale bidirectional training approach for zero-shot image captioning and a comprehensive evaluation benchmark.

Findings

01 Bidirectional training improves zero-shot captioning accuracy.

02 Careful selection of the large-scale training set and the model architecture is crucial.

03 The proposed finetuning method enhances keyword extraction.

Abstract

When trained on large-scale datasets, image captioning models can understand the content of images from a general domain but often fail to generate accurate, detailed captions. To improve performance, pretraining-and-finetuning has been a key strategy for image captioning. However, we find that large-scale bidirectional training between image and text enables zero-shot image captioning. In this paper, we introduce Bidirectional Image Text Training in largER Scale, BITTERS, an efficient training and inference framework for zero-shot image captioning. We also propose a new evaluation benchmark which comprises high-quality datasets and an extensive set of metrics to properly evaluate zero-shot captioning accuracy and societal bias. We additionally provide an efficient finetuning approach for keyword extraction. We show that careful selection of large-scale training set and model…

Code & Models

Repositories

tgisaturday/BITTERS
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
