Altogether: Image Captioning via Re-aligning Alt-text

Hu Xu; Po-Yao Huang; Xiaoqing Ellen Tan; Ching-Feng Yeh; Jacob Kahn; Christine Jou; Gargi Ghosh; Omer Levy; Luke Zettlemoyer; Wen-tau Yih; Shang-Wen Li; Saining Xie; Christoph Feichtenhofer

arXiv:2410.17251 · cs.CV · December 31, 2024

TL;DR

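This paper introduces Altogether, a method that improves image captioning by re-aligning existing alt-text with image content. Human annotators edit the alt-text over multiple rounds, and a captioner trained on these annotations generalizes the process. The result is richer captions and better downstream text-to-image generation and zero-shot image classification.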

Contribution

The paper proposes a novel approach to improve image captioning by re-aligning existing alt-texts with images using iterative human annotation, unlike prior methods that generate captions from scratch.

Findings

01 Richer image captions produced by the method.

02 Improved performance in text-to-image generation.

03 Enhanced zero-shot image classification results.

Abstract

This paper focuses on creating synthetic data to improve the quality of image captions. Existing works typically have two shortcomings. First, they caption images from scratch, ignoring existing alt-text metadata; second, they lack transparency when the captioner's training data (e.g., GPT) is unknown. In this paper, we study a principled approach, Altogether, based on the key idea of editing and re-aligning the existing alt-texts associated with the images. To generate training data, we perform human annotation in which annotators start with the existing alt-text and re-align it to the image content over multiple rounds, thereby constructing captions with rich visual concepts. This differs from prior work, which carries out human annotation as a one-time description task based solely on images and annotator knowledge. We train a captioner on this data that generalizes the process of re-aligning…
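
The multi-round annotation described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' pipeline: `RealignmentRecord`, `realign`, and the two toy editor functions are hypothetical names, and each "editor" is a plain function standing in for a human annotator who would actually be looking at the image.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type: an editor takes the current caption and returns a revision.
# In the paper this role is played by a human annotator viewing the image;
# a plain function is used here so the sketch runs without images or models.
Editor = Callable[[str], str]


@dataclass
class RealignmentRecord:
    """Keeps the original alt-text and the caption after every editing round."""
    alt_text: str
    rounds: List[str] = field(default_factory=list)

    @property
    def caption(self) -> str:
        # The final caption is the last round, or the alt-text if no edits happened.
        return self.rounds[-1] if self.rounds else self.alt_text


def realign(alt_text: str, editors: List[Editor]) -> RealignmentRecord:
    """Start from the alt-text (not from scratch) and apply one edit per round."""
    record = RealignmentRecord(alt_text=alt_text)
    current = alt_text
    for edit in editors:
        current = edit(current)
        record.rounds.append(current)
    return record


# Toy editors (assumptions for illustration only).
def fix_mismatch(caption: str) -> str:
    # Correct a term that does not match the image content.
    return caption.replace("bird", "red cardinal")


def add_visual_detail(caption: str) -> str:
    # Append a visual concept that the alt-text omitted.
    return caption + ", perched on a wooden fence"


record = realign("photo of a bird", [fix_mismatch, add_visual_detail])
print(record.alt_text)
print(record.caption)
```

Keeping every intermediate round, rather than only the final caption, mirrors the idea that the caption is built by editing existing alt-text instead of being written as a one-time description.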

Code & Models

Repositories

facebookresearch/metaclip (PyTorch, official)

Models

timm/vit_huge_patch14_clip_224.metaclip_altogether (Hugging Face model, 100 downloads, 2 likes)

Videos

Altogether: Image Captioning via Re-aligning Alt-text

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques