Denoising Large-Scale Image Captioning from Alt-text Data using Content Selection Models
Khyathi Raghavi Chandu, Piyush Sharma, Soravit Changpinyo, Ashish Thapliyal, Radu Soricut

TL;DR
This paper introduces a skeleton-based approach to improve large-scale image captioning from noisy alt-text data, enhancing caption quality, controllability, and cross-lingual transferability.
Contribution
It proposes breaking down captioning into skeleton prediction and caption generation, enabling denoising, better control, and multilingual capabilities from noisy datasets.
Findings
Skeleton prediction improves caption quality from noisy data.
Cross-lingual caption generation using English skeletons is effective.
Skeleton-based control allows for adjustable caption properties.
Abstract
Training large-scale image captioning (IC) models demands access to a rich and diverse set of training examples, gathered from the wild, often from noisy alt-text data. However, recent modeling approaches to IC often fall short in terms of performance in this case, because they assume a clean annotated dataset (as opposed to the noisier alt-text-based annotations), and employ an end-to-end generation approach, which often lacks both controllability and interpretability. We address these problems by breaking down the task into two simpler, more controllable tasks: skeleton prediction and skeleton-based caption generation. Specifically, we show that selecting content words as "skeletons" helps in generating improved and denoised captions when leveraging rich yet noisy alt-text-based uncurated datasets. We also show that the predicted English skeletons can be further cross-lingually…
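To make the idea of a "skeleton" concrete, the sketch below reduces a noisy alt-text string to an ordered list of content words. It is only an illustration: the abstract does not say how content words are selected, so the stopword filter, the `FUNCTION_WORDS` set, and the function name `extract_skeleton` are all assumptions rather than the authors' method. In the paper's setup the skeleton is predicted from the image by a model; here it is derived from text purely to show what such a skeleton might look like.

```python
import re
from typing import List

# Hypothetical, deliberately tiny list of function words (an assumption,
# not taken from the paper). Anything not listed counts as a content word.
FUNCTION_WORDS = {
    "a", "an", "the", "of", "on", "in", "at", "by", "for", "with",
    "and", "or", "to", "from", "is", "are", "this", "that",
}


def extract_skeleton(alt_text: str) -> List[str]:
    """Return the content words of `alt_text` in order of first appearance."""
    # Lowercase and keep only alphanumeric runs, which drops punctuation.
    tokens = re.findall(r"[a-z0-9]+", alt_text.lower())
    seen = set()
    skeleton = []
    for token in tokens:
        # Skip function words and repeated tokens.
        if token in FUNCTION_WORDS or token in seen:
            continue
        seen.add(token)
        skeleton.append(token)
    return skeleton


if __name__ == "__main__":
    print(extract_skeleton("A photo of the dog on a beach"))
```

A second-stage caption generator would then be conditioned on this word list together with the image, which is also where control could enter: editing the skeleton changes what the caption is asked to mention.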
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Interpretability
