Concadia: Towards Image-Based Text Generation with a Purpose
Elisa Kreiss, Fei Fang, Noah D. Goodman, Christopher Potts

TL;DR
This paper introduces Concadia, a new dataset distinguishing image descriptions from captions, and demonstrates that incorporating textual context improves image-to-text model performance for practical applications.
Contribution
The paper contributes a dataset and analysis that differentiate descriptions from captions, and shows that conditioning image-to-text models on textual context improves the text they generate.
Findings
Augmenting image-to-text models with textual context improves their performance
Descriptions and captions serve different communicative roles
The dataset supports image-to-text models whose output is more useful in practice
Abstract
Current deep learning models often achieve excellent results on benchmark image-to-text datasets but fail to generate texts that are useful in practice. We argue that to close this gap, it is vital to distinguish descriptions from captions based on their distinct communicative roles. Descriptions focus on visual features and are meant to replace an image (often to increase accessibility), whereas captions appear alongside an image to supply additional information. To motivate this distinction and help people put it into practice, we introduce the publicly available Wikipedia-based dataset Concadia consisting of 96,918 images with corresponding English-language descriptions, captions, and surrounding context. Using insights from Concadia, models trained on it, and a preregistered human-subjects experiment with human- and model-generated texts, we characterize the commonalities and…
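To make the dataset structure concrete, here is a minimal sketch of how one Concadia-style record and a context-augmented training example might be represented. The class name, field names, and the `training_example` helper are hypothetical and are not taken from the paper or its released code. The sample text in the usage note is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConcadiaRecord:
    """One image with its three associated texts (hypothetical field names)."""
    image_path: str
    description: str  # focuses on visual features; meant to replace the image
    caption: str      # appears alongside the image and adds information
    context: str      # text surrounding the image in the article


def training_example(record: ConcadiaRecord, target: str,
                     use_context: bool = True) -> dict:
    """Build one example for generating either a description or a caption.

    Assumption: a context-aware model receives the surrounding text as an
    extra input; with use_context=False the model sees the image only.
    """
    if target not in ("description", "caption"):
        raise ValueError("target must be 'description' or 'caption'")
    return {
        "image_path": record.image_path,
        "context": record.context.strip() if use_context else "",
        "target_text": getattr(record, target),
    }
```

For instance, a record with the description "A red brick lighthouse on a rocky shore." and the caption "The lighthouse was built in 1870." yields two different training examples from the same image, one per communicative role, each optionally paired with the surrounding article text.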
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
