DOCCI: Descriptions of Connected and Contrasting Images

Yasumasa Onoe; Sunayana Rane; Zachary Berger; Yonatan Bitton; Jaemin Cho; Roopal Garg; Alexander Ku; Zarana Parekh; Jordi Pont-Tuset; Garrett Tanzer; Su Wang; Jason Baldridge

arXiv:2404.19753 · cs.CV · May 1, 2024 · 1 citation

TL;DR

DOCCI is a new dataset with detailed, human-annotated descriptions for 15,000 images, designed to improve image-to-text and text-to-image models by providing rich, challenging descriptions that highlight key visual and contextual details.

Contribution

The paper introduces DOCCI, a dataset of long, detailed, human-annotated descriptions for 15k images, intended to support training and evaluation of vision-language models on fine-grained, challenging cases.

Findings

01 Finetuning PaLI 5B on DOCCI improves image-to-text performance.

02 DOCCI descriptions reveal limitations of current text-to-image models.

03 Dataset enables better understanding of spatial, counting, and world knowledge challenges.

Abstract

Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow for richer associations to be learned by models. To fill the gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. We instruct human annotators to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from those that are related or similar. Each description is highly compositional and typically encompasses multiple challenges. Through both quantitative and…

Taxonomy

Topics: Advanced Vision and Imaging