DOCCI: Descriptions of Connected and Contrasting Images
Yasumasa Onoe, Sunayana Rane, Zachary Berger, Yonatan Bitton, Jaemin Cho, Roopal Garg, Alexander Ku, Zarana Parekh, Jordi Pont-Tuset, Garrett Tanzer, Su Wang, Jason Baldridge

TL;DR
DOCCI is a new dataset with detailed, human-annotated descriptions for 15,000 images, designed to improve image-to-text and text-to-image models by providing rich, challenging descriptions that highlight key visual and contextual details.
Contribution
The paper introduces DOCCI, a comprehensive dataset with long, detailed descriptions for images, enhancing training and evaluation of vision-language models with fine-grained, challenging data.
Findings
Finetuning PaLI 5B on DOCCI improves image-to-text performance.
DOCCI descriptions reveal limitations of current text-to-image models.
The dataset enables a better understanding of challenges in spatial relations, counting, and world knowledge.
Abstract
Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow for richer associations to be learned by models. To fill the gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. We instruct human annotators to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from those that are related or similar. Each description is highly compositional and typically encompasses multiple challenges. Through both quantitative and…
Taxonomy
Topics: Advanced Vision and Imaging
