ImageInWords: Unlocking Hyper-Detailed Image Descriptions

Roopal Garg; Andrea Burns; Burcu Karagol Ayan; Yonatan Bitton; Ceslee Montgomery; Yasumasa Onoe; Andrew Bunner; Ranjay Krishna; Jason Baldridge; Radu Soricut

arXiv:2405.02793·cs.CV·October 30, 2024

TL;DR

This paper introduces ImageInWords, a human-in-the-loop framework for creating hyper-detailed image descriptions that significantly improve description quality and downstream task performance.

Contribution

The paper presents a novel data-centric approach with a human-in-the-loop system for curating detailed image descriptions, leading to substantial improvements over existing datasets and models.

Findings

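01 Major gains in human-rated description quality: +66% over recent datasets and +48% over GPT4V, across comprehensiveness, specificity, hallucinations, and more

02 Fine-tuning with IIW data improves these metrics by +31% over models trained with prior work, even with only 9k samples

03 Enhanced image generation fidelity and reasoning performance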

Abstract

Despite the longstanding adage "an image is worth a thousand words," generating accurate hyper-detailed image descriptions remains unsolved. Trained on short web-scraped image text, vision-language models often generate incomplete descriptions with visual inconsistencies. We address this via a novel data-centric approach with ImageInWords (IIW), a carefully designed human-in-the-loop framework for curating hyper-detailed image descriptions. Human evaluations on IIW data show major gains compared to recent datasets (+66%) and GPT4V (+48%) across comprehensiveness, specificity, hallucinations, and more. We also show that fine-tuning with IIW data improves these metrics by +31% against models trained with prior work, even with only 9k samples. Lastly, we evaluate IIW models with text-to-image generation and vision-language reasoning tasks. Our generated descriptions result in the highest…

Code & Models

Datasets

google/imageinwords
dataset · 89 dl

Taxonomy

Topics: Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications · Biomedical Text Mining and Ontologies