ImageInWords: Unlocking Hyper-Detailed Image Descriptions
Roopal Garg, Andrea Burns, Burcu Karagol Ayan, Yonatan Bitton, Ceslee Montgomery, Yasumasa Onoe, Andrew Bunner, Ranjay Krishna, Jason Baldridge, and Radu Soricut

TL;DR
This paper introduces ImageInWords, a human-in-the-loop framework for creating hyper-detailed image descriptions that significantly improve description quality and downstream task performance.
Contribution
The paper presents a novel data-centric approach with a human-in-the-loop system for curating detailed image descriptions, leading to substantial improvements over existing datasets and models.
Findings
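Major gains in human-evaluated description quality: +66% over recent datasets and +48% over GPT4V, across comprehensiveness, specificity, hallucinations, and more
Fine-tuning with IIW data improves these metrics by +31% over models trained with prior work, even with only 9k samples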
Enhanced image generation fidelity and reasoning performance
Abstract
Despite the longstanding adage "an image is worth a thousand words," generating accurate hyper-detailed image descriptions remains unsolved. Trained on short web-scraped image text, vision-language models often generate incomplete descriptions with visual inconsistencies. We address this via a novel data-centric approach with ImageInWords (IIW), a carefully designed human-in-the-loop framework for curating hyper-detailed image descriptions. Human evaluations on IIW data show major gains compared to recent datasets (+66%) and GPT4V (+48%) across comprehensiveness, specificity, hallucinations, and more. We also show that fine-tuning with IIW data improves these metrics by +31% against models trained with prior work, even with only 9k samples. Lastly, we evaluate IIW models with text-to-image generation and vision-language reasoning tasks. Our generated descriptions result in the highest…
Taxonomy
Topics: Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications · Biomedical Text Mining and Ontologies
