Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions
Renjie Pi, Jianshu Zhang, Jipeng Zhang, Rui Pan, Zhekai Chen, Tong Zhang

TL;DR
This paper introduces Image Textualization, an automatic framework that leverages multi-modal large language models and vision experts to generate detailed, high-quality image descriptions, addressing the limitations of existing datasets.
Contribution
The paper proposes a novel collaborative framework for automatic image description generation and introduces benchmarks for evaluating detailed descriptions.
Findings
The IT framework produces richer, more detailed descriptions
LLaVA-7B trained on IT descriptions shows improved detail and reduced hallucination
Benchmark results validate the quality of generated descriptions
Abstract
Image description datasets play a crucial role in the advancement of various applications such as image understanding, text-to-image generation, and text-image retrieval. Currently, image description datasets primarily originate from two sources. One source is the scraping of image-text pairs from the web. Despite their abundance, these descriptions are often of low quality and noisy. Another is through human labeling. Descriptions in datasets such as COCO are generally very short and lack detail. Although detailed image descriptions can be annotated by humans, the high annotation cost limits the feasibility. These limitations underscore the need for more efficient and scalable methods to generate accurate and detailed image descriptions. In this paper, we propose an innovative framework termed Image Textualization (IT), which automatically produces high-quality image descriptions by leveraging…
Taxonomy
Topics: Natural Language Processing Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
