Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions

Renjie Pi; Jianshu Zhang; Jipeng Zhang; Rui Pan; Zhekai Chen; Tong Zhang

arXiv:2406.07502 · cs.CV · June 12, 2024 · 3 cites

TL;DR

This paper introduces Image Textualization, an automatic framework that leverages multi-modal large language models and vision experts to generate detailed, high-quality image descriptions, addressing the limitations of existing datasets.

Contribution

The paper proposes a novel collaborative framework for automatic image description generation and introduces benchmarks for evaluating detailed descriptions.

Findings

01. IT framework produces richer, more detailed descriptions

02. LLaVA-7B trained on IT descriptions shows improved detail and reduced hallucination

03. Benchmark results validate the quality of generated descriptions

Abstract

Image description datasets play a crucial role in the advancement of various applications such as image understanding, text-to-image generation, and text-image retrieval. Currently, image description datasets primarily originate from two sources. One source is the scraping of image-text pairs from the web. Despite their abundance, these descriptions are often noisy and of low quality. The other is human labeling. Descriptions in human-labeled datasets such as COCO are generally very short and lack detail. Although detailed image descriptions can be annotated by humans, the high annotation cost limits feasibility. These limitations underscore the need for more efficient and scalable methods to generate accurate and detailed image descriptions. In this paper, we propose an innovative framework termed Image Textualization (IT), which automatically produces high-quality image descriptions by leveraging…

Code & Models

Repositories

sterzhang/image-textualization
PyTorch · Official

Datasets

Sterzhang/image-textualization
dataset · 216 dl

Taxonomy

Topics: Natural Language Processing Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications