Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions

Yu-Guan Hsieh; Cheng-Yu Hsieh; Shih-Ying Yeh; Louis Béthune; Hadi Pour Ansari; Pavan Kumar Anasosalu Vasu; Chun-Liang Li; Ranjay Krishna; Oncel Tuzel; Marco Cuturi

arXiv:2407.06723 · cs.CV · February 28, 2025

Open Access 1 Models 2 Datasets

TL;DR

This paper introduces graph-based captioning (GBC), a novel annotation strategy that encodes images as labeled graphs with hierarchical and relational information, improving vision-language models and enabling better text-to-image generation.

Contribution

The work proposes GBC, a new graph-structured annotation method for images, and creates GBC10M, a large dataset, demonstrating its benefits for model performance and image generation.

Findings

01. GBC annotations significantly improve model performance on benchmarks.
02. Automatic GBC generation is feasible with existing multimodal models.
03. Incorporating GBC in text-to-image tasks enhances generation quality.

Abstract

Humans describe complex scenes with compositionality, using simple text descriptions enriched with links and relationships. While vision-language research has aimed to develop models with compositional understanding capabilities, this is not yet reflected in existing datasets which, for the most part, still use plain text to describe images. In this work, we propose a new annotation strategy, graph-based captioning (GBC), that describes an image using a labeled graph structure, with nodes of various types. The nodes in GBC are created through a two-stage process: first, identifying and describing entity nodes; second, linking these nodes by highlighting compositions and relations among them. Since all GBC nodes hold plain text descriptions, GBC retains the flexibility found in natural language, but can also encode hierarchical information in its edges. We…
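To make the structure described in the abstract concrete, here is a minimal Python sketch of a labeled caption graph in which every node holds a plain-text description and edges carry the hierarchy. This is an illustration only, not the authors' data format: the class names, the `node_type` strings, the edge labels, and the toy captions are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    node_type: str  # assumed labels: "image", "entity", "relation", "composition"
    caption: str    # every node holds a plain-text description
    children: list = field(default_factory=list)  # (edge_label, child_id) pairs


@dataclass
class CaptionGraph:
    nodes: dict = field(default_factory=dict)

    def add_node(self, node_id, node_type, caption):
        self.nodes[node_id] = Node(node_id, node_type, caption)

    def add_edge(self, parent_id, child_id, label):
        # A directed, labeled edge encodes the hierarchy between descriptions.
        self.nodes[parent_id].children.append((label, child_id))

    def descendants(self, node_id):
        """Return the ids of all nodes reachable from node_id."""
        seen, stack = [], [node_id]
        while stack:
            current = stack.pop()
            for _, child_id in self.nodes[current].children:
                if child_id not in seen:
                    seen.append(child_id)
                    stack.append(child_id)
        return seen


def build_example():
    """Toy graph for a hypothetical image; all captions are invented."""
    g = CaptionGraph()
    g.add_node("image", "image", "A dog chases a ball across a lawn.")
    # Stage 1: identify and describe entity nodes.
    g.add_node("dog", "entity", "A brown dog running with its mouth open.")
    g.add_node("ball", "entity", "A small red ball in mid-air.")
    g.add_edge("image", "dog", "dog")
    g.add_edge("image", "ball", "ball")
    # Stage 2: link entity nodes through a node describing their relation.
    g.add_node("chase", "relation", "The dog is chasing the ball.")
    g.add_edge("image", "chase", "relation")
    g.add_edge("chase", "dog", "dog")
    g.add_edge("chase", "ball", "ball")
    return g
```

Calling `build_example().descendants("image")` walks the edges from the image-level node down to the entity and relation nodes, which is one simple way to collect every region description attached to an image.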

Code & Models

Models

graph-based-captions/GBC10M-PromptGen-200M
model · 28 dl · ♡ 4

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training