Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Yu-Guan Hsieh, Cheng-Yu Hsieh, Shih-Ying Yeh, Louis Béthune, Hadi Pour Ansari, Pavan Kumar Anasosalu Vasu, Chun-Liang Li, Ranjay Krishna, Oncel Tuzel, Marco Cuturi

TL;DR
This paper introduces graph-based captioning (GBC), a novel annotation strategy that encodes images as labeled graphs with hierarchical and relational information, improving vision-language models and enabling better text-to-image generation.
Contribution
The work proposes GBC, a new graph-structured annotation method for images, and creates GBC10M, a large dataset, demonstrating its benefits for model performance and image generation.
Findings
GBC annotations significantly improve model performance on benchmarks.
Automatic GBC generation is feasible with existing multimodal models.
Incorporating GBC in text-to-image tasks enhances generation quality.
Abstract
Humans describe complex scenes with compositionality, using simple text descriptions enriched with links and relationships. While vision-language research has aimed to develop models with compositional understanding capabilities, this is not reflected yet in existing datasets which, for the most part, still use plain text to describe images. In this work, we propose a new annotation strategy, graph-based captioning (GBC), that describes an image using a labeled graph structure, with nodes of various types. The nodes in GBC are created through a two-stage process: first, identifying and describing entity nodes; second, linking these nodes by highlighting compositions and relations among them. Since all GBC nodes hold plain text descriptions, GBC retains the flexibility found in natural language, but can also encode hierarchical information in its edges. We…
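The labeled-graph idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' data format: the class names, the node-type labels, the edge direction (from a linking node to the nodes it refers to), and the example captions are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    node_type: str  # assumed labels: "entity", "composition", "relation"
    caption: str    # every node holds a plain-text description
    children: list = field(default_factory=list)  # ids of linked nodes


class CaptionGraph:
    """Toy labeled graph: text lives in nodes, hierarchy lives in edges."""

    def __init__(self):
        self.nodes = {}

    def add_node(self, node_id, node_type, caption):
        self.nodes[node_id] = Node(node_id, node_type, caption)

    def link(self, parent_id, child_id):
        # Directed edge from a linking node to a node it refers to
        # (the direction is an assumption of this sketch).
        self.nodes[parent_id].children.append(child_id)

    def captions_under(self, node_id):
        """Collect captions reachable from node_id, depth-first."""
        seen, out, stack = set(), [], [node_id]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            out.append(self.nodes[current].caption)
            # Reverse so children are visited in insertion order.
            stack.extend(reversed(self.nodes[current].children))
        return out


def build_example():
    g = CaptionGraph()
    # Stage 1: identify and describe entity nodes (made-up captions).
    g.add_node("dog", "entity", "a brown dog")
    g.add_node("ball", "entity", "a red ball")
    # Stage 2: link the entities through a relation node.
    g.add_node("rel", "relation", "the dog chases the ball")
    g.link("rel", "dog")
    g.link("rel", "ball")
    return g
```

Calling `build_example().captions_under("rel")` walks from the relation node down to the entities it connects, which shows how plain-text captions and graph edges can coexist in one annotation.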
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training
