Panoptic Captioning: An Equivalence Bridge for Image and Text
Kun-Yu Lin, Hongjun Wang, Weining Ren, Kai Han

TL;DR
This paper introduces panoptic captioning, a comprehensive image description task, and proposes a data engine, a captioning method, and an evaluation metric for it. The resulting model outperforms existing Multi-modal Large Language Models (MLLMs), both open-source and proprietary, on this task.
Contribution
It formulates panoptic captioning as a novel task, develops PancapEngine and PancapChain for data generation and stepwise captioning, and introduces PancapScore for evaluation.
Findings
PancapChain-13B outperforms state-of-the-art open-source models.
The proposed methods also surpass proprietary models such as GPT-4o and Gemini-2.0-Pro.
The proposed data engine and method significantly improve panoptic captioning performance.
Abstract
This work introduces panoptic captioning, a novel task striving to seek the minimum text equivalent of images, which has broad potential applications. We take the first step towards panoptic captioning by formulating it as a task of generating a comprehensive textual description for an image, which encapsulates all entities, their respective locations and attributes, relationships among entities, as well as global image state. Through an extensive evaluation, our work reveals that state-of-the-art Multi-modal Large Language Models (MLLMs) have limited performance in solving panoptic captioning. To address this, we propose an effective data engine named PancapEngine to produce high-quality data and a novel method named PancapChain to improve panoptic captioning. Specifically, our PancapEngine first detects diverse categories of entities in images by an elaborate detection suite, and then…
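The task formulation in the abstract (entities, their locations and attributes, relationships among entities, and global image state) can be read as a structured record that is then rendered to text. The Python sketch below is only an illustration of that reading, not the paper's representation: the class names, the (x1, y1, x2, y2) box format for locations, and the rendering template are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    # Hypothetical container for one entity mentioned in a caption.
    name: str
    # Assumed location format: (x1, y1, x2, y2) in pixels.
    box: tuple
    attributes: list = field(default_factory=list)


@dataclass
class Relation:
    # subject and object are indices into PanopticCaption.entities.
    subject: int
    predicate: str
    object: int


@dataclass
class PanopticCaption:
    entities: list
    relations: list
    global_state: str

    def to_text(self) -> str:
        """Render the record as one plain-text description."""
        parts = []
        for e in self.entities:
            loc = "[" + ", ".join(f"{v:g}" for v in e.box) + "]"
            desc = f"{e.name} at {loc}"
            if e.attributes:
                desc += " (" + ", ".join(e.attributes) + ")"
            parts.append(desc)
        for r in self.relations:
            subj = self.entities[r.subject].name
            obj = self.entities[r.object].name
            parts.append(f"{subj} {r.predicate} {obj}")
        parts.append(self.global_state)
        return ". ".join(parts) + "."
```

A record with two entities and one relation would render as a single description listing each entity with its box and attributes, then the relation, then the global state. A real system would generate free-form text rather than fill a fixed template; the template here only makes the five components visible.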
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
Methods: Sparse Evolutionary Training
