Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning
Zuyao You, Junke Wang, Lingyu Kong, Bo He, Zuxuan Wu

TL;DR
Pix2Cap-COCO introduces a pixel-level caption dataset and a new panoptic segmentation-captioning task, enabling models to generate detailed, instance-specific descriptions for fine-grained visual understanding.
Contribution
The paper presents the first panoptic pixel-level caption dataset and a novel task, together with a baseline model, and demonstrates that supervised fine-tuning on the dataset improves the performance of large multimodal models.
Findings
Pix2Cap-COCO contains 167,254 detailed captions, averaging 22.94 words per caption.
The dataset is challenging, requiring fine-grained visual and language understanding.
Fine-tuning with Pix2Cap-COCO improves model performance on multiple benchmarks.
Abstract
We present Pix2Cap-COCO, the first panoptic pixel-level caption dataset designed to advance fine-grained visual understanding. To achieve this, we carefully design an automated annotation pipeline that prompts GPT-4V to generate pixel-aligned, instance-specific captions for individual objects within images, enabling models to learn more granular relationships between objects and their contexts. This approach results in 167,254 detailed captions, with an average of 22.94 words per caption. Building on Pix2Cap-COCO, we introduce a novel task, panoptic segmentation-captioning, which challenges models to recognize instances in an image and provide detailed descriptions for each simultaneously. To benchmark this task, we design a robust baseline based on X-Decoder. The experimental results demonstrate that Pix2Cap-COCO is a particularly challenging dataset, as it requires models to excel in…
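To make the task concrete, the following is a minimal sketch of what a panoptic segmentation-captioning output could look like: a per-pixel segment-id map plus one caption per segment. The class `SegmentCaption`, its field names, the helper functions, and the toy data are all hypothetical illustrations and are not taken from the Pix2Cap-COCO release or the X-Decoder baseline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentCaption:
    """One segment with its caption (hypothetical fields, not the dataset schema)."""
    segment_id: int  # id used for this segment in the panoptic map
    category: str    # class label for the segment
    caption: str     # detailed, instance-specific description


def pixel_area(panoptic_map: List[List[int]], segment_id: int) -> int:
    """Count the pixels assigned to one segment in the id map."""
    return sum(row.count(segment_id) for row in panoptic_map)


def mean_caption_length(segments: List[SegmentCaption]) -> float:
    """Average number of whitespace-separated words per caption."""
    if not segments:
        return 0.0
    return sum(len(s.caption.split()) for s in segments) / len(segments)


# Toy 3x4 panoptic map: every pixel carries exactly one segment id.
panoptic_map = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 3, 3],
]

# Made-up captions, one per segment id in the map above.
segments = [
    SegmentCaption(1, "dog", "A small brown dog sitting on the grass"),
    SegmentCaption(2, "person", "A person in a red jacket holding a leash"),
    SegmentCaption(3, "grass", "A patch of short green grass"),
]

for seg in segments:
    print(seg.segment_id, seg.category, pixel_area(panoptic_map, seg.segment_id))
print(round(mean_caption_length(segments), 2))
```

The same word-count helper, applied to the real captions, is the kind of computation that would yield a statistic such as the reported average caption length, assuming words are split on whitespace.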
Taxonomy
Topics: Multimodal Machine Learning Applications
