Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning

Zuyao You; Junke Wang; Lingyu Kong; Bo He; Zuxuan Wu

arXiv:2501.13893 · cs.CV · January 24, 2025

TL;DR

Pix2Cap-COCO introduces a pixel-level caption dataset and a new panoptic segmentation-captioning task, enabling models to generate detailed, instance-specific descriptions for fine-grained visual understanding.

Contribution

The paper presents the first panoptic pixel-level caption dataset and a novel task, panoptic segmentation-captioning, together with a baseline model, and demonstrates that supervised fine-tuning on the dataset improves the performance of large multimodal models.

Findings

01. Pix2Cap-COCO contains 167,254 detailed captions, with an average of 22.94 words per caption.

02. The dataset is challenging, requiring fine-grained visual and language understanding.

03. Fine-tuning with Pix2Cap-COCO improves model performance on multiple benchmarks.

Abstract

We present Pix2Cap-COCO, the first panoptic pixel-level caption dataset designed to advance fine-grained visual understanding. To achieve this, we carefully design an automated annotation pipeline that prompts GPT-4V to generate pixel-aligned, instance-specific captions for individual objects within images, enabling models to learn more granular relationships between objects and their contexts. This approach results in 167,254 detailed captions, with an average of 22.94 words per caption. Building on Pix2Cap-COCO, we introduce a novel task, panoptic segmentation-captioning, which challenges models to recognize instances in an image and provide detailed descriptions for each simultaneously. To benchmark this task, we design a robust baseline based on X-Decoder. The experimental results demonstrate that Pix2Cap-COCO is a particularly challenging dataset, as it requires models to excel in…
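To make the panoptic segmentation-captioning task more concrete, the following is a minimal sketch of what one prediction could look like as a data structure: a per-pixel segment map plus one caption per segment. The names `SegmentCaption`, `pixels_per_segment`, `uncaptioned_segments`, and `mean_caption_length` are hypothetical and chosen only for illustration; they are not taken from the paper or its repository, and the actual dataset format may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SegmentCaption:
    """One segment with its category and caption (hypothetical layout)."""
    segment_id: int
    category: str
    caption: str


def pixels_per_segment(panoptic_map: List[List[int]]) -> Dict[int, int]:
    """Count how many pixels are assigned to each segment id."""
    counts: Dict[int, int] = {}
    for row in panoptic_map:
        for seg_id in row:
            counts[seg_id] = counts.get(seg_id, 0) + 1
    return counts


def uncaptioned_segments(panoptic_map: List[List[int]],
                         segments: List[SegmentCaption]) -> List[int]:
    """Return segment ids that appear in the map but have no caption."""
    captioned = {s.segment_id for s in segments}
    return sorted(set(pixels_per_segment(panoptic_map)) - captioned)


def mean_caption_length(segments: List[SegmentCaption]) -> float:
    """Average number of whitespace-separated words per caption."""
    if not segments:
        return 0.0
    return sum(len(s.caption.split()) for s in segments) / len(segments)


# Toy 2x3 image: every pixel belongs to exactly one segment (1 or 2).
toy_map = [
    [1, 1, 2],
    [1, 2, 2],
]
toy_segments = [
    SegmentCaption(1, "dog", "a brown dog lying on grass"),
    SegmentCaption(2, "grass", "short green grass"),
]

print(pixels_per_segment(toy_map))
print(uncaptioned_segments(toy_map, toy_segments))
print(mean_caption_length(toy_segments))
```

The word count here uses simple whitespace splitting, which is an assumption; the abstract does not say how the 22.94-word average was computed.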

Code & Models

Repositories

geshang777/pix2cap
PyTorch · Official

Datasets

geshang/Pix2Cap-COCO
dataset · 98 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications