Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation

Jiho Choi; Seonho Lee; Minhyun Lee; Seungho Lee; Hyunjung Shim

arXiv:2501.09688 · cs.CV · August 11, 2025

TL;DR

This paper introduces PartCATSeg, a framework for open-vocabulary part segmentation that improves fine-grained image-text correspondence and structural understanding using cost aggregation, compositional loss, and DINO features.

Contribution

The paper proposes a novel cost aggregation strategy, a compositional loss, and structural guidance from DINO to enhance open-vocabulary part segmentation performance.

Findings

1. Outperforms state-of-the-art on Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets.

2. Significantly improves boundary delineation and inter-part understanding.

3. Sets new benchmarks for generalization to unseen part categories.

Abstract

Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained parts in unseen categories. We identify two primary challenges in OVPS: (1) the difficulty in aligning part-level image-text correspondence, and (2) the lack of structural understanding in segmenting object parts. To address these issues, we propose PartCATSeg, a novel framework that integrates object-aware part-level cost aggregation, compositional loss, and structural guidance from DINO. Our approach employs a disentangled cost aggregation strategy that handles object and part-level costs separately, enhancing the precision of part-level segmentation. We also introduce a compositional loss to better capture part-object relationships, compensating for the limited part annotations. Additionally, structural guidance from DINO features improves boundary delineation and inter-part understanding.…
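As a rough illustration of what "handling object and part-level costs separately" can mean, the sketch below builds two cost volumes from cosine similarity between per-pixel image features and text embeddings, one for object names and one for part names. It is a minimal sketch under stated assumptions, not the PartCATSeg implementation: the function name `cost_volume` is hypothetical, random arrays stand in for real image and text encoder outputs, and the aggregation layers that would refine these volumes are omitted.

```python
import numpy as np


def cost_volume(image_feats, text_embeds):
    """Cosine similarity between each pixel feature and each text embedding.

    image_feats: array of shape (H, W, D)
    text_embeds: array of shape (N, D)
    returns:     array of shape (H, W, N)
    """
    # L2-normalise both sides so the dot product equals cosine similarity.
    img = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    txt = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    return img @ txt.T


# Placeholder inputs (assumption): random features instead of encoder outputs.
rng = np.random.default_rng(0)
H, W, D = 4, 4, 8
image_feats = rng.normal(size=(H, W, D))
object_text = rng.normal(size=(2, D))  # stand-ins for e.g. "dog", "cat"
part_text = rng.normal(size=(3, D))    # stand-ins for e.g. "head", "leg", "tail"

# Two separate volumes, kept apart so each could be aggregated on its own.
object_cost = cost_volume(image_feats, object_text)  # shape (H, W, 2)
part_cost = cost_volume(image_feats, part_text)      # shape (H, W, 3)
```

Keeping the two volumes apart is the point of the illustration: an object-level volume can supply context about which object a pixel belongs to, while the part-level volume carries the finer-grained scores. How the paper combines them is not described in the text above.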

Code & Models

Repositories

kaist-cvml/part-catseg
jax · Official

Taxonomy

Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Image Retrieval and Classification Techniques

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Layer Normalization · Dense Connections · Residual Connection · Softmax · Vision Transformer · self-DIstillation with NO labels