Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation
Jiho Choi, Seonho Lee, Minhyun Lee, Seungho Lee, Hyunjung Shim

TL;DR
This paper introduces PartCATSeg, a framework for open-vocabulary part segmentation that improves fine-grained image-text correspondence and structural understanding using cost aggregation, compositional loss, and DINO features.
Contribution
The paper proposes a novel cost aggregation strategy, a compositional loss, and structural guidance from DINO to enhance open-vocabulary part segmentation performance.
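The summary does not define the compositional loss, so the following is only an illustrative sketch of one plausible reading: the probabilities predicted for an object's parts, summed per pixel, should reproduce that object's mask. The function name `compositional_loss`, the binary cross-entropy form, and the toy values are all assumptions, not the paper's formulation.

```python
import math

def compositional_loss(part_probs, object_mask, eps=1e-7):
    """Mean binary cross-entropy between summed part probabilities and an object mask.

    part_probs:  one list per pixel, holding the probabilities of each part
                 of a single object (hypothetical model output).
    object_mask: one 0/1 label per pixel for that object.
    """
    total = 0.0
    for probs, label in zip(part_probs, object_mask):
        # Aggregate part evidence into an object-level probability, clamped for log().
        p = min(max(sum(probs), eps), 1.0 - eps)
        total += -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
    return total / len(object_mask)

# Two pixels: the first lies on the object, the second on background.
mask = [1, 0]
consistent = [[0.60, 0.39], [0.005, 0.005]]   # parts add up to the object
inconsistent = [[0.10, 0.10], [0.40, 0.40]]   # parts contradict the object mask

loss_consistent = compositional_loss(consistent, mask)
loss_inconsistent = compositional_loss(inconsistent, mask)
```

Under this assumed form, object-level masks can supervise part predictions even where part annotations are missing, which matches the motivation the abstract gives for the loss.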
Findings
Outperforms prior state-of-the-art methods on the Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets.
Significantly improves boundary delineation and inter-part understanding.
Reports new best results for generalization to unseen part categories.
Abstract
Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained parts in unseen categories. We identify two primary challenges in OVPS: (1) the difficulty in aligning part-level image-text correspondence, and (2) the lack of structural understanding in segmenting object parts. To address these issues, we propose PartCATSeg, a novel framework that integrates object-aware part-level cost aggregation, compositional loss, and structural guidance from DINO. Our approach employs a disentangled cost aggregation strategy that handles object and part-level costs separately, enhancing the precision of part-level segmentation. We also introduce a compositional loss to better capture part-object relationships, compensating for the limited part annotations. Additionally, structural guidance from DINO features improves boundary delineation and inter-part understanding.…
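As a rough illustration of the disentangled strategy described above, the sketch below builds two separate cost volumes, one against object names and one against part names, using cosine similarity between patch and text embeddings. The 2-D embeddings, the class names in the comments, and the helper names `cosine` and `cost_volume` are hypothetical stand-ins; the aggregation network that would refine these volumes is not shown and is not specified in this summary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cost_volume(patch_embeds, text_embeds):
    """Return a [num_patches][num_classes] table of cosine similarities."""
    return [[cosine(p, t) for t in text_embeds] for p in patch_embeds]

# Toy embeddings (assumed values standing in for image and text encoder outputs).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
object_text = [[1.0, 0.0], [0.0, 1.0]]   # e.g. "dog", "cat"
part_text = [[1.0, 0.2], [0.2, 1.0]]     # e.g. "dog's head", "cat's tail"

# Object-level and part-level costs are kept apart so each can be processed separately.
object_cost = cost_volume(patches, object_text)
part_cost = cost_volume(patches, part_text)
```

Each row of `object_cost` or `part_cost` scores one image patch against every candidate name; taking the highest-scoring column per patch would give a crude segmentation before any aggregation is applied.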
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Image Retrieval and Classification Techniques
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Layer Normalization · Dense Connections · Residual Connection · Softmax · Vision Transformer · self-DIstillation with NO labels
