CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models

Kiet A. Nguyen, Adheesh Juvekar, Tianjiao Yu, Muntasir Wahed, Ismini Lourentzou

arXiv:2412.19331 · cs.CV · April 7, 2025

TL;DR

This paper introduces CALICO, a vision-language model designed for part-focused semantic co-segmentation across multiple images, enabling detailed object part reasoning with minimal parameter tuning.

Contribution

CALICO is the first LVLM tailored for multi-image part-level reasoning, featuring novel modules for semantic correspondence extraction and adaptation.

Findings

01 CALICO achieves strong performance on the MixedParts dataset.

02 It requires finetuning only 0.3% of its parameters.

03 The model effectively identifies common and unique object parts across images.
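
To make the 0.3% figure concrete, here is a minimal sketch of how a finetuned-parameter fraction is typically computed: trainable parameters divided by all parameters. The module names and parameter counts are hypothetical, chosen only so the ratio lands near 0.3%. They are not CALICO's actual sizes, which this page does not report.

```python
def trainable_fraction(param_counts):
    """Return the share of parameters that are trainable.

    param_counts maps a module name to (num_params, is_trainable).
    """
    total = sum(n for n, _ in param_counts.values())
    trainable = sum(n for n, is_trainable in param_counts.values() if is_trainable)
    if total == 0:
        raise ValueError("model has no parameters")
    return trainable / total


# Hypothetical counts for illustration only (not taken from the paper).
example_model = {
    "frozen_backbone": (7_000_000_000, False),
    "lightweight_adapters": (21_000_000, True),
}

fraction = trainable_fraction(example_model)
print(f"{fraction:.2%} of parameters are trainable")
```

Note that the denominator includes the added modules themselves, so 21M trainable parameters on a 7B frozen backbone gives slightly less than 21/7000.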

Abstract

Recent advances in Large Vision-Language Models (LVLMs) have enabled general-purpose vision tasks through visual instruction tuning. While existing LVLMs can generate segmentation masks from text prompts for single images, they struggle with segmentation-grounded reasoning across images, especially at finer granularities such as object parts. In this paper, we introduce the new task of part-focused semantic co-segmentation, which involves identifying and segmenting common objects, as well as common and unique object parts across images. To address this task, we present CALICO, the first LVLM designed for multi-image part-level reasoning segmentation. CALICO features two key components, a novel Correspondence Extraction Module that identifies semantic part-level correspondences, and Correspondence Adaptation Modules that embed this information into the LVLM to facilitate multi-image…
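
The abstract does not say how the Correspondence Extraction Module works internally, so the following is not CALICO's method. It is a generic sketch of one common way to find semantic correspondences between two images: mutual nearest neighbors under cosine similarity of feature vectors. The toy 2-D features and all function names are assumptions made for illustration. Features that match in both directions stand in for "common" parts, and leftover features stand in for "unique" ones.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mutual_nearest_neighbors(feats_a, feats_b):
    """Return (i, j) pairs where feats_a[i] and feats_b[j] pick each other."""
    if not feats_a or not feats_b:
        return []
    # Best match in B for every feature in A, and vice versa.
    best_for_a = [
        max(range(len(feats_b)), key=lambda j: cosine(a, feats_b[j]))
        for a in feats_a
    ]
    best_for_b = [
        max(range(len(feats_a)), key=lambda i: cosine(feats_a[i], b))
        for b in feats_b
    ]
    # Keep only the pairs that agree in both directions.
    return [(i, j) for i, j in enumerate(best_for_a) if best_for_b[j] == i]


# Toy part features (hypothetical): two parts in image A, three in image B.
feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[0.1, 0.9], [0.9, 0.1], [-1.0, -0.5]]

matches = mutual_nearest_neighbors(feats_a, feats_b)
matched_b = {j for _, j in matches}
unique_to_b = [j for j in range(len(feats_b)) if j not in matched_b]
print(matches, unique_to_b)
```

The mutual check is a design choice: a one-directional nearest neighbor always returns some match, even for a part that has no counterpart in the other image, whereas requiring agreement in both directions leaves such parts unmatched.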

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: ALIGN