Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation

Maximilian Ulmer, Wout Boerdijk, Rudolph Triebel, and Maximilian Durner

arXiv:2508.04122 · cs.CV · August 7, 2025

TL;DR

This paper introduces OC-DiT, a diffusion-based framework for zero-shot instance segmentation. It generates instance masks conditioned on object templates and image features, and reports state-of-the-art results without retraining on the target datasets.

Contribution

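The paper presents a novel conditional latent diffusion model for zero-shot instance segmentation. It comes in two variants: a coarse model that generates initial object instance proposals and a refinement model that refines all proposals in parallel. Both are trained on a newly created, large-scale synthetic dataset.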

Findings

01 Achieves state-of-the-art performance on real-world benchmarks.

02 Effectively disentangles object instances through the diffusion process.

03 Operates without retraining on target datasets.

Abstract

This paper presents OC-DiT, a novel class of diffusion models designed for object-centric prediction, and applies it to zero-shot instance segmentation. We propose a conditional latent diffusion framework that generates instance masks by conditioning the generative process on object templates and image features within the diffusion model's latent space. This allows our model to effectively disentangle object instances through the diffusion process, which is guided by visual object descriptors and localized image cues. Specifically, we introduce two model variants: a coarse model for generating initial object instance proposals, and a refinement model that refines all proposals in parallel. We train these models on a newly created, large-scale synthetic dataset comprising thousands of high-quality object meshes. Remarkably, our model achieves state-of-the-art performance on multiple…
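
To make the idea of conditioning a generative diffusion process more concrete, here is a minimal sketch of generic DDPM-style ancestral sampling in a latent space with a conditioning input. It is not the OC-DiT sampler, architecture, or latent encoder, none of which are specified in the text above. Everything named here is an assumption for illustration: the linear noise schedule, the step count, the function names, and above all `toy_noise_predictor`, a hypothetical stand-in for a learned network that "cheats" by treating the conditioning array as the clean latent. In the paper the conditioning consists of object templates and image features.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed, commonly used choice)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # cumulative product of alphas up to step t
    return betas, alphas, alpha_bars


def toy_noise_predictor(z_t, alpha_bar_t, cond):
    """Hypothetical stand-in for a learned eps_theta(z_t, t, cond).

    Assumes the clean latent equals `cond`, so the noise can be solved for
    exactly from z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps.
    A real model would be a trained network, not a closed form.
    """
    return (z_t - np.sqrt(alpha_bar_t) * cond) / np.sqrt(1.0 - alpha_bar_t)


def sample_conditional(cond, num_steps=50, seed=0):
    """DDPM ancestral sampling of a latent, guided by `cond` at every step."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)

    # Start from pure Gaussian noise in the latent space.
    z = rng.standard_normal(cond.shape)

    for t in reversed(range(num_steps)):
        eps_hat = toy_noise_predictor(z, alpha_bars[t], cond)
        # Posterior mean of z_{t-1} given z_t and the predicted noise.
        mean = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Inject fresh noise on every step except the last.
            z = mean + np.sqrt(betas[t]) * rng.standard_normal(cond.shape)
        else:
            z = mean
    return z
```

Calling `sample_conditional(np.full((4, 8, 8), 0.5))` returns an array of the same shape as the conditioning input. Because the toy predictor assumes the clean latent equals the conditioning, the sample collapses onto it; with a trained network the conditioning would steer the sample rather than determine it outright.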
