Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval

Hang Cheng; Fanhe Dong; Long Zeng

arXiv:2604.19135 · cs.CV · April 22, 2026

TL;DR

This paper introduces Diff-SBSR, a zero-shot sketch-based 3D shape retrieval (ZS-SBSR) method that extracts features from a frozen Stable Diffusion backbone and enhances them with multimodal features from CLIP and BLIP. It reports outperforming state-of-the-art methods on public benchmarks without retraining the diffusion model.

Contribution

The paper proposes a multimodal feature-enhancement strategy for frozen diffusion models, enabling zero-shot sketch-based 3D shape retrieval without costly retraining.

Findings

01. Outperforms state-of-the-art methods on public benchmarks

02. Effectively leverages CLIP and BLIP for semantic enhancement

03. Demonstrates robustness to sketch noise and domain gap

Abstract

This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing sketch-based 3D shape retrieval methods struggle in zero-shot settings due to the absence of category supervision and the extreme sparsity of sketch inputs. Our key insight is that large-scale pretrained diffusion models inherently exhibit open-vocabulary capability and strong shape bias, making them well suited for zero-shot visual retrieval. We leverage a frozen Stable Diffusion backbone to extract and aggregate discriminative representations from intermediate U-Net layers for both sketches and rendered 3D views. Diffusion models struggle with sketches due to their extreme abstraction and sparsity, compounded by a significant domain gap from natural images. To address this limitation without costly retraining, we introduce a multimodal…
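
The retrieval step described in the abstract (aggregate intermediate U-Net features for a sketch and for rendered 3D views, then compare them) can be illustrated with a minimal sketch. It assumes the per-layer feature maps have already been extracted from a frozen backbone and are given as plain nested lists. The global average pooling, the L2 normalization, the max-over-views scoring, and every function name here are illustrative assumptions, not details confirmed by the abstract. The multimodal enhancement is not covered, because the abstract is truncated before describing it.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def pool_feature_map(fmap):
    """Global-average-pool one feature map given as channels x positions."""
    return [sum(channel) / len(channel) for channel in fmap]


def aggregate_layers(layer_maps):
    """Pool and normalize each layer, concatenate, then normalize the result.

    Assumption: equal weighting of layers; the paper may aggregate differently.
    """
    descriptor = []
    for fmap in layer_maps:
        descriptor.extend(l2_normalize(pool_feature_map(fmap)))
    return l2_normalize(descriptor)


def cosine(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def rank_shapes(sketch_desc, shape_views):
    """Rank shape ids by their best-matching rendered view (hypothetical rule).

    shape_views maps shape_id -> list of unit-length view descriptors.
    """
    scores = {
        shape_id: max(cosine(sketch_desc, view) for view in views)
        for shape_id, views in shape_views.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

A toy call would build one descriptor per sketch with `aggregate_layers`, one per rendered view of each 3D shape, and pass them to `rank_shapes`. Real feature maps would come from forward hooks on the U-Net, which this sketch leaves out.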
