Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval
Hang Cheng, Fanhe Dong, Long Zeng

TL;DR
This paper introduces Diff-SBSR, a zero-shot sketch-based 3D shape retrieval method that uses a frozen Stable Diffusion backbone enhanced with multimodal features from CLIP and BLIP, and reports outperforming prior methods on public benchmarks without retraining the backbone.
Contribution
It proposes a novel multimodal feature-enhanced strategy for diffusion models, enabling effective zero-shot sketch-based 3D retrieval without costly retraining.
Findings
Outperforms state-of-the-art methods on public benchmarks
Effectively leverages CLIP and BLIP for semantic enhancement
Demonstrates robustness to sketch noise and domain gap
Abstract
This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing sketch-based 3D shape retrieval methods struggle in zero-shot settings due to the absence of category supervision and the extreme sparsity of sketch inputs. Our key insight is that large-scale pretrained diffusion models inherently exhibit open-vocabulary capability and strong shape bias, making them well suited for zero-shot visual retrieval. We leverage a frozen Stable Diffusion backbone to extract and aggregate discriminative representations from intermediate U-Net layers for both sketches and rendered 3D views. Diffusion models struggle with sketches due to their extreme abstraction and sparsity, compounded by a significant domain gap from natural images. To address this limitation without costly retraining, we introduce a multimodal…
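The retrieval step described in the abstract (aggregate features from several intermediate layers for a sketch and for rendered 3D views, then compare them) can be illustrated with a minimal sketch. This is not the authors' implementation: the feature maps are assumed to have already been extracted from a frozen backbone, and the pooling, view averaging, and cosine ranking used here, along with the function names `aggregate_layers`, `shape_descriptor`, and `rank_shapes`, are hypothetical choices made only for illustration.

```python
import numpy as np

def aggregate_layers(feature_maps):
    """Pool each (C, H, W) map to a C-vector, L2-normalize it, and concatenate.

    Assumption: `feature_maps` is a list of arrays taken from several
    intermediate layers of a frozen backbone; extraction is out of scope here.
    """
    pooled = []
    for fmap in feature_maps:
        vec = fmap.mean(axis=(1, 2))                  # global average pooling
        vec = vec / (np.linalg.norm(vec) + 1e-8)      # per-layer normalization
        pooled.append(vec)
    return np.concatenate(pooled)

def shape_descriptor(view_feature_maps):
    """Average the aggregated descriptors of all rendered views of one shape."""
    views = np.stack([aggregate_layers(v) for v in view_feature_maps])
    return views.mean(axis=0)

def rank_shapes(sketch_desc, shape_descs):
    """Return gallery indices sorted by descending cosine similarity, plus scores."""
    q = sketch_desc / (np.linalg.norm(sketch_desc) + 1e-8)
    g = shape_descs / (np.linalg.norm(shape_descs, axis=1, keepdims=True) + 1e-8)
    scores = g @ q
    return np.argsort(-scores), scores
```

A query sketch is passed through `aggregate_layers`, each gallery shape through `shape_descriptor`, and `rank_shapes` orders the gallery. The multimodal enhancement with CLIP and BLIP mentioned in the abstract is not modeled in this sketch.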
