MUSE: Model-based Uncertainty-aware Similarity Estimation for zero-shot 2D Object Detection and Segmentation
Sungmin Cho, Sungbum Park, Insoo Oh

TL;DR
MUSE is a training-free, uncertainty-aware framework for zero-shot 2D object detection and segmentation that leverages multi-view templates and a joint similarity metric, achieving state-of-the-art results without additional training.
Contribution
MUSE introduces a novel training-free approach combining multi-view templates, a joint similarity metric, and uncertainty-aware refinement for zero-shot 2D object detection and segmentation.
Findings
Achieves state-of-the-art performance on BOP Challenge 2025.
Ranks first across multiple tracks without additional training.
Effectively combines global and local representations for robust matching.
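As a rough illustration of how a global class embedding and GeM-pooled patch embeddings might be combined, here is a minimal NumPy sketch. Generalized mean pooling itself is standard, but the function names, the exponent `p = 3`, the clipping to positive values, and the fusion by concatenation are all assumptions made for this example, not details confirmed by the paper.

```python
import numpy as np


def gem_pool(patches: np.ndarray, p: float = 3.0, eps: float = 1e-6) -> np.ndarray:
    """Generalized mean pooling over the patch axis.

    patches: array of shape (num_patches, dim).
    p = 1 gives average pooling; large p approaches max pooling.
    """
    # GeM is defined for non-negative activations, so clip at a small eps
    # (assumption: the paper may handle negative values differently).
    clipped = np.clip(patches, eps, None)
    return np.mean(clipped ** p, axis=0) ** (1.0 / p)


def l2_normalize(v: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale a vector to unit length."""
    return v / (np.linalg.norm(v) + eps)


def embed(cls_token: np.ndarray, patches: np.ndarray, p: float = 3.0) -> np.ndarray:
    """Hypothetical fusion: concatenate the normalized global (class) vector
    with the normalized GeM-pooled local (patch) vector."""
    return np.concatenate([l2_normalize(cls_token), l2_normalize(gem_pool(patches, p))])
```

With `p = 1` the pooled vector is the plain average of the patches, and raising `p` weights the strongest patch responses more heavily, which is why GeM sits between average and max pooling.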
Abstract
In this work, we introduce MUSE (Model-based Uncertainty-aware Similarity Estimation), a training-free framework designed for model-based zero-shot 2D object detection and segmentation. MUSE leverages 2D multi-view templates rendered from 3D unseen objects and 2D object proposals extracted from input query images. In the embedding stage, it integrates class and patch embeddings, where the patch embeddings are normalized using generalized mean pooling (GeM) to capture both global and local representations efficiently. During the matching stage, MUSE employs a joint similarity metric that combines absolute and relative similarity scores, enhancing the robustness of matching under challenging scenarios. Finally, the similarity score is refined through an uncertainty-aware object prior that adjusts for proposal reliability. Without any additional training or fine-tuning, MUSE achieves…
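The matching and refinement stages described above could be sketched as follows. The abstract does not give the formulas, so everything here is a stand-in chosen for illustration: the absolute score is the best cosine similarity over an object's template views, the relative score is a softmax over objects with a hypothetical temperature `tau`, the two are multiplied, and a per-proposal `objectness` confidence plays the role of the uncertainty-aware prior.

```python
import numpy as np


def cosine_matrix(proposals: np.ndarray, templates: np.ndarray) -> np.ndarray:
    """Cosine similarity between every proposal and every template view."""
    a = proposals / np.linalg.norm(proposals, axis=1, keepdims=True)
    b = templates / np.linalg.norm(templates, axis=1, keepdims=True)
    return a @ b.T  # shape (num_proposals, num_templates)


def joint_score(proposals, templates, template_obj, num_objects, objectness, tau=0.1):
    """Illustrative joint score of shape (num_proposals, num_objects).

    template_obj[i] is the object id of template view i.
    objectness[j] is an assumed reliability score for proposal j.
    """
    sim = cosine_matrix(proposals, templates)

    # Absolute similarity (assumption): best-matching view of each object.
    absolute = np.stack(
        [sim[:, template_obj == k].max(axis=1) for k in range(num_objects)], axis=1
    )

    # Relative similarity (assumption): softmax across objects, so a proposal
    # scores high only if one object clearly beats the others.
    logits = absolute / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    relative = np.exp(logits)
    relative = relative / relative.sum(axis=1, keepdims=True)

    # Combine both terms, then down-weight unreliable proposals
    # (assumption: a simple multiplicative prior).
    return absolute * relative * objectness[:, None]


# Toy data: 2 objects with 2 template views each, 2 proposals.
templates = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
template_obj = np.array([0, 0, 1, 1])
proposals = np.array([[1.0, 0.05], [0.05, 1.0]])
objectness = np.array([0.9, 0.5])

scores = joint_score(proposals, templates, template_obj, 2, objectness)
labels = scores.argmax(axis=1)  # predicted object id per proposal
```

Multiplying the absolute and relative terms means a proposal needs both a strong match and a clear margin over competing objects to score well; other combinations, such as a weighted sum, would be equally consistent with the abstract.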
Taxonomy
Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Visual Attention and Saliency Detection
