Optimizing Multi-Modal Models for Image-Based Shape Retrieval: The Role of Pre-Alignment and Hard Contrastive Learning

Paul Julius Kühn; Cedric Spengler; Michael Weinmann; Arjan Kuijper; Saptarshi Neil Sinha

arXiv:2603.06982·cs.CV·March 10, 2026

Open Access

TL;DR

This paper introduces a novel approach for image-based 3D shape retrieval that leverages pre-aligned multi-modal encoders and a hard contrastive loss, achieving state-of-the-art results without view synthesis or domain-specific retraining.

Contribution

It proposes using pre-aligned image and shape encoders for zero-shot and supervised retrieval, along with a multi-modal hard contrastive loss to enhance performance, bypassing view synthesis.
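
The summary above does not spell out the exact form of the multi-modal hard contrastive loss. As a rough illustration only, the sketch below shows one generic way to make an InfoNCE-style loss emphasize hard negatives: each negative is reweighted in proportion to exp(beta * similarity). The function name, `beta`, and `temperature` are hypothetical and are not taken from the paper.

```python
import numpy as np


def _normalize(x):
    """L2-normalize each row so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)


def _one_direction(logits, beta):
    """Hard-negative-weighted InfoNCE for one query direction (rows = queries)."""
    n = logits.shape[0]
    neg_mask = ~np.eye(n, dtype=bool)  # off-diagonal entries are negatives

    # Negative weights proportional to exp(beta * logit), rescaled so that
    # they average to 1 over the n - 1 negatives of each row.
    w_logits = np.where(neg_mask, beta * logits, -np.inf)
    w_logits = w_logits - w_logits.max(axis=1, keepdims=True)
    w = np.exp(w_logits)
    w = w / w.sum(axis=1, keepdims=True) * (n - 1)

    # The positive (diagonal) keeps weight 1; negatives use the weights above.
    weights = np.where(neg_mask, w, 1.0)

    # Numerically stable log(exp(pos) + sum_j w_ij * exp(neg_ij)).
    row_max = logits.max(axis=1)
    denom = (weights * np.exp(logits - row_max[:, None])).sum(axis=1)
    log_denom = row_max + np.log(denom)

    return float(np.mean(log_denom - np.diag(logits)))


def hard_contrastive_loss(image_emb, shape_emb, temperature=0.07, beta=1.0):
    """Symmetric image<->shape loss; row i of each input is a matching pair.

    Hypothetical sketch: beta controls how strongly hard negatives are
    emphasized (beta = 0 gives uniform weights).
    """
    image_emb = _normalize(np.asarray(image_emb, dtype=np.float64))
    shape_emb = _normalize(np.asarray(shape_emb, dtype=np.float64))
    if image_emb.shape != shape_emb.shape or image_emb.shape[0] < 2:
        raise ValueError("need two equally shaped batches with at least 2 pairs")

    logits = image_emb @ shape_emb.T / temperature
    return 0.5 * (_one_direction(logits, beta) + _one_direction(logits.T, beta))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))
    shapes = images + 0.1 * rng.normal(size=(8, 16))  # noisy matching pairs
    print(hard_contrastive_loss(images, shapes, beta=0.0))
    print(hard_contrastive_loss(images, shapes, beta=1.0))
```

With beta = 0 the weights are uniform and the sketch reduces to the standard symmetric InfoNCE loss; larger beta shifts more of the penalty onto the negatives that are most similar to the query.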

Findings

01  Achieves state-of-the-art accuracy on multiple datasets.

02  Outperforms existing methods in zero-shot and supervised settings.

03  Training with HCL improves retrieval on shape-centric datasets.

Abstract

Image-based shape retrieval (IBSR) aims to retrieve 3D models from a database given a query image, hence addressing a classical task in computer vision, computer graphics, and robotics. Recent approaches typically rely on bridging the domain gap between 2D images and 3D shapes based on the use of multi-view renderings as well as task-specific metric learning to embed shapes and images into a common latent space. In contrast, we address IBSR through large-scale multi-modal pretraining and show that explicit view-based supervision is not required. Inspired by pre-aligned image–point-cloud encoders from ULIP and OpenShape that have been used for tasks such as 3D shape classification, we propose the use of pre-aligned image and shape encoders for zero-shot and standard IBSR by embedding images and point clouds into a shared representation space and performing retrieval via similarity…
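
Retrieval via similarity in a shared embedding space can be sketched as a nearest-neighbor search over cosine similarities. The minimal example below assumes the image and point-cloud embeddings have already been produced by some pre-aligned encoders (for instance ULIP- or OpenShape-style models); the function names and the random stand-in embeddings are illustrative assumptions, not the authors' code.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve_top_k(image_emb, shape_embs, k=5):
    """Rank database shapes by cosine similarity to one query image embedding.

    image_emb:  (D,) embedding of the query image (assumed precomputed).
    shape_embs: (N, D) embeddings of the database shapes (assumed precomputed).
    Returns the indices of the k most similar shapes and their similarities.
    """
    query = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    database = l2_normalize(np.asarray(shape_embs, dtype=np.float64))

    sims = database @ query               # cosine similarity per shape, (N,)
    k = min(k, sims.shape[0])
    order = np.argsort(-sims, kind="stable")[:k]  # highest similarity first
    return order, sims[order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-ins for encoder outputs: 100 shapes, 64-dim embeddings.
    shape_embs = rng.normal(size=(100, 64))
    # A query built as a slightly perturbed copy of shape 42's embedding.
    query = shape_embs[42] + 0.05 * rng.normal(size=64)
    indices, scores = retrieve_top_k(query, shape_embs, k=3)
    print(indices, scores)
```

Because both sides are normalized once, the whole database can be scored with a single matrix-vector product, which is why no per-query view rendering is needed at retrieval time in this kind of setup.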

Taxonomy

Topics: 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization · Medical Image Segmentation Techniques