CLIP-Clique: Graph-based Correspondence Matching Augmented by Vision Language Models for Object-based Global Localization
Shigemichi Matsuzaki, Kazuhito Tanaka, Kazuhiro Shintani

TL;DR
This paper introduces CLIP-Clique, a global localization method that enhances semantic graph matching with Vision Language Model embeddings and deterministic inlier estimation, improving robustness and accuracy in object-based localization.
Contribution
The paper augments semantic graph matching with VLM embeddings and replaces stochastic, RANSAC-based inlier extraction with a deterministic graph-theoretic method for object-based global localization.
Findings
Improved matching accuracy on ScanNet and TUM datasets.
Enhanced pose estimation robustness with the proposed approach.
Demonstrated superiority over traditional RANSAC-based methods.
Abstract
This letter proposes a method of global localization on a map with semantic object landmarks. One of the most promising approaches for localization on object maps is to use semantic graph matching using landmark descriptors calculated from the distribution of surrounding objects. These descriptors are vulnerable to misclassification and partial observations. Moreover, many existing methods rely on inlier extraction using RANSAC, which is stochastic and sensitive to a high outlier rate. To address the former issue, we augment the correspondence matching using Vision Language Models (VLMs). Landmark discriminability is improved by VLM embeddings, which are independent of surrounding objects. In addition, inliers are estimated deterministically using a graph-theoretic approach. We also incorporate pose calculation using the weighted least squares considering correspondence similarity and…
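The abstract names two steps that are easier to follow in code: deterministic, graph-theoretic inlier estimation and weighted least-squares pose calculation. The sketch below is illustrative only and is not the authors' implementation. It assumes that two candidate correspondences are compatible when they preserve the distance between landmarks, that the inlier set is a maximum clique of the resulting compatibility graph (found here with Bron-Kerbosch with pivoting), and that the pose is a weighted rigid alignment solved by SVD. The function names, the `tol` threshold, and the weights `w` are hypothetical.

```python
import numpy as np

def consistency_graph(obs, ref, pairs, tol=0.2):
    """Adjacency sets over candidate correspondences (obs index, map index).
    Assumed criterion: two candidates are compatible when the distance between
    their observed landmarks matches the distance between their map landmarks."""
    n = len(pairs)
    adj = [set() for _ in range(n)]
    for a in range(n):
        ia, ja = pairs[a]
        for b in range(a + 1, n):
            ib, jb = pairs[b]
            if ia == ib or ja == jb:
                continue  # enforce one-to-one matching
            d_obs = np.linalg.norm(obs[ia] - obs[ib])
            d_ref = np.linalg.norm(ref[ja] - ref[jb])
            if abs(d_obs - d_ref) < tol:
                adj[a].add(b)
                adj[b].add(a)
    return adj

def max_clique(adj):
    """Largest clique via Bron-Kerbosch with pivoting (deterministic, exact)."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = list(r)
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r + [v], p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand([], set(range(len(adj))), set())
    return sorted(best)

def weighted_rigid_transform(src, dst, w):
    """Weighted least-squares R, t minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    mu_s, mu_d = w @ src, w @ dst
    H = (src - mu_s).T @ (w[:, None] * (dst - mu_d))
    U, _, Vt = np.linalg.svd(H)
    D = np.eye(src.shape[1])
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    D[-1, -1] = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

Unlike RANSAC, none of these steps draws random samples, so the same candidate set always yields the same inliers and pose. The exact clique search is exponential in the worst case, which is acceptable only for the small candidate sets assumed here.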
Taxonomy
Methods: Batch Normalization · Convolution · 1x1 Convolution · Thinned U-shape Module
