TL;DR
RoMa is a dense feature matching model that combines frozen pretrained DINOv2 foundation features with specialized ConvNet fine features and a transformer match decoder, targeting robust and accurate matching under challenging real-world changes.
Contribution
It introduces a novel combination of frozen foundation features, a specialized transformer decoder, and an improved loss for robust dense matching.
Findings
Achieves a 36% improvement on the WxBS benchmark.
Sets a new state-of-the-art in dense feature matching robustness.
Effectively combines coarse foundation features with fine local features.
Abstract
Feature matching is an important computer vision task that involves estimating correspondences between two images of a 3D scene, and dense methods estimate all such correspondences. The aim is to learn a robust model, i.e., a model able to match under challenging real-world changes. In this work, we propose such a model, leveraging frozen pretrained features from the foundation model DINOv2. Although these features are significantly more robust than local features trained from scratch, they are inherently coarse. We therefore combine them with specialized ConvNet fine features, creating a precisely localizable feature pyramid. To further improve robustness, we propose a tailored transformer match decoder that predicts anchor probabilities, which enables it to express multimodality. Finally, we propose an improved loss formulation through regression-by-classification with subsequent…
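The abstract's point about anchor probabilities and multimodality can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a 1-D row of evenly spaced anchors and a plain argmax decoding, and the names `make_anchors`, `decode_mean`, and `decode_mode` are hypothetical helpers introduced only for illustration. It shows why a distribution over anchors can represent two plausible matches, whereas a single regressed coordinate (here approximated by the probability-weighted mean) lands between them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_anchors(k):
    """Hypothetical helper: k evenly spaced anchor positions (cell centres) in [-1, 1]."""
    return [-1.0 + (2.0 * i + 1.0) / k for i in range(k)]


def decode_mean(probs, anchors):
    """Regression-style decoding: probability-weighted mean of anchor positions."""
    return sum(p * a for p, a in zip(probs, anchors))


def decode_mode(probs, anchors):
    """Classification-style decoding: position of the most probable anchor."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return anchors[best]


# Assumed toy scores for one source pixel: two plausible matches,
# a stronger one at anchor index 1 and a slightly weaker one at index 6.
anchors = make_anchors(8)
logits = [0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 4.5, 0.0]
probs = softmax(logits)

mean_match = decode_mean(probs, anchors)  # falls between the two modes
mode_match = decode_mode(probs, anchors)  # commits to the stronger mode

print("anchors:", anchors)
print("mean decoding:", mean_match)
print("mode decoding:", mode_match)
```

In this toy setting the mean decoding points at a location that neither hypothesis supports, while the mode decoding selects one of them. A real dense matcher would work on 2-D anchor grids and typically refine the selected anchor further; those details are outside this sketch.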