Segment Anything Model is a Good Teacher for Local Feature Learning
Jingqian Wu, Rongtao Xu, Zach Wood-Doughty, Changwei Wang, Shibiao Xu, Edmund Y. Lam

TL;DR
This paper introduces SAMFeat, a novel local feature learning framework guided by the Segment Anything Model (SAM), which enhances feature description and detection through semantic relation distillation, semantic grouping, and edge attention, achieving superior results in image matching and localization.
Contribution
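The paper presents SAMFeat, a new approach that uses SAM as a teacher for local feature learning. It introduces three techniques to improve performance on limited datasets: Attention-weighted Semantic Relation Distillation (ASRD), Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC), and Edge Attention Guidance (EAG).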
Findings
Outperforms previous local features on image matching tasks.
Achieves superior results in long-term visual localization.
Demonstrates effective semantic-guided feature learning.
Abstract
Local feature detection and description, which aim to detect and describe keypoints in "any scene" and for "any downstream task", play an important role in many computer vision tasks. Data-driven local feature learning methods rely on pixel-level correspondence for training, which is challenging to acquire at scale and thus hinders further improvements in performance. In this paper, we propose SAMFeat, which introduces SAM (Segment Anything Model), a foundation model trained on 11 million images, as a teacher to guide local feature learning and thus achieve higher performance on limited datasets. To do so, first, we construct an auxiliary task of Attention-weighted Semantic Relation Distillation (ASRD), which distills feature relations with category-agnostic semantic information learned by the SAM encoder into a local feature learning network, to improve local feature…
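The relation distillation idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the cosine-similarity relation, the L1 distance, and the optional `weights` argument (a stand-in for the attention weighting) are all assumptions made for illustration.

```python
import numpy as np

def cosine_similarity_matrix(features):
    """Pairwise cosine similarity for an (N, D) array of feature vectors."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    return unit @ unit.T

def relation_distillation_loss(student_feats, teacher_feats, weights=None):
    """Weighted mean absolute difference between two (N, N) relation matrices.

    Hypothetical helper: `weights` stands in for an attention map; the exact
    weighting used in the paper is not given in this summary.
    """
    s_rel = cosine_similarity_matrix(np.asarray(student_feats, dtype=float))
    t_rel = cosine_similarity_matrix(np.asarray(teacher_feats, dtype=float))
    diff = np.abs(s_rel - t_rel)
    if weights is None:
        return float(diff.mean())
    weights = np.asarray(weights, dtype=float)
    return float((weights * diff).sum() / weights.sum())

rng = np.random.default_rng(0)
teacher = rng.normal(size=(6, 32))  # stand-in for teacher (SAM encoder) features at 6 pixels
student = rng.normal(size=(6, 8))   # stand-in for student descriptors at the same pixels
loss = relation_distillation_loss(student, teacher)
```

Because only the N×N relation matrices are compared, the student and teacher features may have different channel dimensions, which is one reason relation-based distillation is convenient when the teacher is much larger than the student.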
Peer Reviews
Decision · Submitted to ICLR 2024
1. The authors proposed to leverage the foundational segmentation model SAM for local feature learning. As highlighted in the paper, this work is the first one that incorporates SAM for local feature learning by distilling the knowledge from SAM. 2. The authors proposed three techniques to transfer the fine-grained image understanding knowledge from SAM to the proposed local feature learning pipeline, which results in a new local feature detector called SAMFeat. 3. The experimental results on
1. The proposed method in this work is heuristic and incremental. Though the combination of all three techniques achieves the best performance, it is not clear how each of the heuristic techniques improves local feature learning and, in turn, the final performance. I would highly suggest the authors conduct a deeper study of how the proposed techniques contribute to the final performance. 2. It is not clear how much training overhead is incurred by adding the extra loss functions. For ex
(1) The authors integrate the strengths of existing frameworks, effectively utilizing the SAM foundation model and successfully distilling its knowledge into the network for local descriptor learning. It is a good paper for leveraging the knowledge of large models to enhance domain-specific tasks effectively. (2) The article is clearly written in most parts, enabling readers to quickly catch up on the core technical points. The proposed approach is quite reasonable. (3) The experimental results
(1) My first concern is about the novelty of the paper. It is commendable to leverage SAM to enhance model performance in corresponding tasks. However, acquiring structured information through SAM (PSRD), and using semantic grouping to construct positive and negative samples, thereby introducing contrastive learning, have already been briefly discussed in previous works (SFD2, TPR). From this, the paper is more like an integration of some schemes combined with the SAM model. Hence, its technica
1. This paper explored a way to release the power of the Segment Anything Model (SAM) for distillation of local features. It shows the potential of visual foundation models. 2. Experiment-wise, it reaches state-of-the-art results for image matching on the HPatches dataset and for visual localization on the Aachen V1.1 dataset. 3. The authors provide open-source code.
Although I believe in the soundness of the good results that the authors have demonstrated, a major issue that makes me skeptical is whether the contribution and novelty are substantial enough to warrant a full paper. Many of the techniques used in the paper are borrowed from others' implementations. For example, the Pixel Semantic Relational Distillation (PSRD) compares two similarity matrices, which is a widely used knowledge distillation loss [1]. Then the semantic grouping is from the origi
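The reviews describe using semantic grouping to construct positive and negative samples for contrastive learning. A rough sketch of that idea follows, assuming a simple margin-based pairwise loss. The function name, the `margin` parameter, and the Euclidean distance are illustrative choices, not the paper's formulation.

```python
import numpy as np

def grouping_contrastive_loss(descriptors, group_ids, margin=1.0):
    """Pull same-group descriptors together and push different groups apart.

    Hypothetical helper: `group_ids` stands in for per-pixel segment labels,
    such as those derived from segmentation masks.
    """
    descriptors = np.asarray(descriptors, dtype=float)
    group_ids = np.asarray(group_ids)
    # Pairwise Euclidean distances, shape (N, N).
    deltas = descriptors[:, None, :] - descriptors[None, :, :]
    dists = np.linalg.norm(deltas, axis=-1)
    same = group_ids[:, None] == group_ids[None, :]
    off_diag = ~np.eye(len(group_ids), dtype=bool)
    pos = same & off_diag  # same group, excluding self-pairs
    neg = ~same            # different groups
    pos_term = dists[pos].mean() if pos.any() else 0.0
    # Hinge: negatives only contribute while closer than the margin.
    neg_term = np.maximum(0.0, margin - dists[neg]).mean() if neg.any() else 0.0
    return float(pos_term + neg_term)

points = [[0.0, 0.0], [0.0, 0.0], [5.0, 5.0], [5.0, 5.0]]
well_grouped = grouping_contrastive_loss(points, [0, 0, 1, 1])
mixed_groups = grouping_contrastive_loss(points, [0, 1, 0, 1])
```

In this toy example the loss vanishes when the group labels agree with the descriptor clusters and grows when they conflict, which is the behaviour a grouping-based contrastive objective is meant to encourage.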
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Advanced Neural Network Applications
Methods: Segment Anything Model · Contrastive Learning
