MatchFormer: Interleaving Attention in Transformers for Feature Matching
Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, Rainer Stiefelhagen

TL;DR
MatchFormer is a hierarchical transformer that interleaves self-attention for feature extraction with cross-attention for feature matching inside each encoder stage, improving the efficiency and robustness of local feature matching.
Contribution
The paper proposes a novel hierarchical extract-and-match transformer with interleaved attention mechanisms, enhancing feature matching efficiency and robustness, especially in low-texture scenes.
Findings
Runs 41% faster while using only 45% of the GFLOPs of previous methods.
Outperforms state-of-the-art on multiple benchmarks including indoor and outdoor pose estimation.
Improves matching robustness in low-texture and limited training data scenarios.
Abstract
Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a match-aware encoder relieves the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor…
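The interleaving idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with residual connections, and the names `attention`, `interleaved_stage`, `make_params`, and the `pattern` argument are hypothetical, chosen only for illustration. It leaves out multi-head projections, normalization, and the multi-scale hierarchy that a real model would include.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query_src, context, w_q, w_k, w_v):
    # Single-head scaled dot-product attention (an assumption of this sketch).
    # Queries come from query_src; keys and values come from context.
    q = query_src @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def make_params(rng, dim, n_layers):
    # Hypothetical helper: one (w_q, w_k, w_v) triple of small random
    # weights per attention layer.
    return [
        tuple(0.1 * rng.standard_normal((dim, dim)) for _ in range(3))
        for _ in range(n_layers)
    ]


def interleaved_stage(feat_a, feat_b, params, pattern=("self", "cross")):
    # One encoder stage over the token features of two images.
    # "self":  each image attends to its own tokens (feature extraction).
    # "cross": each image attends to the other image's tokens (matching).
    for kind, (w_q, w_k, w_v) in zip(pattern, params):
        if kind == "self":
            new_a = feat_a + attention(feat_a, feat_a, w_q, w_k, w_v)
            new_b = feat_b + attention(feat_b, feat_b, w_q, w_k, w_v)
        else:
            new_a = feat_a + attention(feat_a, feat_b, w_q, w_k, w_v)
            new_b = feat_b + attention(feat_b, feat_a, w_q, w_k, w_v)
        feat_a, feat_b = new_a, new_b
    return feat_a, feat_b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 8
    tokens_a = rng.standard_normal((5, dim))  # 5 tokens from image A
    tokens_b = rng.standard_normal((7, dim))  # 7 tokens from image B
    out_a, out_b = interleaved_stage(tokens_a, tokens_b, make_params(rng, dim, 2))
    print(out_a.shape, out_b.shape)
```

With a self-only pattern, the features of one image never depend on the other image, so all matching would have to happen downstream. Adding a cross-attention layer lets the encoder features already carry information about the other view, which is the sense in which the abstract calls the encoder match-aware.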
Taxonomy
Topics: Advanced Neural Network Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
