MatchFormer: Interleaving Attention in Transformers for Feature Matching

Qing Wang; Jiaming Zhang; Kailun Yang; Kunyu Peng; Rainer Stiefelhagen

arXiv:2203.09645·cs.CV·September 27, 2022

TL;DR

MatchFormer introduces a hierarchical transformer that interleaves self- and cross-attention for feature extraction and matching, significantly improving efficiency and robustness in local feature matching tasks across various scenarios.

Contribution

The paper proposes a novel hierarchical extract-and-match transformer with interleaved attention mechanisms, enhancing feature matching efficiency and robustness, especially in low-texture scenes.

Findings

01. The lite MatchFormer variant runs 41% faster while using only 45% of the GFLOPs of the previous best method in indoor pose estimation.

02. Outperforms prior state-of-the-art methods on multiple benchmarks, including indoor and outdoor pose estimation.

03. Improves matching robustness in low-texture scenes and when training data is limited.

Abstract

Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a match-aware encoder relieves the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor…
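The interleaving idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain softmax dot-product attention with identity projections (no learned weights), a single stage, and no multi-scale downsampling. The names `attention`, `interleaved_stage`, and `pattern` are hypothetical and chosen only for this illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys_values):
    """Scaled dot-product attention; keys and values are the same tokens.

    Assumption: identity projections stand in for learned Q/K/V weights.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(d)]
        )
    return out


def add(a, b):
    """Element-wise sum of two token lists (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def interleaved_stage(feat_a, feat_b, pattern=("self", "cross")):
    """Apply self- and cross-attention in the order given by `pattern`.

    "self":  each image attends to its own tokens (feature extraction).
    "cross": each image attends to the other image's tokens (matching).
    """
    for kind in pattern:
        if kind == "self":
            new_a = add(feat_a, attention(feat_a, feat_a))
            new_b = add(feat_b, attention(feat_b, feat_b))
        else:
            new_a = add(feat_a, attention(feat_a, feat_b))
            new_b = add(feat_b, attention(feat_b, feat_a))
        feat_a, feat_b = new_a, new_b
    return feat_a, feat_b


# Toy inputs: 3 tokens from image A and 2 tokens from image B, 2 channels each.
tokens_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens_b = [[0.5, 0.5], [2.0, 0.0]]
out_a, out_b = interleaved_stage(tokens_a, tokens_b)
```

In this sketch the "self" step leaves each image's features independent of the other image, while the "cross" step makes them depend on it, which is the sense in which the encoder itself becomes match-aware rather than leaving all matching to the decoder.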

Code & Models

Repositories

jamycheung/matchformer (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques