TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification
Shengcai Liao, Ling Shao

TL;DR
TransMatcher is a Transformer-based image matching method for generalizable person re-identification. Its simplified decoder improves matching performance and generalization across datasets.
Contribution
The paper proposes a new simplified decoder for Transformers tailored for image matching, significantly improving generalizable person re-identification performance.
Findings
Achieves state-of-the-art results in generalizable person re-identification, with Rank-1 gains of up to 6.1%
Demonstrates the effectiveness of the simplified decoder for image matching
Shows better generalization across multiple datasets
Abstract
Transformers have recently gained increasing attention in computer vision. However, existing studies mostly use Transformers for feature representation learning, e.g. for image classification and dense predictions, and the generalizability of Transformers is unknown. In this work, we further investigate the possibility of applying Transformers for image matching and metric learning given pairs of images. We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention. Thus, we further design two naive solutions, i.e. query-gallery concatenation in ViT, and query-gallery cross-attention in the vanilla Transformer. The latter improves the performance, but it is still limited. This implies that the attention mechanism in Transformers is primarily designed for global feature aggregation,…
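The two naive solutions named in the abstract can be sketched on toy data. The following is a minimal illustration, not the authors' implementation: it assumes a single attention head, no learned projections, and hand-picked 2-D token vectors. The names `cross_attention`, `query_tokens`, and `gallery_tokens` are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(q_tokens, kv_tokens):
    """Scaled dot-product attention: each token in q_tokens attends over kv_tokens.

    Assumption: single head, identity projections (keys and values are the raw tokens).
    """
    dim = len(q_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in q_tokens:
        scores = [dot(q, k) * scale for k in kv_tokens]
        w = softmax(scores)
        out = [sum(w_j * v[d] for w_j, v in zip(w, kv_tokens)) for d in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy patch tokens for a query image and a gallery image (made-up values).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
gallery_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Naive solution 2: query-gallery cross-attention (query tokens attend to gallery tokens).
attended, attn = cross_attention(query_tokens, gallery_tokens)

# Naive solution 1: query-gallery concatenation, then ordinary self-attention
# over the joint token sequence, as a ViT-style encoder would do.
joint_tokens = query_tokens + gallery_tokens
joint_out, joint_attn = cross_attention(joint_tokens, joint_tokens)
```

In both variants the attention weights are used to aggregate gallery features into each query token, which is the global feature aggregation behaviour the abstract refers to.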
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Neural Network Applications · Human Pose and Action Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Vision Transformer · Max Pooling · Label Smoothing · Layer Normalization · Byte Pair Encoding
