CATs++: Boosting Cost Aggregation with Convolutions and Transformers
Seokju Cho, Sunghwan Hong, Seungryong Kim

TL;DR
CATs++ introduces a novel transformer-based cost aggregation method for image matching that leverages global receptive fields, significantly improving robustness and accuracy over previous CNN-based approaches.
Contribution
This paper presents CATs++, an extension of CATs, combining transformers with architectural innovations to enhance cost aggregation in image matching, overcoming CNN limitations and reducing computational costs.
Findings
Outperforms previous state-of-the-art on PF-WILLOW, PF-PASCAL, and SPair-71k datasets.
Demonstrates significant accuracy improvements with extensive ablation studies.
Achieves robust matching under severe deformations.
Abstract
Cost aggregation is a highly important process in image matching tasks, which aims to disambiguate the noisy matching scores. Existing methods generally tackle this with hand-crafted or CNN-based methods, which either lack robustness to severe deformations or inherit the limitation of CNNs that fail to discriminate incorrect matches due to limited receptive fields and inadaptability. In this paper, we introduce Cost Aggregation with Transformers (CATs) to tackle this by exploring global consensus among the initial correlation map with the help of some architectural designs that allow us to fully enjoy the global receptive fields of the self-attention mechanism. Also, to alleviate some of the limitations that CATs may face, i.e., high computational costs induced by the use of a standard transformer whose complexity grows with the size of spatial and feature dimensions, which restrict its…
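The idea of refining a noisy correlation map with self-attention can be illustrated with a minimal NumPy sketch. This is not the CATs or CATs++ architecture: the function names, the single-head attention, the random projection weights (`w_q`, `w_k`, `w_v`), and the toy feature sizes are all assumptions chosen for illustration. It only shows the two steps the abstract names, namely building initial matching scores and letting every source position attend to every other one.

```python
import numpy as np

def correlation_map(src, tgt, eps=1e-8):
    """Cosine similarity between every source and target feature vector."""
    src_n = src / (np.linalg.norm(src, axis=1, keepdims=True) + eps)
    tgt_n = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + eps)
    return src_n @ tgt_n.T  # shape (n_src, n_tgt)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_aggregate(corr, w_q, w_k, w_v):
    """One single-head self-attention pass; each row of corr is a token."""
    q, k, v = corr @ w_q, corr @ w_k, corr @ w_v
    # Every source position attends to all others (global receptive field).
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))  # shape (n_src, n_src)
    return corr + attn @ v  # residual connection keeps the raw scores

# Toy inputs (hypothetical sizes): flattened feature maps of two images.
rng = np.random.default_rng(0)
n_src, n_tgt, channels, dim = 16, 16, 32, 16
src_feats = rng.standard_normal((n_src, channels))
tgt_feats = rng.standard_normal((n_tgt, channels))

corr = correlation_map(src_feats, tgt_feats)

# Randomly initialised projections stand in for learned weights.
w_q = rng.standard_normal((n_tgt, dim)) * 0.1
w_k = rng.standard_normal((n_tgt, dim)) * 0.1
w_v = rng.standard_normal((n_tgt, n_tgt)) * 0.1

refined = self_attention_aggregate(corr, w_q, w_k, w_v)
matches = refined.argmax(axis=1)  # best target index per source position
```

The attention matrix in this sketch has one entry per pair of source positions, so its size grows quadratically with the spatial resolution. That is the kind of computational cost of a standard transformer that the abstract refers to.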
Taxonomy
Topics: Advanced Neural Network Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
