TL;DR
This paper introduces a unified deep learning framework that jointly learns local features and aggregates them into compact global representations for image retrieval, improving efficiency and accuracy.
Contribution
The proposed method combines feature learning and aggregation into an end-to-end trainable model using visual tokens and attention mechanisms, advancing image retrieval techniques.
Findings
Outperforms state-of-the-art methods on the Revisited Oxford and Paris datasets
Generates compact global representations with regional matching capability
Efficiently combines feature learning and aggregation in a unified framework
Abstract
In image retrieval, deep local features learned in a data-driven manner have proven effective at improving retrieval performance. To enable efficient retrieval on large image databases, some approaches quantize deep local features with a large codebook and match images with an aggregated match kernel. However, these approaches are computationally complex and have a large memory footprint, which limits their ability to perform feature learning and aggregation jointly. To generate compact global representations while maintaining regional matching capability, we propose a unified framework that jointly learns local feature representation and aggregation. In our framework, we first extract deep local features using CNNs. Then, we design a tokenizer module to aggregate them into a few visual tokens, each corresponding to a specific visual pattern. This helps to remove background noise,…
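The tokenizer step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each visual token is an attention-weighted sum of the local features, with one hypothetical weight vector per visual pattern (`pattern_weights`) and a softmax over spatial locations. The final concatenation and L2 normalization into a single descriptor are also assumptions made for illustration, and all function names are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tokenize(features, pattern_weights):
    """Aggregate N local features into L visual tokens (illustrative only).

    features:        list of N vectors of length C (e.g. a flattened CNN map)
    pattern_weights: list of L vectors of length C; hypothetical stand-ins
                     for learned parameters, one per visual pattern
    """
    dim = len(features[0])
    tokens = []
    for weights in pattern_weights:
        # Score every spatial location against this visual pattern.
        scores = [sum(w * x for w, x in zip(weights, f)) for f in features]
        # Attention over locations: low-scoring (e.g. background) locations
        # contribute little to the token.
        attn = softmax(scores)
        token = [sum(a * f[d] for a, f in zip(attn, features)) for d in range(dim)]
        tokens.append(token)
    return tokens


def global_descriptor(tokens):
    """Concatenate tokens and L2-normalize (an assumption, for illustration)."""
    flat = [v for token in tokens for v in token]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]


# Toy usage: 49 locations (a 7x7 map), 8 channels, 4 visual tokens.
random.seed(0)
local_features = [[random.gauss(0, 1) for _ in range(8)] for _ in range(49)]
pattern_weights = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
visual_tokens = tokenize(local_features, pattern_weights)
descriptor = global_descriptor(visual_tokens)
```

Because each token is a convex combination of the local features, it stays tied to the image regions that scored highly for its pattern, which is one way to read the "regional matching capability" mentioned above. In a trained model the weights would be learned end to end together with the CNN rather than sampled at random.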
