Other Tokens Matter: Exploring Global and Local Features of Vision Transformers for Object Re-Identification

Yingquan Wang; Pingping Zhang; Dong Wang; Huchuan Lu

arXiv:2404.14985 · cs.CV · April 24, 2024

TL;DR

This paper investigates the roles of global and local features in Vision Transformers for object Re-Identification, proposing a novel model that effectively combines these features to improve re-identification accuracy.

Contribution

The paper introduces a Global-Local Transformer (GLTrans) with a Global Aggregation Encoder and Local Multi-layer Fusion to enhance feature representation for object Re-ID.

Findings

01

Achieves superior performance on four Re-ID benchmarks.

02

Global and local features mutually enhance each other.

03

Features from the last few Transformer layers are highly representative.

Abstract

Object Re-Identification (Re-ID) aims to identify and retrieve specific objects from images captured at different places and times. Recently, object Re-ID has achieved great success with advances in Vision Transformers (ViT). However, the effects of the global-local relation have not been fully explored in Transformers for object Re-ID. In this work, we first explore the influence of the global and local features of ViT and then propose a novel Global-Local Transformer (GLTrans) for high-performance object Re-ID. We find that the features from the last few layers of ViT already have strong representational ability, and that global and local information can mutually enhance each other. Based on this finding, we propose a Global Aggregation Encoder (GAE) to utilize the class tokens of the last few Transformer layers and learn comprehensive global features effectively. Meanwhile, we…
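
The abstract states that the GAE draws on the class tokens of the last few Transformer layers. As a rough illustration of that idea only, the sketch below collects the class token from each of the last `k` layers and averages them into one global feature. The mean pooling, the function name `aggregate_class_tokens`, the parameter `k`, and the toy data are all assumptions made for this example; the abstract does not specify how the GAE combines the tokens, so this is not the authors' encoder.

```python
from typing import List

Vector = List[float]
LayerTokens = List[Vector]  # tokens of one layer; index 0 is assumed to be the class token


def aggregate_class_tokens(layers: List[LayerTokens], k: int) -> Vector:
    """Average the class tokens of the last k layers.

    Mean pooling is a simple stand-in chosen for illustration,
    not the aggregation used by the paper's GAE.
    """
    if not 1 <= k <= len(layers):
        raise ValueError("k must be between 1 and the number of layers")
    # Take the class token (index 0) from each of the last k layers.
    class_tokens = [layer[0] for layer in layers[-k:]]
    dim = len(class_tokens[0])
    # Element-wise mean across the selected class tokens.
    return [sum(tok[d] for tok in class_tokens) / k for d in range(dim)]


# Hypothetical toy input: 3 layers, each holding a class token
# followed by 2 patch tokens, all of dimension 2.
toy_layers = [
    [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]],
    [[2.0, 4.0], [0.0, 0.0], [0.0, 0.0]],
    [[4.0, 8.0], [0.0, 0.0], [0.0, 0.0]],
]
global_feature = aggregate_class_tokens(toy_layers, k=2)
```

With `k=2`, only the class tokens of the second and third layers contribute, so the first layer's token is ignored. A learned aggregation would replace the plain average in practice.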

Taxonomy

Topics: 3D Surveying and Cultural Heritage

Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Dropout · Dense Connections · Label Smoothing · Residual Connection · Softmax · Adam