GraFT: Gradual Fusion Transformer for Multimodal Re-Identification
Haoli Yin, Jiayao Li (Emily), Eva Schiller, Luke McDermott, Daniel Cummings

TL;DR
GraFT is a novel transformer-based model for multimodal object re-identification that uses learnable fusion tokens and a new training paradigm to improve feature integration and scalability across multiple modalities.
Contribution
Introduces GraFT, a transformer with learnable fusion tokens and an augmented triplet loss, enhancing multimodal ReID performance and scalability.
Findings
Outperforms established methods on multimodal ReID benchmarks.
Effective in capturing both modality-specific and object-specific features.
Pruning maintains performance while reducing model size.
Abstract
Object Re-Identification (ReID) is pivotal in computer vision, witnessing an escalating demand for adept multimodal representation learning. Current models, although promising, reveal scalability limitations with increasing modalities as they rely heavily on late fusion, which postpones the integration of specific modality insights. Addressing this, we introduce the Gradual Fusion Transformer (GraFT) for multimodal ReID. At its core, GraFT employs learnable fusion tokens that guide self-attention across encoders, adeptly capturing both modality-specific and object-specific features. Further bolstering its efficacy, we introduce a novel training paradigm combined with an augmented triplet loss, optimizing the ReID feature embedding space. We demonstrate these enhancements through extensive ablation studies and show that GraFT consistently surpasses established multimodal ReID…
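The abstract mentions an augmented triplet loss but does not spell out the augmentation. As background only, here is a minimal sketch of the standard triplet margin loss that such variants typically build on. It is not the paper's loss; the function names, the margin of 0.3, and the toy embeddings are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    This is the plain textbook form, not GraFT's augmented variant.
    The margin value is an arbitrary illustrative choice.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy 2-D embeddings (made-up values, for illustration only).
anchor = [0.0, 0.0]
positive = [3.0, 4.0]   # same identity, distance 5.0 from the anchor
negative = [6.0, 8.0]   # different identity, distance 10.0 from the anchor

# The negative is already farther away than the positive by more than the margin.
easy = triplet_loss(anchor, positive, negative)

# Roles swapped: the "positive" is now farther than the "negative", so the loss is positive.
hard = triplet_loss(anchor, negative, positive)
```

The loss is zero once the same-identity embedding is closer to the anchor than the different-identity embedding by at least the margin. That is the sense in which a triplet objective shapes a ReID embedding space.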
Taxonomy
Topics: Adversarial Robustness in Machine Learning · COVID-19 diagnosis using AI · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Residual Connection · Byte Pair Encoding · Adam · Position-Wise Feed-Forward Layer · Dropout · Layer Normalization
