Conditional Cross Attention Network for Multi-Space Embedding without Entanglement in Only a SINGLE Network

Chull Hwan Song; Taebaek Hwang; Jooyoung Yoon; Shunghyun Choi; Yeong Hyeon Gu

arXiv:2307.13254·cs.CV·July 26, 2023

TL;DR

This paper introduces a Conditional Cross-Attention Network that creates disentangled multi-space embeddings for multiple object attributes within a single model, enhancing fine-grained image retrieval.

Contribution

It proposes a novel cross-attention mechanism combined with vision transformers to achieve attribute disentanglement in a unified network, improving robustness across datasets.

Findings

1. Achieved state-of-the-art results on multiple benchmark datasets.

2. Demonstrated effective attribute disentanglement through visualization.

3. First application of vision transformers to fine-grained image retrieval.

Abstract

Many studies in vision tasks have aimed to create effective embedding spaces for single-label object prediction within an image. However, in reality, most objects possess multiple specific attributes, such as shape, color, and length, with each attribute composed of various classes. To apply models in real-world scenarios, it is essential to be able to distinguish between the granular components of an object. Conventional approaches to embedding multiple specific attributes into a single network often result in entanglement, where fine-grained features of each attribute cannot be identified separately. To address this problem, we propose a Conditional Cross-Attention Network that induces disentangled multi-space embeddings for various specific attributes with only a single backbone. Firstly, we employ a cross-attention mechanism to fuse and switch the information of conditions (specific…
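The cross-attention idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes, as one plausible reading, that the query comes from a condition (attribute) embedding while the keys and values come from the image patch tokens of a shared backbone, so the same tokens yield a different embedding per attribute. The function name `conditional_cross_attention`, the attribute names, the dimensions, and the random weights are all hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Stand-in for a learned projection matrix (random values, illustration only)."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def conditional_cross_attention(condition_vec, patch_tokens, w_q, w_k, w_v):
    """Single-head cross-attention (assumed layout, not the paper's exact design).

    The query is projected from the condition embedding; keys and values are
    projected from the image patch tokens.
    """
    d = len(w_q)  # output dimension of the projections
    q = matvec(w_q, condition_vec)
    keys = [matvec(w_k, t) for t in patch_tokens]
    values = [matvec(w_v, t) for t in patch_tokens]
    # Scaled dot-product score between the condition query and each patch key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    # Attribute-specific embedding: attention-weighted sum of the patch values.
    embedding = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]
    return embedding, weights


rng = random.Random(0)
dim, num_patches = 4, 6

# Hypothetical condition embeddings, one per attribute.
conditions = {
    "color": [rng.gauss(0.0, 1.0) for _ in range(dim)],
    "shape": [rng.gauss(0.0, 1.0) for _ in range(dim)],
}
# Stand-in for patch tokens produced by a single shared backbone.
patch_tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_patches)]
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))

embeddings, attention = {}, {}
for name, cond in conditions.items():
    embeddings[name], attention[name] = conditional_cross_attention(
        cond, patch_tokens, w_q, w_k, w_v
    )
    print(name, [round(x, 3) for x in embeddings[name]])
```

In this sketch only the query changes between attributes, which is why one set of backbone tokens can be read out into separate embedding spaces. A trained model would learn the projections and condition embeddings rather than sample them.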

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Residual Connection · Linear Layer · Softmax · Layer Normalization · Dense Connections · Vision Transformer