I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification

Muhammad Ferjad Naeem; Yongqin Xian; Luc Van Gool; Federico Tombari

arXiv:2209.10304·cs.CV·September 22, 2022·21 cites

TL;DR

I2DFormer introduces a transformer-based framework that leverages online textual documents to improve zero-shot image classification by aligning images and documents in a shared space, enhancing interpretability and performance.

Contribution

The paper proposes a novel cross-modal attention module and a transformer-based approach that jointly encodes images and documents for improved zero-shot learning without human-annotated attributes.
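To make the idea of image-to-document cross-modal attention concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes patch and word features already live in the same dimension, omits the learned query/key/value projections a real transformer would use, and the function name `image_to_document_attention` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def image_to_document_attention(patches, words):
    """Each image patch (query) attends over all document words (keys/values).

    Assumption: no learned projections, features are used as-is.
    Returns (attended, weights), where weights[i][j] is how strongly
    patch i attends to word j.
    """
    dim = len(words[0])
    attended, weights = [], []
    for patch in patches:
        # Scaled dot-product scores between this patch and every word.
        row = softmax([dot(patch, word) / math.sqrt(dim) for word in words])
        weights.append(row)
        # Weighted sum of word features for this patch.
        attended.append(
            [sum(w * word[k] for w, word in zip(row, words)) for k in range(dim)]
        )
    return attended, weights


# Toy, made-up features: two patches and two document words.
patches = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0]]
attended, weights = image_to_document_attention(patches, words)
```

In this sketch the attention weights are what would link a document word to an image region: a large `weights[i][j]` means patch `i` is strongly associated with word `j`.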

Findings

01 Outperforms previous unsupervised semantic embeddings on ZSL tasks

02 Learns discriminative document embeddings that capture visual similarities

03 Enables grounding of document words in image regions

Abstract

Despite the tremendous progress in zero-shot learning (ZSL), the majority of existing methods still rely on human-annotated attributes, which are difficult to annotate and scale. An unsupervised alternative is to represent each class using the word embedding associated with its semantic class name. However, word embeddings extracted from pre-trained language models do not necessarily capture visual similarities, resulting in poor zero-shot performance. In this work, we argue that online textual documents, e.g., Wikipedia, contain rich visual descriptions of object classes and can therefore be used as powerful unsupervised side information for ZSL. To this end, we propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents by aligning both modalities in a shared embedding space. In order to distill discriminative visual words from noisy…
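Once images and class documents are embedded in a shared space, zero-shot prediction can be sketched as picking the class whose document embedding is closest to the image embedding. The snippet below is an illustration only: the abstract does not state the scoring function, so cosine similarity is an assumption, and the names `predict_class` and `class_doc_embs` as well as the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def predict_class(image_emb, class_doc_embs):
    """Return the class name whose document embedding best matches the image.

    class_doc_embs maps class name -> document embedding. Unseen classes
    need only a document, not labeled images, which is the ZSL setting.
    """
    return max(class_doc_embs, key=lambda name: cosine(image_emb, class_doc_embs[name]))


# Made-up 2-D embeddings for two classes described only by text.
class_doc_embs = {"zebra": [1.0, 0.0], "horse": [0.0, 1.0]}
prediction = predict_class([0.9, 0.1], class_doc_embs)
```

The design point is that adding a new class requires only encoding its document, with no change to the image encoder.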

Videos

I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification · SlidesLive

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · COVID-19 diagnosis using AI