CICA: Content-Injected Contrastive Alignment for Zero-Shot Document Image Classification

Sankalp Sinha; Muhammad Saif Ullah Khan; Talha Uddin Sheikh; Didier Stricker; Muhammad Zeshan Afzal

arXiv:2405.03660·cs.CV·May 7, 2024

Open Access

TL;DR

This paper introduces CICA, a novel framework that enhances CLIP's zero-shot document image classification by leveraging document-specific textual information, achieving significant accuracy improvements with minimal additional parameters.

Contribution

We propose CICA, a content-injected contrastive alignment framework that improves zero-shot document image classification by incorporating a new content module and a coupled-contrastive loss.

Findings

01. CICA improves CLIP's ZSL top-1 accuracy by 6.7%.

02. CICA increases the GZSL harmonic mean by 24%.

03. The content module adds only 3.3% more parameters to CLIP.
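
The GZSL harmonic mean in the findings is, by the usual convention in the zero-shot literature, the harmonic mean of top-1 accuracy on seen classes and on unseen classes; it rewards a model only when both are high. The sketch below shows that conventional definition. The function name and the example accuracies are illustrative assumptions, not values or code from the paper.

```python
def gzsl_harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean H = 2 * S * U / (S + U) of seen and unseen accuracy.

    Conventional GZSL summary metric; the name is hypothetical and the
    inputs are expected on the same scale (e.g. both in [0, 1]).
    """
    total = seen_acc + unseen_acc
    if total == 0:
        # Both accuracies are zero, so the harmonic mean is defined as zero.
        return 0.0
    return 2 * seen_acc * unseen_acc / total


# Illustrative numbers only (not results from the paper): a model that is
# strong on seen classes but weak on unseen ones gets a low score.
balanced = gzsl_harmonic_mean(0.5, 0.5)
skewed = gzsl_harmonic_mean(0.8, 0.2)
print(f"balanced: {balanced:.2f}, skewed: {skewed:.2f}")
```

Both example models have the same arithmetic mean accuracy, yet the skewed one scores lower, which is why the harmonic mean is preferred over a plain average for GZSL.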

Abstract

Zero-shot learning has been extensively investigated in the broader field of visual recognition, attracting significant interest recently. However, the current work on zero-shot learning in document image classification remains scarce. The existing studies either focus exclusively on zero-shot inference, or their evaluation does not align with the established criteria of zero-shot evaluation in the visual recognition domain. We provide a comprehensive document image classification analysis in Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL) settings to address this gap. Our methodology and evaluation align with the established practices of this domain. Additionally, we propose zero-shot splits for the RVL-CDIP dataset. Furthermore, we introduce CICA (pronounced 'ki-ka'), a framework that enhances the zero-shot learning capabilities of CLIP. CICA consists of a novel…

Taxonomy

Topics: Handwritten Text Recognition Techniques

Methods: ALIGN · Focus · Contrastive Language-Image Pre-training