CICA: Content-Injected Contrastive Alignment for Zero-Shot Document Image Classification
Sankalp Sinha, Muhammad Saif Ullah Khan, Talha Uddin Sheikh, Didier Stricker, Muhammad Zeshan Afzal

TL;DR
This paper introduces CICA, a framework that improves CLIP's zero-shot document image classification by injecting document-specific textual information. It raises ZSL top-1 accuracy by 6.7% and the GZSL harmonic mean by 24% while adding only 3.3% more parameters to CLIP.
Contribution
We propose CICA, a content-injected contrastive alignment framework that improves zero-shot document image classification by incorporating a new content module and a coupled-contrastive loss.
Findings
CICA improves CLIP's ZSL top-1 accuracy by 6.7%.
CICA increases GZSL harmonic mean by 24%.
The module adds only 3.3% more parameters to CLIP.
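For readers unfamiliar with the GZSL metric, the harmonic mean is conventionally computed from the accuracy on seen classes and the accuracy on unseen classes, so that a model cannot score well by favoring only one group. The sketch below is a minimal illustration of that standard formula, H = 2·S·U / (S + U); the function name and the example accuracies are hypothetical and are not taken from the paper.

```python
def gzsl_harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean of seen-class and unseen-class accuracy.

    H = 2 * S * U / (S + U), the usual summary metric in
    Generalized Zero-Shot Learning. Inputs are fractions in [0, 1].
    """
    if not (0.0 <= seen_acc <= 1.0 and 0.0 <= unseen_acc <= 1.0):
        raise ValueError("accuracies must lie in [0, 1]")
    total = seen_acc + unseen_acc
    if total == 0.0:
        # Both accuracies are zero; define H as 0 to avoid division by zero.
        return 0.0
    return 2.0 * seen_acc * unseen_acc / total


if __name__ == "__main__":
    # Illustrative numbers only (not results from the paper):
    # a model strong on seen classes but weak on unseen ones.
    seen, unseen = 0.8, 0.2
    print(f"H = {gzsl_harmonic_mean(seen, unseen):.2f}")
```

Because the harmonic mean is dominated by the smaller of the two accuracies, a gain in unseen-class accuracy moves H much more than the same gain in seen-class accuracy when the model is biased toward seen classes.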
Abstract
Zero-shot learning has been extensively investigated in the broader field of visual recognition, attracting significant interest recently. However, the current work on zero-shot learning in document image classification remains scarce. The existing studies either focus exclusively on zero-shot inference, or their evaluation does not align with the established criteria of zero-shot evaluation in the visual recognition domain. We provide a comprehensive document image classification analysis in Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL) settings to address this gap. Our methodology and evaluation align with the established practices of this domain. Additionally, we propose zero-shot splits for the RVL-CDIP dataset. Furthermore, we introduce CICA (pronounced 'ki-ka'), a framework that enhances the zero-shot learning capabilities of CLIP. CICA consists of a novel…
Taxonomy
Topics: Handwritten Text Recognition Techniques
Methods: ALIGN · Focus · Contrastive Language-Image Pre-training
