Anticipating Future Object Compositions without Forgetting
Youssef Zahran, Gertjan Burghouts, Yke Bauke Eisma

TL;DR
This paper advances compositional zero-shot learning (CZSL) in object detection by incorporating Compositional Soft Prompting into Grounding DINO and adding Compositional Anticipation and Contrastive Prompt Tuning, improving generalization to novel object-attribute compositions without forgetting prior knowledge.
Contribution
It introduces a framework that combines Compositional Soft Prompting, Compositional Anticipation, and Contrastive Prompt Tuning to enhance CZSL in object detection without forgetting prior knowledge.
Findings
70.5% improvement over CSP on the harmonic mean (HM) between seen and unseen compositions on CLEVR
14.5% increase in HM across the pretrain, increment, and unseen sets
Effective learning of compositions with limited data
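The harmonic mean (HM) behind the figures above rewards balanced performance: a model that does well on seen compositions but poorly on unseen ones scores lower than its arithmetic average would suggest. The sketch below shows the standard formula. The function name and all accuracy values are hypothetical illustrations, not results from the paper, and applying the same formula to three sets (pretrain, increment, unseen) is an assumption about how the paper aggregates them.

```python
def harmonic_mean(*scores: float) -> float:
    """Harmonic mean of one or more non-negative scores.

    Returns 0.0 if any score is zero, since the harmonic mean
    is dominated by the weakest component.
    """
    if not scores:
        raise ValueError("at least one score is required")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Hypothetical accuracies (NOT taken from the paper).
balanced = harmonic_mean(0.60, 0.60)      # seen and unseen equally good
skewed = harmonic_mean(0.90, 0.30)        # same arithmetic mean, unbalanced
three_way = harmonic_mean(0.8, 0.8, 0.8)  # assumed three-set variant

print(f"balanced:  {balanced:.2f}")
print(f"skewed:    {skewed:.2f}")
print(f"three-way: {three_way:.2f}")
```

Both two-score examples have an arithmetic mean of 0.60, but the unbalanced one drops to 0.45 under HM, which is why HM is a common summary metric when seen and unseen performance must both be high.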
Abstract
Despite the significant advancements in computer vision models, their ability to generalize to novel object-attribute compositions remains limited. Existing methods for Compositional Zero-Shot Learning (CZSL) mainly focus on image classification. This paper aims to enhance CZSL in object detection without forgetting prior learned knowledge. We use Grounding DINO and incorporate Compositional Soft Prompting (CSP) into it and extend it with Compositional Anticipation. We achieve a 70.5% improvement over CSP on the harmonic mean (HM) between seen and unseen compositions on the CLEVR dataset. Furthermore, we introduce Contrastive Prompt Tuning to incrementally address model confusion between similar compositions. We demonstrate the effectiveness of this method and achieve an increase of 14.5% in HM across the pretrain, increment, and unseen sets. Collectively, these methods provide a…
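To make the idea of composing prompts for unseen object-attribute pairs concrete, here is a toy sketch of the general principle behind compositional soft prompting: attributes and objects each get their own learnable embedding, and a prompt for any pair, including one never seen in training, is assembled from those parts. This is an illustration under stated assumptions, not the paper's Grounding DINO integration. The vocabulary, the embedding size, the prefix "a photo of", and all names (`compose_prompt`, `attr_table`, `obj_table`) are hypothetical, and no training step is shown.

```python
import random

EMBED_DIM = 4  # toy embedding size (assumption for illustration only)
rng = random.Random(0)


def init_embedding() -> list:
    """Random vector standing in for a token embedding."""
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


# Hypothetical vocabulary in the spirit of CLEVR-style attributes and shapes.
attributes = ["red", "blue"]
objects = ["cube", "sphere"]
prefix_tokens = ["a", "photo", "of"]

# The prefix stays fixed; attribute and object embeddings are the parts
# that would be tuned on seen compositions.
frozen_prefix = [init_embedding() for _ in prefix_tokens]
attr_table = {a: init_embedding() for a in attributes}
obj_table = {o: init_embedding() for o in objects}


def compose_prompt(attr: str, obj: str) -> list:
    """Assemble the prompt sequence: prefix + [attribute] + [object]."""
    return frozen_prefix + [attr_table[attr], obj_table[obj]]


# Split all pairs into seen (used for tuning) and unseen (held out).
seen_pairs = {("red", "cube"), ("blue", "sphere")}
all_pairs = {(a, o) for a in attributes for o in objects}
unseen_pairs = sorted(all_pairs - seen_pairs)

# An unseen pair reuses embeddings that were each tuned on a seen pair.
unseen_prompt = compose_prompt("red", "sphere")
print("unseen pairs:", unseen_pairs)
print("prompt length:", len(unseen_prompt))
```

The design choice this highlights is that the number of learned vectors grows with the number of attributes plus objects, not with the number of pairs, so every unseen combination already has a prompt available at test time.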
Taxonomy
Topics: Space Science and Extraterrestrial Life
Methods: Attention Is All You Need · Softmax · Residual Connection · Layer Normalization · Focus · Linear Layer · Multi-Head Attention · Dense Connections · Vision Transformer · self-DIstillation with NO labels
