Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition

Geo Ahn; Inwoong Lee; Taeoh Kim; Minho Shim; Dongyoon Wee; Jinwoo Choi

arXiv:2601.16211·cs.CV·April 8, 2026

TL;DR

This paper identifies object-driven shortcut learning in zero-shot compositional action recognition and proposes methods that push models to rely on temporal cues instead, improving generalization to unseen verb-object combinations.

Contribution

The paper introduces Robust COmpositional REpresentations (RCORE), a framework whose components, Co-occurrence Prior Regularization (CPR) and TORC, reduce shortcut learning and improve compositional generalization in zero-shot action recognition.

Findings

01. RCORE reduces object-driven shortcut behavior, as measured by the proposed diagnostic metrics.

02. RCORE improves generalization to unseen verb-object pairs.

03. Temporal cues are crucial for compositional action recognition.

Abstract

Zero-Shot Compositional Action Recognition (ZS-CAR) requires recognizing novel verb-object combinations composed of previously observed primitives. In this work, we tackle a key failure mode: models predict verbs via object-driven shortcuts (i.e., relying on the labeled object class) rather than temporal evidence. We argue that sparse compositional supervision and verb-object learning asymmetry can promote object-driven shortcut learning. Our analysis with proposed diagnostic metrics shows that existing methods overfit to training co-occurrence patterns and underuse temporal verb cues, resulting in weak generalization to unseen compositions. To address object-driven shortcuts, we propose Robust COmpositional REpresentations (RCORE) with two components. Co-occurrence Prior Regularization (CPR) adds explicit supervision for unseen compositions and regularizes the model against frequent…
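
To make the failure mode concrete, the following is a minimal sketch of an object-driven shortcut, not the paper's method or its diagnostic metrics. It assumes a hypothetical toy label set (`train_pairs`) and two illustrative helpers (`cooccurrence_prior`, `shortcut_verb`): a predictor that looks only at the object and returns the verb it most often co-occurred with in training. Such a predictor can never produce an unseen composition like "open drawer" when "drawer" appeared only with "close".

```python
from collections import Counter, defaultdict

# Hypothetical toy training labels (verb, object); not data from the paper.
train_pairs = [
    ("open", "door"), ("open", "door"), ("close", "door"),
    ("open", "box"),
    ("close", "drawer"), ("close", "drawer"), ("close", "drawer"),
]


def cooccurrence_prior(pairs):
    """Estimate P(verb | object) from training label counts."""
    counts = defaultdict(Counter)
    for verb, obj in pairs:
        counts[obj][verb] += 1
    return {
        obj: {verb: n / sum(verb_counts.values()) for verb, n in verb_counts.items()}
        for obj, verb_counts in counts.items()
    }


def shortcut_verb(prior, obj):
    """Predict the verb from the object alone, ignoring all temporal evidence."""
    return max(prior[obj], key=prior[obj].get)


prior = cooccurrence_prior(train_pairs)
seen = set(train_pairs)

# An unseen composition built from seen primitives: both "open" and
# "drawer" occur in training, but never together.
unseen_pair = ("open", "drawer")
prediction = shortcut_verb(prior, "drawer")

print(f"P(verb | drawer) = {prior['drawer']}")
print(f"shortcut prediction for a drawer clip: {prediction}")
print(f"true unseen composition: {unseen_pair}")
```

In this toy setting the shortcut answers "close" for every drawer clip regardless of the motion in the video, which is the kind of reliance on training co-occurrence patterns the abstract describes.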
