Can masking background and object reduce static bias for zero-shot action recognition?
Takumi Fukuzawa, Kensho Hara, Hirokatsu Kataoka, Toru Tamaki

TL;DR
This paper investigates how masking backgrounds and objects during training can reduce static bias in zero-shot action recognition models, leading to improved focus on human actions and better performance across datasets.
Contribution
The paper introduces a masking approach, applied during training, to mitigate static bias in CLIP-based zero-shot action recognition models and sharpen their focus on human actions.
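To make the masking idea concrete, here is a minimal sketch of how background, object, or person regions of a single frame could be hidden given segmentation masks. It is an illustration only, not the authors' implementation: the function name `mask_regions`, the mode names, the fill value, and the assumption that boolean person and object masks are already available (for example from an off-the-shelf segmentation model) are all hypothetical.

```python
import numpy as np


def mask_regions(frame, person_mask, object_mask, mode, fill_value=0):
    """Return a copy of `frame` with the regions selected by `mode` filled in.

    frame:       (H, W, 3) uint8 image.
    person_mask: (H, W) bool array, True where a person is visible.
    object_mask: (H, W) bool array, True where an object is visible.
    mode:        hypothetical names chosen for this sketch:
                 "background", "object", "background+object", or "person".
    """
    if mode == "background":
        # Hide everything that is neither a person nor an object.
        hide = ~(person_mask | object_mask)
    elif mode == "object":
        # Hide object pixels, but never pixels that belong to a person.
        hide = object_mask & ~person_mask
    elif mode == "background+object":
        # Keep only the person.
        hide = ~person_mask
    elif mode == "person":
        # Hide the person and keep the static context.
        hide = person_mask
    else:
        raise ValueError(f"unknown mode: {mode}")

    out = frame.copy()
    out[hide] = fill_value  # boolean (H, W) index fills all three channels
    return out


# Toy example: a 4x4 white frame with a person in the centre and an object in a corner.
frame = np.full((4, 4, 3), 255, dtype=np.uint8)
person_mask = np.zeros((4, 4), dtype=bool)
person_mask[1:3, 1:3] = True
object_mask = np.zeros((4, 4), dtype=bool)
object_mask[0, 0] = True

person_only = mask_regions(frame, person_mask, object_mask, "background+object")
```

Applying different modes at training and validation time, as the abstract describes, would amount to calling such a function with different `mode` values in the two data pipelines.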
Findings
Masking background reduces static bias in Kinetics400.
Masking background improves performance on Mimetics.
Masking background and objects enhances SSv2 results.
Abstract
In this paper, we address the issue of static bias in zero-shot action recognition. Action recognition models need to represent the action itself, not the appearance. However, some fully supervised works show that models often rely on static appearance, such as the background and objects, rather than on human actions. This issue, known as static bias, has not been investigated for zero-shot action recognition. Although CLIP-based zero-shot models are now common, it remains unclear whether they focus sufficiently on human actions, as CLIP primarily captures appearance features related to language. We investigate the influence of static bias in zero-shot action recognition with CLIP-based models. Our approach involves masking backgrounds, objects, and people differently during training and validation. Experiments with masking background show that models depend on background bias as their…
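For readers unfamiliar with how CLIP-style models classify actions without labeled training examples, the following sketch shows the general scoring step: frame embeddings are pooled into one video embedding, which is compared by cosine similarity against one text embedding per class name. The embeddings here are toy placeholders, the function name `zero_shot_predict` is hypothetical, and mean pooling over frames is an assumption; the models studied in the paper may aggregate frames differently.

```python
import numpy as np


def zero_shot_predict(frame_embeddings, text_embeddings):
    """Pick the class whose text embedding is closest to the video embedding.

    frame_embeddings: (T, D) array, one embedding per sampled frame.
    text_embeddings:  (C, D) array, one embedding per class-name prompt.
    Returns (predicted class index, cosine similarity per class).
    """
    # Assumption for this sketch: average the frame embeddings into one video vector.
    video = frame_embeddings.mean(axis=0)
    video = video / np.linalg.norm(video)

    # L2-normalise each text embedding so the dot product is a cosine similarity.
    text = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)

    scores = text @ video
    return int(np.argmax(scores)), scores


# Toy 2-D embeddings standing in for real encoder outputs.
frames = np.array([[1.0, 0.0], [0.8, 0.2]])
class_texts = np.array([[0.0, 1.0], [1.0, 0.0]])
pred, scores = zero_shot_predict(frames, class_texts)
```

Because the text side carries the class definition, static bias in this setting would show up as video embeddings that match a class prompt mainly through background or object appearance rather than through the motion itself.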
Taxonomy
Methods: Contrastive Language-Image Pre-training · Focus
