Can masking background and object reduce static bias for zero-shot action recognition?

Takumi Fukuzawa; Kensho Hara; Hirokatsu Kataoka; Toru Tamaki

arXiv:2501.12681 · cs.CV · January 23, 2025

TL;DR

This paper investigates whether masking backgrounds and objects during training reduces static bias in CLIP-based zero-shot action recognition models, so that they rely less on static appearance and more on human actions. It reports improved results on Mimetics and SSv2.

Contribution

The paper applies masking of backgrounds and objects during training to mitigate static bias in CLIP-based zero-shot action recognition models and to strengthen their focus on human actions.

Findings

01. Masking background reduces static bias in Kinetics400.

02. Masking background improves performance on Mimetics.

03. Masking background and objects enhances SSv2 results.

Abstract

In this paper, we address the issue of static bias in zero-shot action recognition. Action recognition models need to represent the action itself, not the appearance. However, some fully-supervised works show that models often rely on static appearances, such as the background and objects, rather than human actions. This issue, known as static bias, has not been investigated for zero-shot. Although CLIP-based zero-shot models are now common, it remains unclear if they sufficiently focus on human actions, as CLIP primarily captures appearance features related to languages. In this paper, we investigate the influence of static bias in zero-shot action recognition with CLIP-based models. Our approach involves masking backgrounds, objects, and people differently during training and validation. Experiments with masking background show that models depend on background bias as their…
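The masking idea described in the abstract can be illustrated with a minimal sketch. It assumes that a per-frame binary mask for the region of interest (for example, the person) is already available from some segmentation step. The excerpt does not say how the paper obtains its masks or which fill value it uses, so the function name `mask_regions`, the `keep_region` flag, and the constant `fill_value` are hypothetical choices made for illustration only, not the authors' implementation.

```python
import numpy as np


def mask_regions(frames, region_mask, keep_region=True, fill_value=0):
    """Blank out part of every frame of a video clip.

    frames:      array of shape (T, H, W, C), e.g. uint8 RGB frames.
    region_mask: boolean array of shape (T, H, W); True marks the region
                 (assumed here to come from an external segmentation step).
    keep_region: True  -> keep the region, fill everything else
                          (e.g. keep the person, mask the background);
                 False -> fill the region, keep everything else
                          (e.g. mask an object or the person).
    fill_value:  constant written into masked pixels (an assumption).
    """
    frames = np.asarray(frames)
    region_mask = np.asarray(region_mask, dtype=bool)
    if frames.ndim != 4 or frames.shape[:3] != region_mask.shape:
        raise ValueError("expected frames (T, H, W, C) and mask (T, H, W)")

    # Pixels to keep; broadcast the mask over the channel axis.
    keep = region_mask if keep_region else ~region_mask
    masked = np.where(keep[..., None], frames, fill_value)
    return masked.astype(frames.dtype)


if __name__ == "__main__":
    # Toy clip: 2 frames of 4x4 RGB, with a 2x2 "person" in the centre.
    clip = np.full((2, 4, 4, 3), 200, dtype=np.uint8)
    person = np.zeros((2, 4, 4), dtype=bool)
    person[:, 1:3, 1:3] = True

    background_masked = mask_regions(clip, person, keep_region=True)
    person_masked = mask_regions(clip, person, keep_region=False)
    print(background_masked[0, :, :, 0])
    print(person_masked[0, :, :, 0])
```

Applying the same function with different masks at training and validation time mirrors the setup the abstract describes, where backgrounds, objects, and people are masked differently in the two phases.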

Taxonomy

Methods: Contrastive Language-Image Pre-training · Focus