DENOISER: Rethinking the Robustness for Open-Vocabulary Action Recognition

Haozhe Cheng; Cheng Ju; Haicheng Wang; Jinxiang Liu; Mengting Chen; Qiang Hu; Xiaoyun Zhang; Yanfeng Wang

arXiv:2404.14890·cs.CV·April 24, 2024

Open Access

TL;DR

This paper addresses the vulnerability of open-vocabulary action recognition models to noisy class descriptions by proposing a novel denoising framework that enhances robustness through iterative generation and discrimination, improving real-world applicability.

Contribution

The paper introduces DENOISER, a framework that denoises noisy class descriptions and improves the robustness of open-vocabulary action recognition (OVAR), filling a gap in handling noisy class descriptions supplied by users.

Findings

01  DENOISER significantly improves robustness against noisy class descriptions.

02  The iterative generation and discrimination process enhances classification accuracy.

03  Extensive experiments validate the effectiveness of the proposed method across datasets.

Abstract

As one of the fundamental video tasks in computer vision, Open-Vocabulary Action Recognition (OVAR) has recently gained increasing attention with the development of vision-language pre-training. To enable generalization to arbitrary classes, existing methods treat class labels as text descriptions, then formulate OVAR as evaluating the embedding similarity between visual samples and textual classes. However, one crucial issue has been completely ignored: the class descriptions given by users may be noisy, e.g., containing misspellings and typos, which limits the real-world practicality of vanilla OVAR. To fill this research gap, this paper is the first to evaluate existing methods by simulating multi-level noise of various types, and it reveals their poor robustness. To tackle the noisy OVAR task, we further propose a novel DENOISER framework consisting of two parts: generation and discrimination. Concretely, the…
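The two ideas in the abstract that lend themselves to a concrete illustration are the simulation of noisy class descriptions and the similarity-based formulation of OVAR. The following is a minimal sketch, not the paper's implementation: the function names (add_char_noise, cosine, classify), the three character-level edit types, and the noise rate are all assumptions chosen for illustration, and the embeddings are toy vectors rather than outputs of a real vision-language encoder.

```python
import math
import random
import string


def add_char_noise(text, rate, rng):
    """Corrupt each letter with probability `rate` (hypothetical noise model).

    A corrupted letter is substituted, deleted, or followed by an inserted
    random lowercase letter. Non-letters such as spaces are left untouched.
    """
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            op = rng.choice(("substitute", "delete", "insert"))
            if op == "substitute":
                out.append(rng.choice(string.ascii_lowercase))
            elif op == "insert":
                out.append(ch)
                out.append(rng.choice(string.ascii_lowercase))
            # "delete": append nothing, dropping the character
        else:
            out.append(ch)
    return "".join(out)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(video_emb, class_embs):
    """Return the index of the class embedding most similar to the video."""
    scores = [cosine(video_emb, c) for c in class_embs]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the corruption is repeatable
    labels = ["playing guitar", "riding a bike", "brushing teeth"]
    for rate in (0.05, 0.10, 0.20):  # assumed noise levels
        print(rate, [add_char_noise(label, rate, rng) for label in labels])

    # Toy embeddings standing in for encoder outputs.
    video = [0.9, 0.1, 0.0]
    classes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print("predicted class:", labels[classify(video, classes)])
```

In a real pipeline the class embeddings would be computed from the (possibly corrupted) label text, so typos shift the text embedding and can change which class scores highest; that is the failure mode the abstract describes.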

Taxonomy

Topics: Human Pose and Action Recognition · Natural Language Processing Techniques