DENOISER: Rethinking the Robustness for Open-Vocabulary Action Recognition
Haozhe Cheng, Cheng Ju, Haicheng Wang, Jinxiang Liu, Mengting Chen, Qiang Hu, Xiaoyun Zhang, Yanfeng Wang

TL;DR
This paper addresses the vulnerability of open-vocabulary action recognition models to noisy class descriptions by proposing a novel denoising framework that enhances robustness through iterative generation and discrimination, improving real-world applicability.
Contribution
The paper introduces DENOISER, a new framework that denoises noisy class descriptions and improves OVAR robustness, filling a gap in handling real-world noisy labels.
Findings
DENOISER significantly improves robustness against noisy class descriptions.
The iterative generation and discrimination process enhances classification accuracy.
Extensive experiments validate the effectiveness of the proposed method across datasets.
Abstract
As one of the fundamental video tasks in computer vision, Open-Vocabulary Action Recognition (OVAR) has recently gained increasing attention with the development of vision-language pre-training. To enable generalization to arbitrary classes, existing methods treat class labels as text descriptions, then formulate OVAR as evaluating embedding similarity between visual samples and textual classes. However, one crucial issue is completely ignored: the class descriptions given by users may be noisy, e.g., misspellings and typos, limiting the real-world practicality of vanilla OVAR. To fill the research gap, this paper pioneers the evaluation of existing methods by simulating multi-level noises of various types, and reveals their poor robustness. To tackle the noisy OVAR task, we further propose a novel DENOISER framework, covering two parts: generation and discrimination. Concretely, the…
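The setup described in the abstract (class labels treated as text, classification by embedding similarity, and simulated typos in the class descriptions) can be illustrated with a minimal sketch. This is not the authors' code or noise protocol: the function names `inject_noise`, `cosine`, and `classify` are hypothetical, the noise model (uniform character substitution, insertion, or deletion at a fixed rate) is an assumption, and embeddings are plain lists of floats standing in for the outputs of a vision-language model.

```python
import math
import random

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def inject_noise(text, rate, rng):
    """Corrupt each letter of `text` with probability `rate`.

    Assumed noise model (not taken from the paper): a corrupted letter is
    substituted, followed by an inserted letter, or deleted, chosen uniformly.
    Non-letter characters such as spaces are always kept unchanged.
    """
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            op = rng.choice(("substitute", "insert", "delete"))
            if op == "substitute":
                out.append(rng.choice(LETTERS))
            elif op == "insert":
                out.append(ch)
                out.append(rng.choice(LETTERS))
            # "delete": append nothing
        else:
            out.append(ch)
    return "".join(out)

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def classify(video_emb, class_embs):
    """Return the index of the class embedding most similar to the video."""
    return max(range(len(class_embs)),
               key=lambda i: cosine(video_emb, class_embs[i]))

if __name__ == "__main__":
    rng = random.Random(0)
    for rate in (0.0, 0.1, 0.3):
        print(rate, inject_noise("playing guitar", rate, rng))
```

In a robustness evaluation of this kind, the noisy strings would be passed through the text encoder in place of the clean class names, and `classify` would be run on the resulting embeddings; a drop in accuracy as `rate` grows is the sort of fragility the abstract refers to.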
Taxonomy
Topics: Human Pose and Action Recognition · Natural Language Processing Techniques
