Learning to Generalize without Bias for Open-Vocabulary Action Recognition
Yating Yu, Congqi Cao, Yifan Zhang, Yanning Zhang

TL;DR
This paper introduces Open-MeDe, a meta-optimization framework that reduces static bias in CLIP-based video learners, significantly improving open-vocabulary action recognition, especially for out-of-context actions.
Contribution
Open-MeDe employs a novel meta-learning approach with cross-batch optimization and self-ensemble to enhance generalization and mitigate static bias in open-vocabulary action recognition.
Findings
Outperforms state-of-the-art regularization methods in in-context recognition
Significantly improves out-of-context action recognition
Achieves robust generalization across diverse scenarios
Abstract
Leveraging the effective visual-text alignment and static generalizability of CLIP, recent video learners adopt CLIP initialization with further regularization or recombination to generalize in-context in open-vocabulary action recognition. However, due to the static bias of CLIP, such video learners tend to overfit to shortcut static features, thereby compromising their generalizability, especially to novel out-of-context actions. To address this issue, we introduce Open-MeDe, a novel Meta-optimization framework with static Debiasing for Open-vocabulary action recognition. From a fresh perspective of generalization, Open-MeDe adopts a meta-learning approach to improve known-to-open generalization and image-to-video debiasing in a cost-effective manner. Specifically, Open-MeDe introduces a cross-batch meta-optimization scheme that explicitly encourages video learners to quickly…
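The abstract is cut off above, so the exact form of the cross-batch scheme and of the self-ensemble is not given here. As a rough illustration of the general idea only, the sketch below shows one common way such a step can look: take a virtual gradient step on one batch, evaluate the adapted weights on the next batch, and keep a running average of the weights. Everything in it is an assumption made for illustration and is not taken from the paper: the toy linear-regression model, the first-order approximation, the learning rates, the uniform weight average, and all function names.

```python
import numpy as np

def mse_loss_and_grad(w, x, y):
    """Mean-squared error of the linear model x @ w and its gradient w.r.t. w."""
    residual = x @ w - y
    loss = float(np.mean(residual ** 2))
    grad = 2.0 * x.T @ residual / len(y)
    return loss, grad

def cross_batch_meta_step(w, batch_a, batch_b, inner_lr=0.05, outer_lr=0.05):
    """One first-order meta-step (hypothetical): adapt on batch_a, evaluate on batch_b."""
    # Virtual (inner) update computed from the meta-train batch only.
    _, grad_a = mse_loss_and_grad(w, *batch_a)
    w_virtual = w - inner_lr * grad_a
    # Evaluate the virtually adapted weights on the adjacent batch (meta-test).
    loss_b, grad_b = mse_loss_and_grad(w_virtual, *batch_b)
    # First-order approximation: apply the meta-test gradient to the original weights.
    return w - outer_lr * grad_b, loss_b

def self_ensemble(w_avg, w, step):
    """Uniform running average of the weights seen so far (step counts from 1)."""
    return w_avg + (w - w_avg) / step

# Toy data: noiseless linear regression with a fixed ground-truth weight vector.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])

def sample_batch(n=32):
    x = rng.normal(size=(n, 3))
    return x, x @ true_w

w = np.zeros(3)
w_avg = np.zeros(3)
prev = sample_batch()
losses = []
for step in range(1, 201):
    curr = sample_batch()
    # Consecutive batches play the roles of meta-train and meta-test.
    w, loss_b = cross_batch_meta_step(w, prev, curr)
    w_avg = self_ensemble(w_avg, w, step)
    losses.append(loss_b)
    prev = curr
```

In this toy setting the update is driven by the loss on a batch the virtual step never saw, which is the sense in which such a step rewards weights that transfer across batches. A real video learner would replace the linear model with a CLIP-initialized network and an autograd framework.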
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Human Motion and Animation
Methods: Contrastive Language-Image Pre-training · ADaptive gradient method with the OPTimal convergence rate
