REST: REtrieve & Self-Train for generative action recognition
Adrian Bulat, Enrique Sanchez, Brais Martinez, Georgios Tzimiropoulos

TL;DR
This paper introduces REST, a novel self-training framework that adapts generative vision-language models for open-world, caption-based video action recognition without requiring labeled data.
Contribution
To the authors' knowledge, this is the first method to adapt generative V&L models for action recognition. It uses pseudo-captioning and self-training to avoid the overfitting seen with direct fine-tuning and to enable zero-shot recognition.
Findings
REST achieves competitive zero-shot action recognition performance.
Both pseudo-caption generation and retrieval are essential for high accuracy.
The method outperforms contrastive learning approaches in open-world settings.
Abstract
This work is on training a generative action/video recognition model whose output is a free-form action-specific caption describing the video (rather than an action class label). A generative approach has practical advantages like producing more fine-grained and human-readable output, and being naturally open-world. To this end, we propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition. While recently there have been a few attempts to adapt V&L models trained with contrastive learning (e.g. CLIP) for video/action, to the best of our knowledge, we propose the very first method that sets out to accomplish this goal for a generative model. We first show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting. To alleviate this, we introduce REST, a training framework consisting of…
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
Methods: Contrastive Learning · Contrastive Language-Image Pre-training
