REST: REtrieve & Self-Train for generative action recognition

Adrian Bulat; Enrique Sanchez; Brais Martinez; Georgios Tzimiropoulos

arXiv:2209.15000·cs.CV·September 30, 2022

Open Access

TL;DR

This paper introduces REST, a novel self-training framework that adapts generative vision-language models for open-world, caption-based video action recognition without requiring labeled data.

Contribution

To the best of the authors' knowledge, this is the first method to adapt generative V&L models for action recognition; it uses pseudo-captioning and self-training to avoid the overfitting seen with direct fine-tuning and to enable zero-shot recognition.

Findings

01. REST achieves competitive zero-shot action recognition performance.

02. Both pseudo-caption generation and retrieval are essential for high accuracy.

03. The method outperforms contrastive learning approaches in open-world settings.

Abstract

This work addresses training a generative action/video recognition model whose output is a free-form, action-specific caption describing the video (rather than an action class label). A generative approach has practical advantages, such as producing more fine-grained and human-readable output and being naturally open-world. To this end, we propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition. While there have recently been a few attempts to adapt V&L models trained with contrastive learning (e.g. CLIP) for video/action recognition, to the best of our knowledge, we propose the very first method that sets out to accomplish this goal for a generative model. We first show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting. To alleviate this, we introduce REST, a training framework consisting of…

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis

Methods: Contrastive Learning · Contrastive Language-Image Pre-training