GenRec: Unifying Video Generation and Recognition with Diffusion Models
Zejia Weng, Xitong Yang, Zhen Xing, Zuxuan Wu, Yu-Gang Jiang

TL;DR
GenRec is a unified diffusion-based framework that enhances video recognition and generation by learning generalized spatial-temporal representations, demonstrating robustness and competitive performance on multiple benchmarks.
Contribution
It introduces the first unified model trained with random-frame conditioning to jointly learn video generation and recognition capabilities.
Findings
Achieves 75.8% accuracy on SSV2 and 87.2% on K400 for recognition.
Sets new state-of-the-art FVD scores for class-conditioned video generation.
Demonstrates robustness with limited input frames.
Abstract
Video diffusion models are able to generate high-quality videos by learning strong spatial-temporal priors on large-scale datasets. In this paper, we aim to investigate whether such priors derived from a generative process are suitable for video recognition, and eventually joint optimization of generation and recognition. Building upon Stable Video Diffusion, we introduce GenRec, the first unified framework trained with a random-frame conditioning process so as to learn generalized spatial-temporal representations. The resulting framework naturally supports both generation and recognition and, more importantly, is robust even when visual inputs contain limited information. Extensive experiments demonstrate the efficacy of GenRec for both recognition and generation. In particular, GenRec achieves competitive recognition performance, offering 75.8% and 87.2% accuracy on SSV2 and K400,…
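The abstract does not spell out how random-frame conditioning is implemented. As a rough illustration only, the sketch below assumes it amounts to keeping a randomly sized, randomly chosen subset of frames visible as conditioning and replacing the remaining frames with a placeholder. The function names (`sample_condition_mask`, `mask_frames`), the `min_visible` parameter, and the zero fill value are hypothetical and not taken from the paper.

```python
import random


def sample_condition_mask(num_frames, rng, min_visible=0):
    """Return a list of booleans; True marks a frame kept visible as conditioning.

    Assumption: the number of visible frames is drawn uniformly from
    [min_visible, num_frames]; the paper may use a different scheme.
    """
    if not 0 <= min_visible <= num_frames:
        raise ValueError("min_visible must lie in [0, num_frames]")
    num_visible = rng.randint(min_visible, num_frames)  # bounds are inclusive
    visible = set(rng.sample(range(num_frames), num_visible))
    return [i in visible for i in range(num_frames)]


def mask_frames(frames, mask, fill=0.0):
    """Copy visible frames and replace hidden ones with a constant placeholder."""
    return [
        list(frame) if keep else [fill] * len(frame)
        for frame, keep in zip(frames, mask)
    ]


# Toy usage: 8 "frames" of 4 values each stand in for real video latents.
rng = random.Random(0)
frames = [[float(t)] * 4 for t in range(8)]
mask = sample_condition_mask(len(frames), rng, min_visible=1)
conditioned = mask_frames(frames, mask)
```

Under this assumption, a model trained on such inputs sees anything from a single frame to the full clip, which is one plausible reading of why the abstract reports robustness when visual inputs contain limited information.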
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Advanced Data Compression Techniques
Methods: Diffusion
