GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling

Shivanshu Shekhar; Uttaran Bhattacharya; Raghavendra Addanki; Mehrab Tanjim; Somdeb Sarkhel; Tong Zhang

arXiv:2602.05202 · cs.CV · February 6, 2026

TL;DR

This paper introduces GT-SVJ, a novel approach that repurposes video generative models as reward models for evaluating video quality, leveraging their temporal modeling capabilities and contrastive training to outperform existing methods with fewer annotations.

Contribution

The paper presents a new method that transforms video generative models into temporally-aware reward models using energy-based reformulation and contrastive learning, reducing reliance on large annotated datasets.

Findings

01 Achieves state-of-the-art performance on GenAI-Bench and MonteBench.

02 Uses only 30K human annotations, significantly fewer than existing methods.

03 Effectively discriminates video quality through synthetic negative video generation.

Abstract

Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (GT-SVJ), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally-aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model…
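
The energy-based view in the abstract can be illustrated with a small sketch. The excerpt does not give the paper's actual training objective, so the pairwise loss below is an assumption: a generic logistic contrastive loss that is small when a high-quality video receives lower energy than its degraded counterpart. The function names (`softplus`, `contrastive_energy_loss`, `preference_accuracy`) and the scalar energies are hypothetical; in practice the energies would come from the repurposed generative model.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def contrastive_energy_loss(energy_pos, energy_neg):
    """Mean pairwise loss softplus(E_pos - E_neg).

    Assumed objective, not taken from the paper: it equals
    -log sigmoid(E_neg - E_pos), so it shrinks as the high-quality
    video (pos) gets lower energy than the degraded one (neg).
    """
    if len(energy_pos) != len(energy_neg) or not energy_pos:
        raise ValueError("need two non-empty lists of equal length")
    total = sum(softplus(p - n) for p, n in zip(energy_pos, energy_neg))
    return total / len(energy_pos)


def preference_accuracy(energy_pos, energy_neg):
    # Fraction of pairs where the high-quality video has strictly lower energy.
    hits = sum(1 for p, n in zip(energy_pos, energy_neg) if p < n)
    return hits / len(energy_pos)


if __name__ == "__main__":
    # Made-up energies for three (high-quality, degraded) video pairs.
    pos = [0.2, 0.5, 1.0]
    neg = [1.5, 0.4, 2.0]
    print(contrastive_energy_loss(pos, neg))
    print(preference_accuracy(pos, neg))
```

Under this formulation a reward can be read off as the negative energy, so ranking two videos reduces to comparing their energies.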

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Pose and Action Recognition