PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards

Minh-Quan Le; Gaurav Mittal; Cheng Zhao; David Gu; Dimitris Samaras; Mei Chen

arXiv:2602.01624 · cs.CV · February 3, 2026

TL;DR

PISCES is an annotation-free post-training method for text-to-video generation. It uses optimal transport to align reward signals with human judgment, improving video quality and semantic consistency.

Contribution

The paper proposes a Dual OT-aligned Rewards module that strengthens reward supervision for text-to-video generation without annotations, applying optimal transport at both the distributional and token levels.

Findings

01. Outperforms existing methods on VBench in quality and semantic scores.

02. Validates effectiveness through human preference studies.

03. Compatible with multiple optimization paradigms.

Abstract

Text-to-video (T2V) generation aims to synthesize videos with high visual quality and temporal consistency that are semantically aligned with input text. Reward-based post-training has emerged as a promising direction to improve the quality and semantic alignment of generated videos. However, recent methods either rely on large-scale human preference annotations or operate on misaligned embeddings from pre-trained vision-language models, leading to limited scalability or suboptimal supervision. We present PISCES, an annotation-free post-training algorithm that addresses these limitations via a novel Dual Optimal Transport (OT)-aligned Rewards module. To align reward signals with human judgment, PISCES uses OT to bridge text and video embeddings at both distributional and discrete token levels, enabling reward supervision to fulfill two objectives: (i) a…
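
To make the idea of bridging text and video embeddings with OT at the token level more concrete, here is a minimal sketch of entropic optimal transport (Sinkhorn iterations) between two small sets of token embeddings. It is not the PISCES reward module, whose details are not given in the abstract. The function names (`cosine_cost`, `sinkhorn`, `ot_alignment_score`), the cosine cost, the uniform marginals, and the value of `epsilon` are all assumptions chosen for illustration.

```python
import math


def cosine_cost(text_tokens, video_tokens):
    """Cost matrix C[i][j] = 1 - cosine similarity (an assumed cost choice)."""
    def norm(vec):
        # Fall back to 1.0 for a zero vector to avoid division by zero.
        return math.sqrt(sum(x * x for x in vec)) or 1.0

    cost = []
    for t in text_tokens:
        row = []
        for v in video_tokens:
            dot = sum(a * b for a, b in zip(t, v))
            row.append(1.0 - dot / (norm(t) * norm(v)))
        cost.append(row)
    return cost


def sinkhorn(cost, epsilon=0.1, n_iters=200):
    """Entropic OT with uniform marginals; returns the transport plan."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass over text tokens (assumption)
    b = [1.0 / m] * m  # uniform mass over video tokens (assumption)
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_alignment_score(text_tokens, video_tokens, epsilon=0.1):
    """Negative transport cost: higher means the two token sets align better."""
    cost = cosine_cost(text_tokens, video_tokens)
    plan = sinkhorn(cost, epsilon)
    total = sum(
        plan[i][j] * cost[i][j]
        for i in range(len(cost))
        for j in range(len(cost[0]))
    )
    return -total


# Toy 2-D embeddings: the "matched" video covers both text directions,
# the "off" video covers only one of them.
text = [[1.0, 0.0], [0.0, 1.0]]
video_matched = [[1.0, 0.0], [0.0, 1.0]]
video_off = [[1.0, 0.0], [1.0, 0.0]]

score_matched = ot_alignment_score(text, video_matched)
score_off = ot_alignment_score(text, video_off)
```

In this toy setup the matched video receives the higher score, because every text token can be transported to a nearby video token at low cost. A real reward would operate on learned embeddings and would need to be differentiable with respect to the generator, which this sketch does not address.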

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis