Video Consistency Distance: Enhancing Temporal Consistency for Image-to-Video Generation via Reward-Based Fine-Tuning

Takehiro Aoshima; Yusuke Shinohara; Byeongseon Park

arXiv:2510.19193·cs.CV·October 24, 2025

Open Access

TL;DR

This paper introduces Video Consistency Distance (VCD), a novel frequency-domain metric for improving temporal consistency in image-to-video generation, and demonstrates its effectiveness through reward-based fine-tuning without sacrificing quality.

Contribution

The paper proposes VCD, a new frequency-space metric for temporal consistency, and applies it in reward-based fine-tuning to enhance video coherence in I2V tasks.

Findings

01. VCD significantly improves temporal consistency in generated videos.

02. Fine-tuning with VCD does not degrade overall video quality.

03. Experimental results validate VCD's effectiveness across multiple datasets.

Abstract

Reward-based fine-tuning of video diffusion models is an effective approach to improving the quality of generated videos, as it can fine-tune models without requiring real-world video datasets. However, its gains can be limited to particular aspects of performance, because conventional reward functions mainly aim to enhance quality across the whole generated video sequence, such as aesthetic appeal and overall consistency. Notably, the temporal consistency of the generated video often suffers when previous approaches are applied to image-to-video (I2V) generation tasks. To address this limitation, we propose Video Consistency Distance (VCD), a novel metric designed to enhance temporal consistency, and fine-tune a model with the reward-based fine-tuning framework. To achieve coherent temporal consistency relative to a conditioning image, VCD is defined in the frequency space of video frame…
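The abstract is cut off before VCD is defined, so the following is not the paper's formula. It is only a minimal sketch of what a frequency-space comparison against a conditioning frame can look like, and of how such a distance might be turned into a reward. Everything in it is an assumption made for illustration: raw grayscale pixels stand in for whatever frame representation the paper uses, the distance is a plain mean absolute difference of amplitude spectra, and the function names (`dft2_amplitude`, `frequency_distance`, `illustrative_consistency_distance`, `consistency_reward`) are hypothetical.

```python
import cmath
import math


def dft2_amplitude(frame):
    """Amplitude spectrum of a small 2D grayscale frame via a naive 2D DFT.

    Assumption: frames are tiny lists of lists, so an O(n^4) DFT is fine here.
    """
    h, w = len(frame), len(frame[0])
    amplitude = []
    for u in range(h):
        row = []
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2.0 * math.pi * (u * y / h + v * x / w)
                    total += frame[y][x] * cmath.exp(1j * angle)
            row.append(abs(total))
        amplitude.append(row)
    return amplitude


def frequency_distance(reference, frame):
    """Mean absolute difference between two frames' amplitude spectra."""
    a = dft2_amplitude(reference)
    b = dft2_amplitude(frame)
    h, w = len(a), len(a[0])
    return sum(abs(a[u][v] - b[u][v]) for u in range(h) for v in range(w)) / (h * w)


def illustrative_consistency_distance(video):
    """Average spectral distance of every later frame to the first frame.

    The first frame plays the role of the conditioning image (an assumption).
    """
    reference, later = video[0], video[1:]
    return sum(frequency_distance(reference, f) for f in later) / len(later)


def consistency_reward(video):
    """Hypothetical reward: smaller distance means larger reward."""
    return -illustrative_consistency_distance(video)


# A static clip versus a clip whose second frame differs from the first.
static = [[[0, 1], [1, 0]]] * 3
changing = [[[0, 1], [1, 0]], [[1, 1], [1, 1]]]
print(illustrative_consistency_distance(static))
print(illustrative_consistency_distance(changing))
```

For the static clip the distance is exactly 0.0, since every frame has the same spectrum as the reference; the changing clip yields a positive distance and therefore a lower reward. Note that an amplitude spectrum ignores circular shifts, so this toy distance would not penalize pure translation; whether and how the paper handles phase is not recoverable from the truncated abstract.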

Taxonomy

Topics: Image and Video Quality Assessment · Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques