GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning
Xiaoyi Bao, Jindi Lv, Xiaofeng Wang, Zheng Zhu, Xinze Chen, YuKun Zhou, Jiancheng Lv, Xingang Wang, Guan Huang

TL;DR
GigaVideo-1 introduces an efficient, annotation-free fine-tuning framework for video diffusion models that leverages automatic feedback to enhance multiple video quality dimensions with minimal computational resources.
Contribution
It presents a novel automatic feedback-based fine-tuning method that improves pre-trained video diffusion models without human annotations or large datasets.
Findings
Achieves about 4% performance improvement across evaluation dimensions.
Requires only 4 GPU-hours for fine-tuning.
No manual annotations needed, demonstrating high efficiency.
Abstract
Recent progress in diffusion models has greatly enhanced video generation quality, yet these models still require fine-tuning to improve specific dimensions like instance preservation, motion rationality, composition, and physical plausibility. Existing fine-tuning approaches often rely on human annotations and large-scale computational resources, limiting their practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning framework that advances video generation without additional human supervision. Rather than injecting large volumes of high-quality data from external sources, GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models through automatic feedback. Specifically, we focus on two key aspects of the fine-tuning process: data and optimization. To improve fine-tuning data, we design a prompt-driven data engine that constructs diverse,…
Peer Reviews
Decision · Submitted to ICLR 2026
- 4 GPU-hours is orders of magnitude cheaper than prior SFT/RL works, making the method attractive for practitioners.
- The prompt engine explicitly amplifies failure modes, leading to a stronger training signal than random web videos.
- 17 dimensions, 5 strong baselines, user study, ablation of data source & reward strategy, and tests on four different architectures (2B–13B).
- Sec. 4.3 shows that mixing synthetic prompts with synthetic videos ($P_sV_s$+$P_rV_s$) actually hurts accuracy, hinting that some LLM-generated captions are too exotic and push the model away from realism.
- How do you filter or validate the LLM-generated captions to prevent physically impossible or nonsensical queries (e.g., “a person with three elbows”)? Could such cases bias the model toward hallucination?
- Have you tried a single, unified reward model (e.g., training a small diffusion cr
1. The paper is presented with clarity and is easy to understand. The contributions are clearly articulated (1. the data engine; 2. the optimization method).
2. The framework demonstrates strong generalizability. As validated in the appendix, it brings consistent performance gains when applied to various video model backbones (e.g., CogVideoX, HunyuanVideo), proving it is a versatile and portable solution rather than a model-specific trick.
3. A major contribution of this paper lies in its wel
1. I'm curious about the statement on **line 314**: "the synthetic dataset is generated by different pre-trained T2V models." Specifically, which T2V models were used for this purpose? I mean, if you're fine-tuning a Wan2.1-1.3B model but the synthetic data is generated using Wan2.1-14B, wouldn't the time required for synthetic data generation be excessively long?
2. Another concern centers on the unvalidated effectiveness of the MLLM-based evaluation. Without a correlation analysis between MLL
This paper is complete, constructing an end-to-end automated fine-tuning system that includes all stages from data generation and evaluation to optimization. The experimental results on VBench 2.0 seem good.
1. Lack of core innovation. This is the most critical flaw, and the authors need to clarify their contributions and their significance. From my perspective, the method proposed in this paper is essentially "Automated DPO/RWR". The entire pipeline can be summarized as: LLM generates targeted prompts -> base model generates the videos -> MLLM scores -> score-weighted loss training. Every module in this paradigm is off-the-shelf, and the way they are combined is straightforward.
2. Lack of in-depth experi
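The last stage of the pipeline as this review summarizes it (MLLM scores feeding a score-weighted loss) can be illustrated with a minimal sketch. This is not the authors' implementation: the softmax weighting, the `temperature` parameter, and the function names `reward_weights` and `score_weighted_loss` are assumptions chosen for illustration, in the spirit of reward-weighted regression (RWR). The per-sample losses stand in for whatever diffusion training loss the base model uses.

```python
import math
from typing import List


def reward_weights(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Map per-sample scores to normalized weights with a softmax.

    Assumption: a softmax with a temperature is one common RWR-style choice;
    the paper's actual weighting scheme is not given in this summary.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def score_weighted_loss(
    per_sample_losses: List[float],
    scores: List[float],
    temperature: float = 1.0,
) -> float:
    """Weighted sum of per-sample losses, so higher-scored samples count more."""
    if len(per_sample_losses) != len(scores):
        raise ValueError("losses and scores must have the same length")
    weights = reward_weights(scores, temperature)
    return sum(w * loss for w, loss in zip(weights, per_sample_losses))


if __name__ == "__main__":
    # Hypothetical values: three generated videos with automatic scores in [0, 1].
    losses = [0.8, 0.4, 0.6]
    scores = [0.2, 0.9, 0.5]
    print(reward_weights(scores))
    print(score_weighted_loss(losses, scores))
```

With equal scores the weights are uniform and the result reduces to the plain mean loss; lowering `temperature` concentrates the weight on the highest-scored samples.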
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Image and Video Quality Assessment
Methods: Focus · Diffusion
