A Reason-then-Describe Instruction Interpreter for Controllable Video Generation
Shengqiong Wu, Weicai Ye, Yuanxing Zhang, Jiahao Wang, Quande Liu, Xintao Wang, Pengfei Wan, Kun Gai, Hao Fei, Tat-Seng Chua

TL;DR
ReaDe is a universal interpreter that converts ambiguous user instructions into detailed, actionable specifications, significantly improving controllability and fidelity in diffusion-based video generation.
Contribution
It introduces a reason-then-describe paradigm and a two-stage training process to enhance instruction understanding and controllability in video synthesis models.
Findings
Improved instruction fidelity and caption accuracy.
Enhanced downstream video quality.
Strong generalization to complex and unseen inputs.
Abstract
Diffusion Transformers have significantly improved video fidelity and temporal coherence; however, practical controllability remains limited. Concise, ambiguous, and compositionally complex user inputs contrast with the detailed prompts used in training, yielding an intent-output mismatch. We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators. ReaDe follows a reason-then-describe paradigm: it first analyzes the user request to identify core requirements and resolve ambiguities, then produces detailed guidance that enables faithful, controllable generation. We train ReaDe via a two-stage optimization: (i) reasoning-augmented supervision imparts analytic parsing with stepwise traces and dense captions, and (ii) a multi-dimensional reward assigner enables stable, feedback-driven…
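The reason-then-describe flow and the multi-dimensional reward described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Interpretation` type, the `reason` and `describe` callables, the weighted-average aggregation, and the weight values are all assumptions made for illustration. Only the reward names `user` and `detail` echo the $R_{\text{user}}$ and $R_{\text{detail}}$ terms mentioned in the reviews.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Interpretation:
    """Hypothetical container for the interpreter's two outputs."""
    reasoning: str  # stepwise analysis of the raw instruction
    caption: str    # dense caption passed to a downstream video generator


def reason_then_describe(
    instruction: str,
    reason: Callable[[str], str],
    describe: Callable[[str, str], str],
) -> Interpretation:
    """Run the two steps in order: analyze first, then write the caption.

    `reason` and `describe` stand in for model calls (assumed, not from the paper).
    """
    reasoning = reason(instruction)
    caption = describe(instruction, reasoning)
    return Interpretation(reasoning=reasoning, caption=caption)


def combine_rewards(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension reward scores.

    The weighted-average form is an assumption; the paper's actual reward
    assigner may combine its dimensions differently.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights for the scored dimensions must sum to a positive value")
    return sum(weights[name] * scores[name] for name in scores) / total_weight


# Stub callables used in place of a real multimodal model.
result = reason_then_describe(
    "a cat jumps",
    reason=lambda text: f"Subject and action identified in: {text}",
    describe=lambda text, analysis: f"Detailed caption for '{text}' ({analysis})",
)

# Illustrative scores and weights only; none of these values come from the paper.
reward = combine_rewards(
    scores={"user": 0.8, "detail": 0.4},
    weights={"user": 2.0, "detail": 1.0},
)
```

Keeping the reasoning trace separate from the caption means a reward function can score each part on its own, and only the caption needs to be handed to the video generator.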
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The method is well-motivated: it leverages MLLMs to handle multimodal inputs and uses RL to enhance generalization, resulting in more faithful and controllable video generation.
- Comprehensive experiments demonstrate ReaDe’s effectiveness across multiple metrics and condition types, outperforming competitive baselines like Any2Caption.
- The claim of being *model-agnostic* is insufficiently supported. Quantitative experiments only use CogVideoX-2B; no results are shown for other video generators, limiting the generalizability of this claim.
- The ablation study lacks rigorous comparisons to justify the statement that $R_{\text{user}}$ and $R_{\text{detail}}$ contribute the most. Table 6 does not include controlled experiments that isolate the effect of individual reward components, making the conclusion less convincing.
1. The paper is well written.
2. The authors use reinforcement learning to improve the accuracy of the prompts produced by the interpreter, and the experiments show that this yields better results.
1. From an insight perspective, it has been validated by many previous works that accurate prompts lead to better generation results. Therefore, designing a more accurate VLM has not brought additional benefits to this task.
2. The data used is sourced from GPT-4o, so I believe this ability is a distillation of 4o on specific tasks, which isn't particularly interesting.
3. Improving the video captioning ability of VLMs has been the focus of much research before. Moreover, compared to previous
The paper proposes ReaDe, the first universal video instruction interpreter for controllable video generation. The presentation is clear, coherent, and well-structured. The work introduces new data construction and reward design strategies. The experimental comparisons are comprehensive and sufficient.
The paper does not include a reward curve, which is important for illustrating training dynamics and stability. It remains unclear whether there are any reward conflicts or reward hacking phenomena during optimization. Fine-tuning an open-sourced model (e.g., Wan) on the refined caption dataset generated by the proposed method could provide a more convincing analysis. Currently, the refined captions are only used during inference, which is not sufficient to fully validate the method’s effectiv
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Motion and Animation
