FlowBlending: Stage-Aware Multi-Model Sampling for Fast and High-Fidelity Video Generation

Jibin Song; Mingi Kwon; Jaeseok Jeong; Youngjung Uh

arXiv:2512.24724 · cs.CV · January 1, 2026

TL;DR

FlowBlending is a stage-aware multi-model sampling method that runs a large model in the capacity-sensitive early and late sampling stages and a small model in the intermediate stage, giving up to 1.65x faster inference while preserving the large model's quality.

Contribution

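The paper proposes a stage-aware sampling strategy that matches model capacity to each stage's sensitivity: the large model runs only where capacity matters (the early and late stages) and a small model covers the rest, which improves efficiency in high-fidelity video generation.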
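
The stage-aware idea can be illustrated with a minimal sketch. It assumes a plain Euler flow sampler that integrates from t = 0 to t = 1, and it treats the two models as generic velocity functions. The names `pick_model`, `stage_aware_euler`, `early_steps`, and `late_steps` are hypothetical and do not come from the paper, which also defines its own criteria for choosing the stage boundaries.

```python
from typing import Callable, List, Tuple

Vector = List[float]
# A velocity model maps (state, time) to a velocity of the same length.
VelocityFn = Callable[[Vector, float], Vector]


def pick_model(step: int, num_steps: int, early_steps: int, late_steps: int) -> str:
    """Return "large" for the first `early_steps` and last `late_steps` steps."""
    if step < early_steps or step >= num_steps - late_steps:
        return "large"
    return "small"


def stage_aware_euler(
    x: Vector,
    large: VelocityFn,
    small: VelocityFn,
    num_steps: int,
    early_steps: int,
    late_steps: int,
) -> Tuple[Vector, List[str]]:
    """Euler-integrate dx/dt = v(x, t) over [0, 1], switching models by stage.

    Assumption: the time direction and uniform step size are illustrative;
    real samplers for specific video models may use other schedules.
    """
    dt = 1.0 / num_steps
    x = list(x)
    schedule: List[str] = []
    for step in range(num_steps):
        t = step * dt
        name = pick_model(step, num_steps, early_steps, late_steps)
        velocity = large(x, t) if name == "large" else small(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
        schedule.append(name)
    return x, schedule


if __name__ == "__main__":
    # Toy constant velocity fields standing in for real networks.
    toy_large = lambda x, t: [1.0 for _ in x]
    toy_small = lambda x, t: [0.5 for _ in x]
    sample, schedule = stage_aware_euler([0.0, 0.0], toy_large, toy_small, 10, 3, 2)
    print(schedule)
```

In this sketch the only change from a single-model sampler is the per-step model choice, so the saving comes entirely from how many intermediate steps are handed to the small model.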

Findings

01. Achieves up to 1.65x faster inference
02. Reduces FLOPs by 57.35%
03. Maintains visual and semantic quality

Abstract

In this work, we show that the impact of model capacity varies across timesteps: it is crucial for the early and late stages but largely negligible during the intermediate stage. Accordingly, we propose FlowBlending, a stage-aware multi-model sampling strategy that employs a large model and a small model at capacity-sensitive stages and intermediate stages, respectively. We further introduce simple criteria to choose stage boundaries and provide a velocity-divergence analysis as an effective proxy for identifying capacity-sensitive regions. Across LTX-Video (2B/13B) and WAN 2.1 (1.3B/14B), FlowBlending achieves up to 1.65x faster inference with 57.35% fewer FLOPs, while maintaining the visual fidelity, temporal coherence, and semantic alignment of the large models. FlowBlending is also compatible with existing sampling-acceleration techniques, enabling up to 2x additional speedup.…
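
The velocity-divergence analysis mentioned in the abstract could be approximated along the following lines. This is a sketch under stated assumptions, not the paper's definition: it assumes that "divergence" means the mean squared difference between the large and small models' predicted velocities at the same state and timestep, and that a fixed threshold marks a step as capacity-sensitive. The function names and the threshold are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def velocity_divergence(v_large: Vector, v_small: Vector) -> float:
    """Mean squared difference between two velocity predictions (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(v_large, v_small)) / len(v_large)


def divergence_profile(
    states: Sequence[Vector],
    timesteps: Sequence[float],
    large: VelocityFn,
    small: VelocityFn,
) -> List[float]:
    """Evaluate both models on the same (state, t) pairs and compare them."""
    return [
        velocity_divergence(large(x, t), small(x, t))
        for x, t in zip(states, timesteps)
    ]


def sensitive_steps(profile: Sequence[float], threshold: float) -> List[int]:
    """Indices of steps whose divergence exceeds a hypothetical threshold."""
    return [i for i, d in enumerate(profile) if d > threshold]


if __name__ == "__main__":
    # Toy models that disagree most at the ends of the trajectory.
    toy_large = lambda x, t: [1.0 for _ in x]
    toy_small = lambda x, t: [1.0 - abs(2.0 * t - 1.0) for _ in x]
    ts = [0.0, 0.25, 0.5, 0.75, 1.0]
    xs = [[0.0, 0.0] for _ in ts]
    profile = divergence_profile(xs, ts, toy_large, toy_small)
    print(profile, sensitive_steps(profile, 0.5))
```

Steps flagged by such a profile would be candidates for the large model, while low-divergence steps could be assigned to the small one.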

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Coding and Compression Technologies · Advanced Vision and Imaging