BachVid: Training-Free Video Generation with Consistent Background and Character
Han Yan, Xibin Song, Yifu Wang, Hongdong Li, Pan Ji, Chao Ma

TL;DR
BachVid is a training-free method that ensures consistent background and character in generated videos by leveraging DiT's attention mechanism to reuse intermediate features, eliminating the need for reference images or training.
Contribution
It introduces the first training-free approach for consistent video generation that exploits DiT's features to maintain background and character consistency across multiple videos.
Findings
Achieves robust background and character consistency in generated videos.
Does not require reference images or additional training.
Leverages attention mechanisms to reuse intermediate features.
Abstract
Diffusion Transformers (DiTs) have recently driven significant progress in text-to-video (T2V) generation. However, generating multiple videos with consistent characters and backgrounds remains a significant challenge. Existing methods typically rely on reference images or extensive training, and often only address character consistency, leaving background consistency to image-to-video models. We introduce BachVid, the first training-free method that achieves consistent video generation without needing any reference images. Our approach is based on a systematic analysis of DiT's attention mechanism and intermediate features, revealing its ability to extract foreground masks and identify matching points during the denoising process. Our method leverages this finding by first generating an identity video and caching the intermediate variables, and then injecting these cached variables into…
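The cache-then-inject idea described above can be illustrated with a toy example. This is a minimal sketch and not the authors' implementation: the abstract does not spell out the injection rule, so the code assumes one common choice, namely concatenating the cached keys and values from the identity pass to those of the new video inside a single attention layer. The names `KVCache` and `attention_with_injection`, the `(layer, step)` indexing, and all numbers are hypothetical.

```python
import math


def attention(q, k, v):
    """Scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


class KVCache:
    """Hypothetical store for keys/values saved during the identity pass."""

    def __init__(self):
        self._store = {}

    def put(self, layer, step, k, v):
        self._store[(layer, step)] = (k, v)

    def get(self, layer, step):
        return self._store.get((layer, step))


def attention_with_injection(q, k, v, cached=None):
    """Assumed injection rule: append cached identity K/V to the current K/V."""
    if cached is None:
        return attention(q, k, v)
    k_id, v_id = cached
    return attention(q, k + k_id, v + v_id)


# Identity pass: save K/V for one (layer, denoising step) pair.
cache = KVCache()
cache.put(layer=3, step=0,
          k=[[1.0, 0.0], [0.0, 1.0]],
          v=[[10.0, 0.0], [0.0, 10.0]])

# Later video: the same layer and step now also attends to the cached tokens.
q, k, v = [[1.0, 0.0]], [[0.0, 0.0]], [[0.0, 0.0]]
plain = attention(q, k, v)
injected = attention_with_injection(q, k, v, cache.get(3, 0))
```

In this toy setup the query aligns best with the first cached key, so the injected output is pulled toward that token's value, whereas the plain output sees only the new video's own tokens. Restricting `put` to a subset of layers and steps is what keeps the cache small.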
Peer Reviews
Decision: Submitted to ICLR 2026
- The paper tackles a practical problem: generating videos with consistency in both the character and the background.
- The systematic analysis of the video DiT's internal mechanisms is a nice contribution. Pinpointing which layers and timesteps are most effective for foreground mask extraction, point matching, and key-value injection provides valuable insights that could be useful for other related video generation and editing tasks.
- The qualitative results, particularly the video quality shown in the supplementary material, appear quite low and suffer from artifacts. Since the method is presented as a general, training-free technique, it is surprising that it wasn't demonstrated on more powerful, state-of-the-art open models (e.g., Wan 2.2). Testing only on CogVideoX-5B makes it hard to judge if the method is truly generalizable or if its effectiveness is limited to a specific model architecture.
- The motivation for such
1. The paper addresses a concrete and useful gap: multi-video consistency (both background and character) without training or reference images.
2. Systematic analysis of which DiT layers / timesteps encode masks and correspondences is useful and could inform other work.
3. The vital layer selection to bound cached KV storage is practical.
1. The core mechanism (cache keys/values from an identity video and inject them later) closely follows prior image methods (e.g., CharaConsist); the novelty for video is primarily empirical and lies in heuristics (layer and timestep selection, mapping). The paper should more clearly state what is fundamentally new versus prior work and justify why the video extension is non-trivial.
2. Choices such as the first 15 layers and \tau_{mask}, \tau_{match} appear ad hoc. The paper lacks principled selection criteria, se
The proposed method is entirely training-free and does not rely on any reference images. While ensuring high efficiency, it simultaneously addresses the dual-consistency challenge of both characters and backgrounds.
1. In quantitative experiments, BachVid does not surpass existing identity-preserving methods (such as TFIGE and ConsisID) on the identity consistency metric (Face-Arc). Can BachVid's background consistency mechanism be combined with the character consistency mechanisms of these methods, and would the resulting performance be better than that of the proposed approach?
2. Although BachVid requires no training, it relies on a static "identity video" as a template. When the actions in subsequent v
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis
