Compositional Video Generation via Inference-Time Guidance
Ariel Shaulov, Eitan Shaar, Amit Edenzon, Gal Chechik, Lior Wolf

TL;DR
This paper introduces CVG, a method that improves compositional accuracy in text-to-video diffusion models by guiding the denoising process with internal attention signals, without retraining the generator.
Contribution
CVG leverages internal cross-attention maps and a lightweight classifier to steer video generation towards better compositional fidelity during inference.
Findings
Enhanced compositional faithfulness in generated videos.
Maintained visual quality while improving prompt alignment.
Transferability of the classifier across related composition labels.
Abstract
Text-to-video diffusion models generate realistic videos, but often fail on prompts requiring fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion directions. We hypothesize that these failures need not be addressed by retraining the generator, but can instead be mitigated by steering the denoising process using the model's own internal grounding signals. We propose CVG, an inference-time guidance method for improving compositional faithfulness in frozen text-to-video models. Our key observation is that cross-attention maps already encode how prompt concepts are grounded across space and time. We train a lightweight compositional classifier on these attention features and use its gradients during early denoising steps to steer the latent trajectory toward the desired composition. Built on a frozen VLM backbone, the…
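The guidance idea described in the abstract can be illustrated with a toy sketch. The Python code below is not the authors' implementation. The softmax projection `W_attn` standing in for cross-attention maps, the logistic classifier (`w`, `b`), the contraction used as a denoising step, and the parameters `guided_steps` and `scale` are all assumptions made for illustration. It shows only the general pattern: compute the gradient of a classifier's log-probability with respect to the latent through the attention features, and add it to the latent during the first few denoising steps.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_features(z, W_attn):
    # Hypothetical stand-in for cross-attention maps: a softmax over
    # a linear projection of the latent z.
    return softmax(W_attn @ z)

def classifier_log_prob_and_grad(z, W_attn, w, b):
    """Log-probability of the desired composition under a toy logistic
    classifier on attention features, and its gradient w.r.t. z."""
    a = attention_features(z, W_attn)
    logit = w @ a + b
    p = 1.0 / (1.0 + np.exp(-logit))
    # Chain rule: d log(p)/d logit = 1 - p,
    # softmax Jacobian applied to w gives a * (w - w.a),
    # then back through the linear projection W_attn.
    grad_pre_softmax = a * (w - w @ a)
    grad_z = (1.0 - p) * (W_attn.T @ grad_pre_softmax)
    return np.log(p), grad_z

def toy_denoise_step(z):
    # Placeholder for one step of a frozen generator (assumption:
    # a simple contraction, not a real diffusion update).
    return 0.95 * z

def guided_sampling(z, W_attn, w, b, num_steps=20, guided_steps=5, scale=0.5):
    """Run the toy denoiser, adding the classifier gradient to the
    latent only during the first `guided_steps` steps."""
    for step in range(num_steps):
        if step < guided_steps:
            _, grad = classifier_log_prob_and_grad(z, W_attn, w, b)
            z = z + scale * grad
        z = toy_denoise_step(z)
    return z
```

In this sketch the generator is never updated: only the latent `z` is nudged, which mirrors the frozen-model setting the abstract describes. Restricting the nudge to early steps follows the abstract's statement that gradients are used during early denoising steps; the specific cutoff and scale here are arbitrary.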
