Compositional Video Generation via Inference-Time Guidance

Ariel Shaulov; Eitan Shaar; Amit Edenzon; Gal Chechik; Lior Wolf

arXiv:2605.14988 · cs.CV · May 15, 2026

TL;DR

This paper introduces CVG, a method that improves compositional accuracy in text-to-video diffusion models by guiding the denoising process with internal attention signals, without retraining the generator.

Contribution

CVG leverages internal cross-attention maps and a lightweight classifier to steer video generation towards better compositional fidelity during inference.
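To make the notion of a cross-attention map concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names (`cross_attention_maps`, `token_map`), the toy vectors, and the single-head, projection-free setup are all assumptions chosen for illustration. Each latent position attends over the prompt tokens, and reading one token's column across positions gives a map of where that concept is grounded.

```python
import math

def cross_attention_maps(queries, keys):
    """Softmax attention of each latent position over the prompt tokens.

    queries: one vector per spatio-temporal latent position (hypothetical toy data).
    keys:    one vector per prompt token (hypothetical toy data).
    Returns a list with one row per position; each row sums to 1.
    """
    d = len(keys[0])
    maps = []
    for q in queries:
        # Scaled dot-product score between this position and every token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Subtract the max before exponentiating for numerical stability.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        maps.append([e / total for e in exps])
    return maps

def token_map(maps, token_index):
    """Attention weight given to one prompt token at every position."""
    return [row[token_index] for row in maps]

# Toy example: three latent positions, two prompt tokens.
queries = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
maps = cross_attention_maps(queries, keys)
first_token_map = token_map(maps, 0)
```

In a real text-to-video model these maps would come from the generator's own attention layers across frames; a classifier such as the one described above would take features derived from them as input.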

Findings

01. Enhanced compositional faithfulness in generated videos.

02. Maintained visual quality while improving prompt alignment.

03. Transferability of the classifier across related composition labels.

Abstract

Text-to-video diffusion models generate realistic videos, but often fail on prompts requiring fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion directions. We hypothesize that these failures need not be addressed by retraining the generator, but can instead be mitigated by steering the denoising process using the model's own internal grounding signals. We propose CVG, an inference-time guidance method for improving compositional faithfulness in frozen text-to-video models. Our key observation is that cross-attention maps already encode how prompt concepts are grounded across space and time. We train a lightweight compositional classifier on these attention features and use its gradients during early denoising steps to steer the latent trajectory toward the desired composition. Built on a frozen VLM backbone, the…
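The general idea of steering early denoising steps with classifier gradients can be sketched as follows. This is a toy illustration under heavy assumptions, not CVG itself: the "classifier" is a hand-written logistic model applied directly to the latent (standing in for a classifier over attention features), `toy_denoise_step` is a placeholder for the frozen generator's update, and the parameters `guided_steps` and `scale` are hypothetical names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def classifier_prob(z, w, b):
    """Probability that latent z has the desired composition (toy logistic model)."""
    return sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)

def grad_log_prob(z, w, b):
    """Analytic gradient of log p(desired | z) with respect to z."""
    p = classifier_prob(z, w, b)
    return [(1.0 - p) * wi for wi in w]

def toy_denoise_step(z):
    """Placeholder for the frozen generator's denoising update (assumption)."""
    return [0.9 * zi for zi in z]

def guided_sampling(z, w, b, num_steps=10, guided_steps=4, scale=0.5):
    """Run the toy denoising loop, nudging z along the classifier gradient
    only during the first `guided_steps` steps."""
    for step in range(num_steps):
        if step < guided_steps:
            g = grad_log_prob(z, w, b)
            z = [zi + scale * gi for zi, gi in zip(z, g)]
        z = toy_denoise_step(z)
    return z

# Compare the same starting latent with and without early guidance.
z0 = [-1.0, 0.5]
w, b = [1.0, -1.0], 0.0
unguided = guided_sampling(z0, w, b, guided_steps=0)
guided = guided_sampling(z0, w, b, guided_steps=4)
```

In this toy setting the guided latent ends up with a higher classifier probability than the unguided one, while the generator update itself is left untouched. The real method would obtain gradients by backpropagating through the classifier and the model's attention layers rather than from a closed-form expression.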
