AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation
Yexin Liu, Wen-Jie Shu, Zile Huang, Haoze Zheng, Yueze Wang, Manyuan Zhang, Ser-Nam Lim, Harry Yang

TL;DR
AlignVid is a training-free method that improves semantic fidelity in text-guided image-to-video generation by reweighting attention maps, effectively handling complex prompt transformations without significant aesthetic loss.
Contribution
The paper introduces AlignVid, a novel training-free framework with attention scaling and guidance scheduling to enhance semantic adherence in TI2V generation, addressing semantic negligence issues.
Findings
AlignVid improves prompt adherence in TI2V tasks.
Attention scaling reduces semantic negligence without aesthetic loss.
Extensive experiments validate the effectiveness of AlignVid.
Abstract
Text-guided image-to-video (TI2V) generation has recently achieved remarkable progress, particularly in maintaining subject consistency and temporal coherence. However, existing methods still struggle to adhere to fine-grained prompt semantics, especially when prompts entail substantial transformations of the input image (e.g., object addition, deletion, or modification), a shortcoming we term semantic negligence. In a pilot study, we find that applying a Gaussian blur to the input image improves semantic adherence. Analyzing attention maps, we observe clearer foreground-background separation. From an energy perspective, this corresponds to a lower-entropy cross-attention distribution. Motivated by this, we introduce AlignVid, a training-free framework with two components: (i) Attention Scaling Modulation (ASM), which directly reweights attention via lightweight Q or K scaling, and (ii)…
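The abstract describes ASM as reweighting attention through lightweight Q or K scaling and links the benefit to a lower-entropy attention distribution. The following is a minimal NumPy sketch of that general idea, not the authors' implementation: the function names, the `gamma` parameter, and the random toy tensors are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(q, k, gamma=1.0):
    # Scaled dot-product attention weights with the queries multiplied by
    # gamma (hypothetical name; gamma > 1 sharpens the distribution).
    d = q.shape[-1]
    logits = (gamma * q) @ k.T / np.sqrt(d)
    return softmax(logits, axis=-1)

def mean_entropy(p, eps=1e-12):
    # Shannon entropy of each attention row, averaged over queries.
    return float(-(p * np.log(p + eps)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 16))   # 4 toy query tokens
k = rng.normal(size=(8, 16))   # 8 toy key tokens

base = attention_weights(q, k, gamma=1.0)
sharp = attention_weights(q, k, gamma=1.5)
print("entropy at gamma=1.0:", mean_entropy(base))
print("entropy at gamma=1.5:", mean_entropy(sharp))
```

Scaling K instead of Q by the same factor gives identical logits, since the factor simply multiplies every dot product. How the paper chooses the factor, and which layers or tokens it applies to, is not stated in the text above.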
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
* Theoretically Motivated Approach: The authors provide an excellent theoretical foundation, analyzing attention maps to observe clear foreground–background separation. They effectively link this empirical finding to an energy perspective, correctly identifying the mechanism as corresponding to a desirable lower-entropy attention distribution.
* Excellent Writing and Clarity: The paper is well-written, professional, and easy to follow.
* Extensive Validation Across Baselines: The effectiveness
* Lack of Established Evaluation Benchmarks: The proposed AlignVid method is exclusively validated using the proposed OmitI2V benchmark. Including evaluation results on established public benchmarks would significantly strengthen the paper's claims and demonstrate generalizability. For example, the ViCLIP metric [1], which assesses video-text semantic alignment, would be particularly useful to include.
* Lack of Generalization Experiments: The proposed AlignVid approach sounds like a general-pu
1. The paper clearly formalizes a critical and prevalent weakness in current TI2V models. The accompanying OmitI2V benchmark is a valuable and necessary tool for the community, as existing benchmarks do not adequately measure this specific failure mode.
2. The paper provides a solid theoretical motivation for why ASM works. Section 4, which links Q/K scaling to inverse temperature control of the softmax and proves its monotonic effect on attention entropy, elevates the method beyond a simple heu
1. The primary metric for semantic fidelity hinges on the performance of a VQA model (Qwen2.5-VL-32B). This introduces a potential point of failure, as the evaluation model may have its own biases or
2. The Block-level Guidance Scheduling (BGS) requires a one-time calibration step that involves using external models (PCA and SAM2) to identify "foreground-sensitive" blocks. This adds a layer of complexity and an external dependency compared to a fully self-contained method. The sensitivity of th
- The paper identifies a common problem in TI2V: the model's tendency to ignore prompts that require significant edits, such as adding or removing an object. The initial pilot study, which shows that blurring the input image can improve results, is a simple but very effective way to motivate the investigation into attention mechanisms.
- The proposed method, AlignVid, is simple and practical. It doesn't require any model retraining, making it easy to apply to existing models.
- While I have some reservations ab
- The theoretical analysis in Section 4 feels generic and not well-connected to the specific problem the paper aims to solve. The theory explains that scaling attention logits is like temperature scaling, which reduces entropy. However, it doesn't explain why this helps with the specific TI2V problem of balancing an input image with a text prompt. The theory is for a general DiT, but the problem is about a conditioned generation task, and the link between the two is not convincingly made.
- The
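The temperature-scaling link discussed in these reviews can be made concrete with a standard softmax identity. The sketch below is not the paper's own proof; the symbols $z_i$ (attention logits for one query), $\beta$ (the scaling factor, acting as an inverse temperature), and $H$ (entropy of the attention row) are notation assumed here for illustration.

```latex
p_i(\beta) = \frac{e^{\beta z_i}}{\sum_j e^{\beta z_j}}, \qquad
H(\beta) = -\sum_i p_i \log p_i
         = \log \sum_j e^{\beta z_j} - \beta\, \mathbb{E}_{p}[z]

\frac{dH}{d\beta} = -\beta\, \mathrm{Var}_{p}(z) \le 0
\quad \text{for } \beta > 0
```

Multiplying Q or K by a factor multiplies every logit by that factor, so it plays the role of $\beta$. Entropy therefore cannot increase as the factor grows, and it strictly decreases unless all logits in the row are equal. As the review above notes, this identity holds for any softmax attention and does not by itself explain the TI2V-specific benefit.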
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection · Image Enhancement Techniques
