AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation

Yexin Liu; Wen-Jie Shu; Zile Huang; Haoze Zheng; Yueze Wang; Manyuan Zhang; Ser-Nam Lim; Harry Yang

arXiv:2512.01334·cs.CV·December 2, 2025


Open Access · 3 Reviews

TL;DR

AlignVid is a training-free method that improves semantic fidelity in text-guided image-to-video generation by reweighting attention maps, effectively handling complex prompt transformations without significant aesthetic loss.

Contribution

The paper introduces AlignVid, a novel training-free framework with attention scaling and guidance scheduling to enhance semantic adherence in TI2V generation, addressing semantic negligence issues.

Findings

01. AlignVid improves prompt adherence in TI2V tasks.

02. Attention scaling reduces semantic negligence without aesthetic loss.

03. Extensive experiments validate the effectiveness of AlignVid.

Abstract

Text-guided image-to-video (TI2V) generation has recently achieved remarkable progress, particularly in maintaining subject consistency and temporal coherence. However, existing methods still struggle to adhere to fine-grained prompt semantics, especially when prompts entail substantial transformations of the input image (e.g., object addition, deletion, or modification), a shortcoming we term semantic negligence. In a pilot study, we find that applying a Gaussian blur to the input image improves semantic adherence. Analyzing attention maps, we observe clearer foreground-background separation. From an energy perspective, this corresponds to a lower-entropy cross-attention distribution. Motivated by this, we introduce AlignVid, a training-free framework with two components: (i) Attention Scaling Modulation (ASM), which directly reweights attention via lightweight Q or K scaling, and (ii)…
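The link between Q or K scaling and a lower-entropy attention distribution can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes that ASM amounts to multiplying the query by a scalar factor before the softmax, and the names `attention_weights` and `gamma`, as well as the query and key vectors, are hypothetical values chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def attention_weights(query, keys, gamma=1.0):
    """Scaled dot-product attention weights for a single query.

    `gamma` is a hypothetical scaling factor applied to the query.
    Scaling Q by gamma multiplies every logit by gamma, which is the
    same as using an inverse temperature of gamma in the softmax.
    """
    d = len(query)
    scaled_query = [gamma * q for q in query]
    logits = [
        sum(q * k for q, k in zip(scaled_query, key)) / math.sqrt(d)
        for key in keys
    ]
    return softmax(logits)


# Toy data (assumed values, not taken from the paper).
query = [1.0, 0.5, -0.5, 0.2]
keys = [
    [0.9, 0.4, -0.6, 0.1],
    [0.1, -0.3, 0.8, 0.0],
    [-0.5, 0.2, 0.1, 0.7],
    [0.3, 0.3, 0.3, 0.3],
]

baseline = attention_weights(query, keys, gamma=1.0)
sharpened = attention_weights(query, keys, gamma=2.0)

print("entropy at gamma=1.0:", entropy(baseline))
print("entropy at gamma=2.0:", entropy(sharpened))
```

With `gamma` greater than 1 the weights concentrate on the best-matching key, so the printed entropy for the scaled case is lower than the baseline. Scaling the keys instead of the query would give the same logits in this toy setting.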

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

* Theoretically Motivated Approach: The authors provide an excellent theoretical foundation, analyzing attention maps to observe clear foreground–background separation. They effectively link this empirical finding to an energy perspective, correctly identifying the mechanism as corresponding to a desirable lower-entropy attention distribution.
* Excellent Writing and Clarity: The paper is well-written, professional, and easy to follow.
* Extensive Validation Across Baselines: The effectiveness

Weaknesses

* Lack of Established Evaluation Benchmarks: The proposed AlignVid method is exclusively validated using the proposed OmitI2V benchmark. Including evaluation results on established public benchmarks would significantly strengthen the paper's claims and demonstrate generalizability. For example, the ViCLIP metric [1], which assesses video-text semantic alignment, would be particularly useful to include.
* Lack of Generalization Experiments: The proposed AlignVid approach sounds like a general-pu

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The paper clearly formalizes a critical and prevalent weakness in current TI2V models. The accompanying OmitI2V benchmark is a valuable and necessary tool for the community, as existing benchmarks do not adequately measure this specific failure mode.
2. The paper provides a solid theoretical motivation for why ASM works. Section 4, which links Q/K scaling to inverse temperature control of the softmax and proves its monotonic effect on attention entropy, elevates the method beyond a simple heu
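The inverse-temperature view mentioned in this review can be made concrete with a short worked derivation. This is the standard identity for a softmax with inverse temperature, written under assumed notation ($s_i$ for the attention logits, $\gamma$ for a scalar applied to Q or K); it is not necessarily the form of the proof given in the paper's Section 4.

```latex
% Assumed notation: q is a query, k_i the i-th key, d the head dimension,
% and \gamma \ge 0 a hypothetical scalar applied to Q (or K).
p_i(\gamma) = \frac{\exp(\gamma s_i)}{Z(\gamma)}, \qquad
Z(\gamma) = \sum_j \exp(\gamma s_j), \qquad
s_i = \frac{q^\top k_i}{\sqrt{d}}

% Entropy of the attention distribution.
H(\gamma) = -\sum_i p_i(\gamma) \log p_i(\gamma)
          = \log Z(\gamma) - \gamma \, \mathbb{E}_{p(\gamma)}[s]

% Two standard identities for the log-partition function.
\frac{d}{d\gamma} \log Z(\gamma) = \mathbb{E}_{p(\gamma)}[s], \qquad
\frac{d}{d\gamma} \mathbb{E}_{p(\gamma)}[s] = \mathrm{Var}_{p(\gamma)}[s]

% Differentiating H and substituting the identities.
\frac{dH}{d\gamma}
  = \mathbb{E}_{p(\gamma)}[s] - \mathbb{E}_{p(\gamma)}[s] - \gamma \, \mathrm{Var}_{p(\gamma)}[s]
  = -\gamma \, \mathrm{Var}_{p(\gamma)}[s] \le 0 \quad \text{for } \gamma \ge 0
```

Under these assumptions the entropy is non-increasing in $\gamma$, and strictly decreasing whenever $\gamma > 0$ and the logits are not all equal.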

Weaknesses

1. The primary metric for semantic fidelity hinges on the performance of a VQA model (Qwen2.5-VL-32B). This introduces a potential point of failure, as the evaluation model may have its own biases or
2. The Block-level Guidance Scheduling (BGS) requires a one-time calibration step that involves using external models (PCA and SAM2) to identify "foreground-sensitive" blocks. This adds a layer of complexity and an external dependency compared to a fully self-contained method. The sensitivity of th

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The paper identifies a common problem in TI2V: the model's tendency to ignore prompts that require significant edits, such as adding or removing an object. The initial pilot study, which shows that blurring an image can improve results, is a simple but very effective way to motivate the investigation into attention mechanisms.
- The proposed method, AlignVid, is simple and practical. It doesn't require any model retraining, making it easy to apply to existing models.
- While I have some reservations ab

Weaknesses

- The theoretical analysis in Section 4 feels generic and not well-connected to the specific problem the paper aims to solve. The theory explains that scaling attention logits is like temperature scaling, which reduces entropy. However, it doesn't explain why this helps with the specific TI2V problem of balancing an input image with a text prompt. The theory is for a general DiT, but the problem is about a conditioned generation task, and the link between the two is not convincingly made.
- The


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection · Image Enhancement Techniques