Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models

Yuanyang Yin; Yufan Deng; Shenghai Yuan; Kaipeng Zhang; Xiao Yang; Feng Zhao

arXiv:2601.07287·cs.CV·January 13, 2026

Open Access

TL;DR

This paper introduces Focal Guidance, a method that improves controllability and text adherence in video diffusion models by strengthening Semantic-Weak Layers through region anchoring and attention transfer. The method is validated on a new benchmark.

Contribution

The paper identifies Semantic-Weak Layers in Diffusion Transformer (DiT)-based I2V models and proposes Focal Guidance to strengthen their semantic responses, improving instruction following in video generation.

Findings

01. Focal Guidance increases instruction adherence scores by up to 7.44%.

02. It effectively couples visual and textual guidance in diffusion models.

03. The introduced benchmark assesses instruction following in I2V models.

Abstract

The task of Image-to-Video (I2V) generation aims to synthesize a video from a reference image and a text prompt. This requires diffusion models to reconcile high-frequency visual constraints and low-frequency textual guidance during the denoising process. However, while existing I2V models prioritize visual consistency, how to effectively couple this dual guidance to ensure strong adherence to the text prompt remains underexplored. In this work, we observe that in Diffusion Transformer (DiT)-based I2V models, certain intermediate layers exhibit weak semantic responses (termed Semantic-Weak Layers), as indicated by a measurable drop in text-visual similarity. We attribute this to a phenomenon called Condition Isolation, where attention to visual features becomes partially detached from text guidance and overly relies on learned visual priors. To address this, we propose Focal Guidance…
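
The layer-wise diagnosis described in the abstract (a drop in text-visual similarity at certain intermediate layers) can be illustrated with a small sketch. This is not the authors' metric or code: the mean pooling, the cosine similarity, the median-based threshold, and the names `layer_similarity`, `find_weak_layers`, and `drop_ratio` are all assumptions made for illustration.

```python
import math
from statistics import median


def mean_pool(tokens):
    """Average a list of token vectors into one vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def layer_similarity(visual_tokens, text_tokens):
    """Hypothetical per-layer score: cosine similarity of the
    mean-pooled visual and text features taken from one layer."""
    return cosine(mean_pool(visual_tokens), mean_pool(text_tokens))


def find_weak_layers(similarities, drop_ratio=0.5):
    """Flag layers whose score falls below drop_ratio times the median
    score across layers. The threshold rule is an assumption."""
    threshold = drop_ratio * median(similarities)
    return [i for i, s in enumerate(similarities) if s < threshold]


# Toy per-layer scores (made-up numbers, not results from the paper).
scores = [0.8, 0.75, 0.2, 0.78, 0.1, 0.8]
weak = find_weak_layers(scores)
```

In practice the per-layer features would come from hooks on the DiT blocks of an I2V model, and the choice of pooling and threshold would need to follow the paper's own definition of text-visual similarity.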

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Multimodal Machine Learning Applications