LiON-LoRA: Rethinking LoRA Fusion to Unify Controllable Spatial and Temporal Generation for Video Diffusion
Yisu Zhang, Chenjie Cao, Chaohui Yu, Jianke Zhu

TL;DR
LiON-LoRA introduces a novel LoRA fusion framework that enhances controllability in video diffusion models, enabling precise spatial and temporal video generation with minimal data by leveraging orthogonality, norm consistency, and a controllable token.
Contribution
The paper proposes LiON-LoRA, a new LoRA fusion approach that improves control over spatial and temporal aspects in video diffusion models through three core principles and a modified self-attention mechanism.
Findings
Outperforms state-of-the-art methods in trajectory control accuracy
Achieves better motion strength adjustment
Demonstrates strong generalization with limited training data
Abstract
Video Diffusion Models (VDMs) have demonstrated remarkable capabilities in synthesizing realistic videos by learning from large-scale data. Although vanilla Low-Rank Adaptation (LoRA) can learn specific spatial or temporal movements to drive VDMs with constrained data, achieving precise control over both camera trajectories and object motion remains challenging due to unstable fusion and non-linear scalability. To address these issues, we propose LiON-LoRA, a novel framework that rethinks LoRA fusion through three core principles: Linear scalability, Orthogonality, and Norm consistency. First, we analyze the orthogonality of LoRA features in shallow VDM layers, enabling decoupled low-level controllability. Second, norm consistency is enforced across layers to stabilize fusion during complex camera motion combinations. Third, a controllable token is integrated into the diffusion…
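As a rough illustration of the terms used in the abstract, the sketch below shows vanilla additive LoRA fusion on a single weight matrix, together with a simple orthogonality measure and a norm-matching step. It is not the authors' implementation: the helper names (lora_delta, cosine_overlap, match_norm, fuse) are hypothetical, random matrices stand in for trained LoRA updates, and the Frobenius-norm rescaling is only one assumed way to make two updates "norm consistent".

```python
import numpy as np

def lora_delta(rank, d_out, d_in, rng):
    """Random low-rank update B @ A (a stand-in for a trained LoRA; assumption)."""
    B = rng.standard_normal((d_out, rank))
    A = rng.standard_normal((rank, d_in))
    return B @ A

def cosine_overlap(d1, d2):
    """Frobenius cosine similarity between two updates; values near 0 mean near-orthogonal."""
    return float(np.sum(d1 * d2) / (np.linalg.norm(d1) * np.linalg.norm(d2)))

def match_norm(delta, target_norm):
    """Rescale delta so that its Frobenius norm equals target_norm."""
    return delta * (target_norm / np.linalg.norm(delta))

def fuse(W, deltas, scales):
    """Vanilla additive fusion: W + sum_i s_i * delta_i."""
    return W + sum(s * d for s, d in zip(scales, deltas))

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))          # base weight (hypothetical layer)
d_cam = lora_delta(4, 64, 64, rng)         # e.g. a camera-motion LoRA
d_obj = 5.0 * lora_delta(4, 64, 64, rng)   # e.g. an object-motion LoRA with a larger norm

# Bring both updates to the same Frobenius norm before fusing.
target = np.linalg.norm(d_cam)
d_obj_n = match_norm(d_obj, target)

# Fuse with per-LoRA strengths; in this additive form the weight change is linear in each scale.
fused = fuse(W, [d_cam, d_obj_n], [0.5, 1.0])
overlap = cosine_overlap(d_cam, d_obj_n)
```

In this toy setting the fused weight changes linearly with each scale, but that says nothing about whether the generated motion scales linearly, which is the harder problem the abstract refers to.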
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
