LiON-LoRA: Rethinking LoRA Fusion to Unify Controllable Spatial and Temporal Generation for Video Diffusion

Yisu Zhang; Chenjie Cao; Chaohui Yu; Jianke Zhu

arXiv:2507.05678 · cs.CV · July 9, 2025

TL;DR

LiON-LoRA introduces a novel LoRA fusion framework that enhances controllability in video diffusion models, enabling precise spatial and temporal video generation with minimal data by leveraging orthogonality, norm consistency, and a controllable token.

Contribution

The paper proposes LiON-LoRA, a new LoRA fusion approach that improves control over spatial and temporal aspects in video diffusion models through three core principles and a modified self-attention mechanism.

Findings

01. Outperforms state-of-the-art methods in trajectory control accuracy

02. Achieves better motion strength adjustment

03. Demonstrates strong generalization with limited training data

Abstract

Video Diffusion Models (VDMs) have demonstrated remarkable capabilities in synthesizing realistic videos by learning from large-scale data. Although vanilla Low-Rank Adaptation (LoRA) can learn specific spatial or temporal movements to drive VDMs with limited data, achieving precise control over both camera trajectories and object motion remains challenging due to unstable fusion and non-linear scalability. To address these issues, we propose LiON-LoRA, a novel framework that rethinks LoRA fusion through three core principles: Linear scalability, Orthogonality, and Norm consistency. First, we analyze the orthogonality of LoRA features in shallow VDM layers, enabling decoupled low-level controllability. Second, norm consistency is enforced across layers to stabilize fusion during complex camera motion combinations. Third, a controllable token is integrated into the diffusion…
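
The fusion principles named in the abstract can be made concrete with a toy example. What follows is a minimal sketch under stated assumptions, not the authors' implementation: it treats each LoRA as a weight update B·A, measures how orthogonal two updates are with a Frobenius cosine similarity, rescales both to a common norm, and adds them to a base weight with linear scales. The adapter names (`camera`, `motion`), the choice of the Frobenius norm, the target norm of 1.0, and all helper functions are hypothetical and chosen only for illustration; the abstract does not specify how LiON-LoRA enforces these properties.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def frob_inner(a, b):
    """Frobenius inner product: sum of elementwise products."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def frob_norm(a):
    return math.sqrt(frob_inner(a, a))

def cosine(a, b):
    """Cosine similarity between two updates; 0.0 means orthogonal."""
    return frob_inner(a, b) / (frob_norm(a) * frob_norm(b))

def rescale(a, target_norm):
    """Scale an update so its Frobenius norm equals target_norm (assumed scheme)."""
    factor = target_norm / frob_norm(a)
    return [[factor * x for x in row] for row in a]

def fuse(base, deltas, scales):
    """Return base + sum_i scales[i] * deltas[i] without mutating base."""
    fused = [row[:] for row in base]
    for delta, s in zip(deltas, scales):
        for i, row in enumerate(delta):
            for j, x in enumerate(row):
                fused[i][j] += s * x
    return fused

# Hypothetical rank-1 adapters for a 2x2 layer (toy values, not from the paper).
delta_camera = matmul([[1.0], [0.0]], [[2.0, 0.0]])   # B_camera @ A_camera
delta_motion = matmul([[0.0], [1.0]], [[0.0, 0.5]])   # B_motion @ A_motion

overlap = cosine(delta_camera, delta_motion)

# Bring both updates to the same norm so neither dominates the fusion.
camera_n = rescale(delta_camera, 1.0)
motion_n = rescale(delta_motion, 1.0)

base = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse(base, [camera_n, motion_n], [0.5, 0.25])

print("overlap:", overlap)
print("fused:", fused)
```

In this toy setup the two updates touch disjoint entries, so their overlap is zero and each scale changes only its own direction in the fused weight. Real adapters would generally need the overlap measured per layer rather than assumed.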

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion