MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation
Run Ling, Ke Cao, Jian Lu, Ao Ma, Haowei Liu, Runze He, Changwei Wang, Rongtao Xu, Yihua Shao, Zhanjie Zhang, Peng Wu, Guibing Guo, Wei Feng, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Xingwei Wang

TL;DR
This paper introduces MoFu, a novel framework for multi-subject video generation that ensures scale consistency and permutation invariance by using scale-aware modulation and Fourier fusion, improving visual fidelity and naturalness.
Contribution
MoFu is the first unified approach addressing scale inconsistency and permutation sensitivity in multi-subject video synthesis, incorporating LLM-guided modulation and Fourier-based feature fusion.
Findings
MoFu outperforms existing methods in preserving subject scale and fidelity.
The Scale-Permutation Stability Loss enhances generation consistency.
Extensive experiments validate the effectiveness of MoFu on a new benchmark.
Abstract
Multi-subject video generation aims to synthesize videos from textual prompts and multiple reference images, ensuring that each subject preserves natural scale and visual fidelity. However, current methods face two challenges: scale inconsistency, where variations in subject size lead to unnatural generation, and permutation sensitivity, where the order of reference inputs causes subject distortion. In this paper, we propose MoFu, a unified framework that tackles both challenges. For scale inconsistency, we introduce Scale-Aware Modulation (SMO), an LLM-guided module that extracts implicit scale cues from the prompt and modulates features to ensure consistent subject sizes. To address permutation sensitivity, we present a simple yet effective Fourier Fusion strategy that processes the frequency information of reference features via the Fast Fourier Transform to produce a unified…
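The permutation-invariance idea behind Fourier Fusion can be illustrated with a small sketch. This is not the paper's implementation: the abstract is truncated before the details, so the function name `fourier_fuse`, the 2D feature-map shape, and the choice to average amplitudes and take the phase of the summed spectrum are all assumptions made for illustration. The only point shown is that aggregating spectra with order-independent operations (mean, sum) gives a fused feature that does not depend on reference order.

```python
import numpy as np


def fourier_fuse(ref_feats: np.ndarray) -> np.ndarray:
    """Fuse reference feature maps in the frequency domain (hypothetical sketch).

    ref_feats: real-valued array of shape (num_refs, H, W).
    Returns one fused real-valued map of shape (H, W).
    """
    # Per-reference 2D spectrum over the spatial axes.
    spec = np.fft.fft2(ref_feats, axes=(-2, -1))

    # Assumed aggregation: both reductions run over the reference axis and
    # are symmetric, so reordering the references cannot change the result.
    amplitude = np.abs(spec).mean(axis=0)
    phase = np.angle(spec.sum(axis=0))

    # Recombine and return to the spatial domain; the imaginary residue is
    # numerical noise because the inputs are real-valued.
    fused_spec = amplitude * np.exp(1j * phase)
    return np.fft.ifft2(fused_spec).real


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((3, 8, 8))   # three toy reference features
    shuffled = feats[[2, 0, 1]]              # same references, new order
    print(np.allclose(fourier_fuse(feats), fourier_fuse(shuffled)))
```

By contrast, concatenating reference features along a sequence axis makes the result depend on input order, which is the sensitivity the abstract describes.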
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Video Analysis and Summarization
