MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation

Run Ling; Ke Cao; Jian Lu; Ao Ma; Haowei Liu; Runze He; Changwei Wang; Rongtao Xu; Yihua Shao; Zhanjie Zhang; Peng Wu; Guibing Guo; Wei Feng; Zheng Zhang; Jingjing Lv; Junjie Shen; Ching Law; Xingwei Wang

arXiv:2512.22310 · cs.CV · December 30, 2025

TL;DR

This paper introduces MoFu, a framework for multi-subject video generation that targets two failure modes: scale inconsistency, handled by LLM-guided Scale-Aware Modulation (SMO), and permutation sensitivity, handled by FFT-based Fourier Fusion. The goal is for each subject to keep a natural scale and high visual fidelity.

Contribution

MoFu is described as the first unified approach to both scale inconsistency and permutation sensitivity in multi-subject video synthesis, combining LLM-guided Scale-Aware Modulation with Fourier-based fusion of reference features.

Findings

01

MoFu outperforms existing methods in preserving subject scale and fidelity.

02

The Scale-Permutation Stability Loss enhances generation consistency.

03

Extensive experiments validate the effectiveness of MoFu on a new benchmark.

Abstract

Multi-subject video generation aims to synthesize videos from textual prompts and multiple reference images, ensuring that each subject preserves natural scale and visual fidelity. However, current methods face two challenges: scale inconsistency, where variations in subject size lead to unnatural generation, and permutation sensitivity, where the order of reference inputs causes subject distortion. In this paper, we propose MoFu, a unified framework that tackles both challenges. For scale inconsistency, we introduce Scale-Aware Modulation (SMO), an LLM-guided module that extracts implicit scale cues from the prompt and modulates features to ensure consistent subject sizes. To address permutation sensitivity, we present a simple yet effective Fourier Fusion strategy that processes the frequency information of reference features via the Fast Fourier Transform to produce a unified…
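The abstract is cut off before it gives the details of Fourier Fusion, so the following is only a minimal sketch of the general idea it describes: moving reference features into the frequency domain with the FFT and combining them so that the result does not depend on input order. The function name `fourier_fuse`, the feature layout `(num_refs, H, W, C)`, and the fusion rule (mean amplitude combined with the phase of the summed spectrum) are all assumptions chosen for illustration, not the authors' method.

```python
import numpy as np


def fourier_fuse(ref_feats: np.ndarray) -> np.ndarray:
    """Fuse reference features in the frequency domain (hypothetical rule).

    ref_feats: real array of shape (num_refs, H, W, C).
    Returns a real array of shape (H, W, C) that is the same for any
    ordering of the references along axis 0.
    """
    # 2D FFT over the spatial axes of every reference feature map.
    spectra = np.fft.fft2(ref_feats, axes=(1, 2))

    # Assumed fusion rule: average the amplitudes, and take the phase of
    # the summed spectrum. Both reductions are symmetric in the
    # references, so their order cannot affect the result.
    amplitude = np.abs(spectra).mean(axis=0)
    phase = np.angle(spectra.sum(axis=0))
    fused_spectrum = amplitude * np.exp(1j * phase)

    # Back to the spatial domain; the imaginary part is numerical noise
    # for real-valued inputs, so it is discarded.
    return np.fft.ifft2(fused_spectrum, axes=(0, 1)).real


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    refs = rng.standard_normal((3, 8, 8, 4))   # three toy reference features
    fused = fourier_fuse(refs)
    shuffled = fourier_fuse(refs[[2, 0, 1]])   # same references, new order
    print(fused.shape, np.allclose(fused, shuffled))
```

Because the references only enter through a mean and a sum, any reordering gives the same fused representation up to floating-point error. A plain spatial concatenation, by contrast, would change with the order of the inputs, which is the sensitivity the abstract describes.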

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Video Analysis and Summarization