MoGen: A Unified Collaborative Framework for Controllable Multi-Object Image Generation

Yanfeng Li; Yue Sun; Keren Fu; Sio-Kei Im; Xiaoming Liu; Guangtao Zhai; Xiaohong Liu; and Tao Tan

arXiv:2601.05546 · cs.CV · January 12, 2026

Open Access

TL;DR

MoGen introduces a flexible, multi-object image generation framework that accurately aligns language descriptions with image regions and adaptively integrates multiple control signals for precise, customizable scene creation.

Contribution

The paper presents MoGen, a framework whose Regional Semantic Anchor (RSA) and AMG modules improve multi-object image generation through better text-to-region alignment, more flexible control, and greater scene consistency.

Findings

01 Outperforms existing methods in generation quality

02 Achieves more consistent object quantities

03 Provides superior control flexibility and fine-grained scene customization

Abstract

Existing multi-object image generation methods face difficulties in achieving precise alignment between localized image generation regions and their corresponding semantics based on language descriptions, frequently resulting in inconsistent object quantities and attribute aliasing. To mitigate this limitation, mainstream approaches typically rely on external control signals to explicitly constrain the spatial layout, local semantic and visual attributes of images. However, this strong dependency makes the input format rigid, rendering it incompatible with the heterogeneous resource conditions of users and diverse constraint requirements. To address these challenges, we propose MoGen, a user-friendly multi-object image generation method. First, we design a Regional Semantic Anchor (RSA) module that precisely anchors phrase units in language descriptions to their corresponding image…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Computer Graphics and Visualization Techniques