AnyPhoto: Multi-Person Identity Preserving Image Generation with ID Adaptive Modulation on Location Canvas

Longhui Yuan

arXiv:2603.14770·cs.CV·March 17, 2026

Open Access

TL;DR

AnyPhoto is a diffusion-transformer finetuning framework for multi-person image generation. It preserves each reference identity and places it at its specified location, while discouraging copy-paste shortcuts and retaining prompt-driven controllability.

Contribution

It introduces a new framework combining spatial grounding, identity-adaptive modulation, and identity-isolated attention for improved multi-person image synthesis.
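One way to picture identity-isolated attention is as a boolean attention mask. The sketch below is illustrative only and is not taken from the paper: the function name `identity_isolated_mask`, the per-token `group_ids` labelling, and the exact masking rule (tokens belonging to different identities may not attend to each other, while shared tokens stay fully visible) are all assumptions.

```python
from typing import List


def identity_isolated_mask(group_ids: List[int]) -> List[List[bool]]:
    """Build a boolean attention mask from per-token group labels.

    Hypothetical labelling: group_ids[i] == 0 marks a shared token
    (for example image or text tokens); group_ids[i] == k > 0 marks a
    token that belongs to identity k.

    mask[q][k] is True when query token q may attend to key token k.
    Assumed rule: tokens of two different identities never see each other.
    """
    n = len(group_ids)
    mask = [[True] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            gq, gk = group_ids[q], group_ids[k]
            # Block attention only across two distinct identities.
            if gq > 0 and gk > 0 and gq != gk:
                mask[q][k] = False
    return mask


# Two shared tokens, then two tokens for identity 1 and two for identity 2.
groups = [0, 0, 1, 1, 2, 2]
mask = identity_isolated_mask(groups)
```

In this sketch, identity 1 tokens can still attend to shared tokens and to each other, but never to identity 2 tokens, which is one plausible reading of "preventing cross-identity interference".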

Findings

1. Improves identity similarity in multi-person generation

2. Reduces copy-paste shortcuts significantly

3. Supports accurate prompt-driven stylization

Abstract

Multi-person identity-preserving generation requires binding multiple reference faces to specified locations under a text prompt. Strong identity/layout conditions often trigger copy-paste shortcuts and weaken prompt-driven controllability. We present AnyPhoto, a diffusion-transformer finetuning framework with (i) a RoPE-aligned location canvas plus location-aligned token pruning for spatial grounding, (ii) AdaLN-style identity-adaptive modulation from face-recognition embeddings for persistent identity injection, and (iii) identity-isolated attention to prevent cross-identity interference. Training combines conditional flow matching with an embedding-space face similarity loss, together with reference-face replacement and location-canvas degradations to discourage shortcuts. On MultiID-Bench, AnyPhoto improves identity similarity while reducing copy-paste tendency, with gains…
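The training objective named in the abstract, conditional flow matching plus an embedding-space face similarity loss, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the common linear interpolation path from noise `x0` to data `x1`, a cosine-based identity term, and a hypothetical weighting parameter `id_weight`. None of these specifics are confirmed by the abstract.

```python
import math
from typing import List

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Velocity of the straight path; constant in t under this assumption."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(v_pred: Vector, x0: Vector, x1: Vector) -> float:
    """Mean squared error between predicted and target velocity."""
    v = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v)) / len(v)


def face_similarity_loss(e_gen: Vector, e_ref: Vector) -> float:
    """1 - cosine similarity between generated and reference face embeddings."""
    dot = sum(a * b for a, b in zip(e_gen, e_ref))
    norm = math.sqrt(sum(a * a for a in e_gen)) * math.sqrt(sum(b * b for b in e_ref))
    return 1.0 - dot / norm


def total_loss(v_pred: Vector, x0: Vector, x1: Vector,
               e_gen: Vector, e_ref: Vector, id_weight: float = 0.1) -> float:
    """Weighted sum of both terms; id_weight is a hypothetical hyperparameter."""
    return flow_matching_loss(v_pred, x0, x1) + id_weight * face_similarity_loss(e_gen, e_ref)


# Toy values: a perfect velocity prediction and orthogonal face embeddings.
x0, x1 = [0.0, 0.0], [2.0, 4.0]
x_half = interpolate(x0, x1, 0.5)
loss = total_loss([2.0, 4.0], x0, x1, [1.0, 0.0], [0.0, 1.0], id_weight=0.1)
```

In a real model, the network would receive the interpolated sample, the timestep, and the conditions, and the face embeddings would come from a face-recognition encoder applied to the generated and reference faces.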

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Face Recognition and Perception