Learning Complex Non-Rigid Image Edits from Multimodal Conditioning
Nikolai Warner, Jack Kolb, Meera Hahn, Vighnesh Birodkar, Jonathan Huang, Irfan Essa

TL;DR
This paper introduces a method for realistic, controllable human image editing in scenes using multimodal conditioning, leveraging a new dataset and pose-aware training to improve identity preservation and interaction realism.
Contribution
It presents a novel dataset and training approach that combines multimodal LLM summaries and pose information for complex non-rigid image edits.
Findings
Improved identity preservation in wild scenes
Enhanced person-object interaction quality
Effective use of noisy captions with pose data
Abstract
In this paper we focus on inserting a given human (specifically, a single image of a person) into a novel scene. Our method, which builds on top of Stable Diffusion, yields natural-looking images while being highly controllable with text and pose. To accomplish this we need to train on pairs of images, the first a reference image with the person, the second a "target image" showing the same person (with a different pose and possibly in a different background). Additionally, we require a text caption describing the new pose relative to that in the reference image. In this paper we present a novel dataset following these criteria, which we create using pairs of frames from human-centric and action-rich videos and employing a multimodal LLM to automatically summarize the difference in human pose for the text captions. We demonstrate that identity preservation is a more challenging task in…
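The training data described in the abstract (a reference frame, a target frame from the same video, and a caption describing the pose change) could be organized roughly as in the sketch below. This is only an illustration under stated assumptions, not the authors' pipeline: the `EditExample` record, the `sample_frame_pairs` helper, and the `min_gap` parameter are all hypothetical, the target is assumed to come after the reference for simplicity, and the caption is left empty where a multimodal LLM summary would go.

```python
from dataclasses import dataclass
import random


@dataclass
class EditExample:
    """One hypothetical training example built from a single video."""
    video_id: str
    reference_frame: int  # frame index showing the person in the original pose
    target_frame: int     # later frame index showing the same person in a new pose
    caption: str          # pose-difference text; placeholder for an LLM summary


def sample_frame_pairs(video_id, num_frames, num_pairs, min_gap=8, seed=0):
    """Sample (reference, target) frame pairs at least `min_gap` frames apart.

    `min_gap` is an assumed knob meant to encourage a visible pose change;
    the abstract does not specify how frames are selected.
    """
    if num_frames <= min_gap:
        return []  # video too short to yield a pair with the requested gap
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        ref = rng.randrange(0, num_frames - min_gap)
        tgt = rng.randrange(ref + min_gap, num_frames)
        pairs.append(EditExample(video_id, ref, tgt, caption=""))
    return pairs
```

For example, `sample_frame_pairs("clip_001", num_frames=100, num_pairs=5)` returns five records whose captions would then be filled in by a captioning step.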
Taxonomy
Methods: Diffusion · Focus
