Learning Complex Non-Rigid Image Edits from Multimodal Conditioning

Nikolai Warner; Jack Kolb; Meera Hahn; Vighnesh Birodkar; Jonathan Huang; Irfan Essa

arXiv:2412.10219 · cs.CV · December 16, 2024

TL;DR

This paper introduces a method for realistic, controllable human image editing in scenes using multimodal conditioning, leveraging a new dataset and pose-aware training to improve identity preservation and interaction realism.

Contribution

It presents a novel dataset and training approach that combines multimodal LLM summaries and pose information for complex non-rigid image edits.

Findings

01. Improved identity preservation in wild scenes

02. Enhanced person-object interaction quality

03. Effective use of noisy captions with pose data

Abstract

In this paper we focus on inserting a given human (specifically, a single image of a person) into a novel scene. Our method, which builds on top of Stable Diffusion, yields natural-looking images while being highly controllable with text and pose. To accomplish this, we need to train on pairs of images: the first a reference image with the person, the second a "target image" showing the same person (with a different pose and possibly in a different background). Additionally, we require a text caption describing the new pose relative to that in the reference image. In this paper we present a novel dataset following these criteria, which we create using pairs of frames from human-centric and action-rich videos and employing a multimodal LLM to automatically summarize the difference in human pose for the text captions. We demonstrate that identity preservation is a more challenging task in…
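The training data described in the abstract (a reference frame, a target frame from the same video, and a caption for the pose change) can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' pipeline: the names `TrainingExample` and `sample_frame_pairs`, the `min_gap` parameter, and the uniform sampling strategy are all hypothetical, and the multimodal LLM captioning step is left as an unfilled field.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TrainingExample:
    """One (reference, target, caption) triplet; all field names are hypothetical."""
    reference_index: int            # frame used as the reference image of the person
    target_index: int               # later frame showing the same person in a new pose
    caption: Optional[str] = None   # pose-difference text, to be filled by a captioning model


def sample_frame_pairs(num_frames: int, num_pairs: int,
                       min_gap: int = 8, seed: int = 0) -> List[TrainingExample]:
    """Sample frame-index pairs from one video, at least `min_gap` frames apart.

    The minimum gap is an assumption made here so that the two frames are
    likely to differ in pose; the source does not specify how pairs are chosen.
    """
    if num_frames <= min_gap:
        raise ValueError("video is too short for the requested frame gap")
    rng = random.Random(seed)
    examples = []
    for _ in range(num_pairs):
        # Pick a reference frame that leaves room for a target min_gap frames later.
        ref = rng.randrange(0, num_frames - min_gap)
        tgt = rng.randrange(ref + min_gap, num_frames)
        examples.append(TrainingExample(reference_index=ref, target_index=tgt))
    return examples
```

In such a setup, each returned example would then be passed to a captioning model that describes how the pose in the target frame differs from the reference, and the resulting text would populate `caption`.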

Taxonomy

Methods: Diffusion · Focus