Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection
Gihyun Kwon, Jangho Park, Jong Chul Ye

TL;DR
This paper introduces a unified editing framework that leverages shared self-attention in a 2D diffusion model to enable consistent editing across panoramas, 3D scenes, and videos, simplifying multi-modal editing tasks.
Contribution
It proposes a novel self-attention injection method that unifies editing across multiple modalities using a single 2D text-to-image diffusion model.
Findings
Enables consistent editing of videos and 3D scenes using shared self-attention.
Supports editing of panoramic images with semantic consistency.
Demonstrates versatility across diverse visual modalities.
Abstract
While text-to-image models have achieved impressive capabilities in image generation and editing, applying them across different modalities often requires training separate models. Inspired by existing methods for single-image editing with self-attention injection and for video editing with shared attention, we propose a novel unified editing framework that combines the strengths of both approaches using only a basic 2D text-to-image (T2I) diffusion model. Specifically, we design a sampling method that facilitates editing consecutive images while maintaining semantic consistency by utilizing shared self-attention features during both the reference and consecutive image sampling processes. Experimental results confirm that our method enables editing across diverse modalities, including 3D scenes, videos, and panorama images.
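The abstract does not spell out how the shared self-attention is computed, so the following is only a minimal sketch of one common form of the idea, not the authors' implementation. It assumes that tokens of the image being edited attend over the keys and values of a reference image concatenated with their own. The function names (`attention`, `shared_self_attention`) and the toy token values are hypothetical and chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention on lists of token vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim_v = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim_v)])
    return out

def shared_self_attention(q_cur, k_cur, v_cur, k_ref, v_ref):
    """Assumed form of sharing: queries of the current image attend over
    the reference image's keys/values concatenated with its own."""
    return attention(q_cur, k_ref + k_cur, v_ref + v_cur)

# Toy tokens (hypothetical values): two tokens per image, two channels each.
q_cur = [[1.0, 0.0], [0.0, 1.0]]
k_cur = [[0.5, 0.1], [0.2, 0.9]]
v_cur = [[1.0, 2.0], [3.0, 4.0]]
k_ref = [[0.9, 0.0], [0.0, 0.8]]
v_ref = [[5.0, 6.0], [7.0, 8.0]]

out = shared_self_attention(q_cur, k_cur, v_cur, k_ref, v_ref)
```

Because each output token is a convex combination of reference and current values, features from the reference image can flow into every subsequent image, which is one plausible way that semantic consistency across frames or views could be encouraged.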
Taxonomy
Topics: Advanced Vision and Imaging · Computer Graphics and Visualization Techniques · Advanced Image Processing Techniques
Methods: Diffusion
