ASSET: Autoregressive Semantic Scene Editing with Transformers at High Resolutions
Difan Liu, Sandesh Shetty, Tobias Hinz, Matthew Fisher, Richard Zhang, Taesung Park, Evangelos Kalogerakis

TL;DR
ASSET introduces a transformer-based neural architecture that edits high-resolution images efficiently: it sparsifies the attention matrix at high resolution, guided by dense attention computed at lower resolution, which keeps edits realistic while preserving long-range interactions.
Contribution
The paper proposes a novel sparse attention mechanism for transformers that handles high-resolution images efficiently, improving scene editing capabilities.
Findings
Effective high-resolution image editing with scene consistency.
Captures long-range interactions and context, such as reflections of landscapes onto water.
Outperforms previous methods in qualitative and quantitative evaluations.
Abstract
We present ASSET, a neural architecture for automatically modifying an input high-resolution image according to a user's edits on its semantic segmentation map. Our architecture is based on a transformer with a novel attention mechanism. Our key idea is to sparsify the transformer's attention matrix at high resolutions, guided by dense attention extracted at lower image resolutions. While previous attention mechanisms are computationally too expensive for handling high-resolution images or are overly constrained within specific image regions hampering long-range interactions, our novel attention mechanism is both computationally efficient and effective. Our sparsified attention mechanism is able to capture long-range interactions and context, leading to synthesizing interesting phenomena in scenes, such as reflections of landscapes onto water or flora consistent with the rest of the…
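The key idea stated in the abstract, dense attention at low resolution guiding a sparsified attention matrix at high resolution, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a 1-D token sequence rather than a 2-D image, a fixed downsampling factor, and a simple top-k rule for choosing which low-resolution locations to keep. The names guided_sparse_attention, factor, and top_k are hypothetical. For clarity the sketch masks a full score matrix, so it shows the sparsity pattern but not the computational savings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention_weights(q, k):
    # Full (n, n) attention weights for (n, d) queries and keys.
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def guided_sparse_attention(q_hi, k_hi, v_hi, attn_lo, factor, top_k):
    """Toy 1-D guided attention (illustrative assumption, not the paper's code).

    High-res token i is assumed to map to low-res token i // factor.
    A high-res query may attend only to high-res keys whose low-res parent
    is among the top_k keys most attended by the query's low-res parent.
    """
    n_hi, d = q_hi.shape
    n_lo = attn_lo.shape[0]
    assert n_hi == n_lo * factor and 1 <= top_k <= n_lo

    # Indices of the top_k low-res keys for every low-res query.
    top_lo = np.argsort(-attn_lo, axis=1)[:, :top_k]

    # Boolean keep-mask at low resolution, then upsampled to high resolution.
    mask_lo = np.zeros((n_lo, n_lo), dtype=bool)
    np.put_along_axis(mask_lo, top_lo, True, axis=1)
    mask_hi = np.repeat(np.repeat(mask_lo, factor, axis=0), factor, axis=1)

    # Masked attention: disallowed pairs get -inf before the softmax.
    scores = q_hi @ k_hi.T / np.sqrt(d)
    weights = softmax(np.where(mask_hi, scores, -np.inf))
    return weights @ v_hi, mask_hi

# Usage with random features (purely synthetic data).
rng = np.random.default_rng(0)
n_lo, factor, d = 8, 4, 16
n_hi = n_lo * factor
q_lo, k_lo = rng.normal(size=(n_lo, d)), rng.normal(size=(n_lo, d))
q_hi, k_hi, v_hi = (rng.normal(size=(n_hi, d)) for _ in range(3))

attn_lo = dense_attention_weights(q_lo, k_lo)
out, mask = guided_sparse_attention(q_hi, k_hi, v_hi, attn_lo, factor, top_k=2)
```

In this sketch each high-resolution query keeps top_k * factor keys out of n_hi, and the kept keys can lie anywhere in the sequence, which is how the guidance leaves room for long-range interactions. Setting top_k equal to the number of low-resolution tokens recovers ordinary dense attention.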
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis
