VENUS: Visual Editing with Noise Inversion Using Scene Graphs
Thanh-Nhan Vo, Trong-Thuan Nguyen, Tam V. Nguyen, and Minh-Triet Tran

TL;DR
VENUS is a training-free, scene graph-guided image editing framework that combines noise inversion with multimodal large language models to improve fidelity, semantic consistency, and efficiency.
Contribution
VENUS introduces a novel, training-free approach for scene graph-guided image editing that combines noise inversion with multimodal large language models, enhancing controllability and scalability.
Findings
Improves background preservation and semantic alignment on PIE-Bench.
Reduces per-image runtime from 6-10 minutes to 20-30 seconds.
Outperforms state-of-the-art scene graph and text-based editing methods.
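The runtime finding can be read as a speedup factor. The short sketch below works out the range implied by the two reported intervals (6-10 minutes before, 20-30 seconds after). The helper name speedup_range is hypothetical and only illustrates the arithmetic; it assumes the two intervals can be compared end to end, which the summary does not state explicitly.

```python
def speedup_range(before_s, after_s):
    """Return (min, max) speedup given (low, high) runtimes in seconds.

    Hypothetical helper for illustration only; not part of VENUS.
    The smallest speedup pairs the fastest old runtime with the slowest
    new runtime; the largest pairs the slowest old with the fastest new.
    """
    before_lo, before_hi = before_s
    after_lo, after_hi = after_s
    return before_lo / after_hi, before_hi / after_lo


# Reported per-image runtimes, converted to seconds.
before = (6 * 60, 10 * 60)   # 6-10 minutes
after = (20, 30)             # 20-30 seconds

lo, hi = speedup_range(before, after)
print(f"Implied speedup: {lo:.0f}x to {hi:.0f}x")
```

Under these assumptions the reported numbers correspond to roughly a 12x to 30x reduction in per-image runtime.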
Abstract
State-of-the-art text-based image editing models often struggle to balance background preservation with semantic consistency, frequently resulting either in the synthesis of entirely new images or in outputs that fail to realize the intended edits. In contrast, scene graph-based image editing addresses this limitation by providing a structured representation of semantic entities and their relations, thereby offering improved controllability. However, existing scene graph editing methods typically depend on model fine-tuning, which incurs high computational cost and limits scalability. To this end, we introduce VENUS (Visual Editing with Noise inversion Using Scene graphs), a training-free framework for scene graph-guided image editing. Specifically, VENUS employs a split prompt conditioning strategy that disentangles the target object of the edit from its background context, while…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship
