MADI: Masking-Augmented Diffusion with Inference-Time Scaling for Visual Editing

Shreya Kadambi; Risheek Garrepalli; Shubhankar Borse; Munawar Hyatt; Fatih Porikli

arXiv:2507.13401·cs.CV·July 21, 2025

TL;DR

MADI enhances diffusion models for visual editing by introducing a novel training strategy and inference-time scaling, significantly improving their controllability, compositionality, and editability in image generation tasks.

Contribution

The paper proposes Masking-Augmented Diffusion with Inference-Time Scaling (MADI), combining a new training method and inference mechanism to improve structured visual editing capabilities.

Findings

01. Enhanced editability and controllability of diffusion models.

02. Improved performance with expressive prompts during training.

03. Significant gains in localized and structure-aware image editing.

Abstract

Despite the remarkable success of diffusion models in text-to-image generation, grounded visual editing and compositional control remain challenging for them. Motivated by advances in self-supervised learning and in-context generative modeling, we propose a series of simple yet powerful design choices that significantly enhance the capacity of diffusion models for structured, controllable generation and editing. We introduce Masking-Augmented Diffusion with Inference-Time Scaling (MADI), a framework that improves the editability, compositionality, and controllability of diffusion models through two core innovations. First, we introduce Masking-Augmented gaussian Diffusion (MAgD), a novel training strategy with a dual corruption process that combines standard denoising score matching with masked reconstruction by masking the noisy input from the forward process. MAgD encourages the model to…
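
The dual corruption described in the abstract (Gaussian forward noising followed by masking of the noisy input) might look roughly like the NumPy sketch below. This is an illustration only, not the authors' implementation: the function names `dual_corrupt` and `noise_prediction_loss`, the patch-wise random mask, the `mask_ratio` and `patch` defaults, and the choice to compute the loss over all pixels are assumptions that the abstract does not specify.

```python
import numpy as np

def dual_corrupt(x0, alpha_bar_t, mask_ratio=0.5, patch=4, rng=None):
    """Noise a clean image, then mask random patches of the noisy result.

    Hypothetical helper: patch-wise masking and the default values are
    assumptions made for illustration, not taken from the paper.

    x0:          clean image, shape (H, W, C), H and W divisible by `patch`.
    alpha_bar_t: cumulative signal level in (0, 1] for the sampled timestep.
    Returns (x_masked, noise, keep), where keep has shape (H, W, 1) and
    holds 1.0 for visible pixels and 0.0 for masked pixels.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = x0.shape

    # Corruption 1: standard Gaussian forward process,
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise

    # Corruption 2: drop random patches of the *noisy* input.
    keep_patches = (rng.random((h // patch, w // patch)) >= mask_ratio)
    keep_patches = keep_patches.astype(np.float64)
    # Upsample the patch grid to pixel resolution and add a channel axis.
    keep = np.kron(keep_patches, np.ones((patch, patch)))[..., None]

    return x_t * keep, noise, keep

def noise_prediction_loss(pred_noise, noise):
    """Mean squared error between predicted and true noise.

    Averaging over all pixels (masked and visible) is an assumption.
    """
    return float(np.mean((pred_noise - noise) ** 2))
```

In a training loop, a denoiser would receive the masked noisy input together with the timestep and any conditioning, and its noise prediction would be scored with `noise_prediction_loss`. Predicting the noise inside masked patches forces the model to rely on the visible context, which is the masked-reconstruction half of the objective.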
