TL;DR
StructDiff is a single-scale diffusion framework for single-image generation. It preserves the structural layout of the source image, offers spatial control through 3D positional encoding, and proposes a new evaluation criterion; the authors report that it outperforms existing methods.
Contribution
It introduces an adaptive receptive field module, employs 3D positional encoding for spatial control, and proposes a new LLM-based evaluation criterion for single-image generation.
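To make the positional-encoding idea concrete, here is a minimal sketch of a sinusoidal encoding over a three-coordinate grid. It is an illustration under stated assumptions, not StructDiff's actual formulation: the summary does not say what the third coordinate represents or which encoding family is used, so the sinusoidal form, the function names `positional_encoding_3d` and `encoding_grid`, the generic `z` coordinate, and the `num_freqs` parameter are all hypothetical.

```python
import math


def positional_encoding_3d(x, y, z, num_freqs=4):
    """Encode one (x, y, z) coordinate as a flat list of sin/cos features.

    Assumption: a standard sinusoidal encoding with frequencies 2^k * pi.
    The encoding actually used by StructDiff is not given in the summary.
    """
    feats = []
    for coord in (x, y, z):
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * coord
            feats.append(math.sin(angle))
            feats.append(math.cos(angle))
    return feats  # length = 3 coordinates * num_freqs * 2


def encoding_grid(height, width, z=0.0, num_freqs=4):
    """Build a height x width grid of encodings.

    x and y are pixel positions normalized to [0, 1]; z is a hypothetical
    third coordinate held constant across the grid.
    """
    grid = []
    for row in range(height):
        y = row / max(height - 1, 1)
        grid.append([
            positional_encoding_3d(col / max(width - 1, 1), y, z, num_freqs)
            for col in range(width)
        ])
    return grid


grid = encoding_grid(4, 6)
```

In a sketch like this, each pixel receives a feature vector that depends only on its coordinates, so feeding a shifted or rescaled coordinate grid changes the spatial prior without touching the image content. Whether StructDiff exposes control in exactly this way is not stated in the summary.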
Findings
Outperforms existing methods in structural consistency and visual quality.
Enables flexible spatial control over generated content.
Demonstrates broad applicability across various image synthesis tasks.
Abstract
This paper introduces StructDiff, a generative framework based on a single-scale diffusion model for single-image generation. Single-image generation aims to synthesize diverse samples with similar visual content to the source image by capturing its internal statistics, without relying on external data. However, existing methods often struggle to preserve the structural layout, especially for images with large rigid objects or strict spatial constraints. Moreover, most approaches lack spatial controllability, making it difficult to guide the structure or placement of generated content. To address these challenges, StructDiff introduces an adaptive receptive field module to maintain both global and local distributions. Building on this foundation, StructDiff incorporates 3D positional encoding (PE) as a spatial prior, allowing flexible control over positions, scale, and local…
