TL;DR
OcclusionFormer is an occlusion-aware Diffusion Transformer framework for layout-to-image generation. It explicitly models the Z-order of overlapping objects and is built on SA-Z, a new occlusion-annotated dataset, which improves spatial realism where objects overlap.
Contribution
The paper introduces SA-Z, a large-scale dataset with explicit occlusion ordering and pixel-level annotations, and OcclusionFormer, a Diffusion Transformer model that encodes Z-order priority by decoupling instances and compositing them via volume rendering.
Findings
Reduces generation ambiguity in regions where bounding boxes overlap.
Enforces correct occlusion dependencies between overlapping instances.
Achieves substantial accuracy improvements.
Abstract
Recent layout-to-image models have achieved remarkable progress in spatial controllability. However, they still struggle with inter-object occlusion. When bounding boxes overlap, most existing methods lack explicit occlusion information, which makes the generation in intersection regions inherently ambiguous and hinders the determination of complex occlusion relationships. As a result, they often produce entangled textures or physically inconsistent layering in the overlapped areas. To address this issue, we first construct SA-Z, a large-scale dataset enriched with explicit occlusion ordering and pixel-level annotations. Building upon our proposed dataset, we introduce OcclusionFormer, a novel occlusion-aware Diffusion Transformer framework that explicitly models Z-order priority by decoupling instances and compositing them via volume rendering. Furthermore, to ensure fine-grained…
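The idea of compositing decoupled instances by Z-order can be illustrated with a minimal sketch of front-to-back compositing in the style of volume rendering. This is not the authors' implementation: the function name `composite_front_to_back`, the convention that a smaller `z_order` is nearer the viewer, and the use of scalar per-pixel colors on plain Python lists are all assumptions made for illustration. The paper's model presumably works on learned features rather than raw pixel values.

```python
def composite_front_to_back(layers, background=0.0):
    """Composite per-instance layers in Z-order (hypothetical helper).

    layers: list of (z_order, alpha, color) tuples, where alpha and color
    are 2D lists of floats with identical shape. Assumption: a smaller
    z_order means the instance is nearer the viewer.
    """
    if not layers:
        raise ValueError("at least one layer is required")

    # Sort so the nearest instance is processed first.
    ordered = sorted(layers, key=lambda layer: layer[0])
    height = len(ordered[0][1])
    width = len(ordered[0][1][0])

    out = [[0.0] * width for _ in range(height)]
    # Transmittance: fraction of light still unblocked at each pixel.
    transmittance = [[1.0] * width for _ in range(height)]

    for _, alpha, color in ordered:
        for y in range(height):
            for x in range(width):
                weight = transmittance[y][x] * alpha[y][x]
                out[y][x] += weight * color[y][x]
                transmittance[y][x] *= 1.0 - alpha[y][x]

    # Whatever is not blocked by any instance shows the background.
    for y in range(height):
        for x in range(width):
            out[y][x] += transmittance[y][x] * background
    return out


# Two instances on a 1x3 strip that overlap in the middle pixel.
front = (0, [[1.0, 1.0, 0.0]], [[1.0, 1.0, 1.0]])
back = (1, [[0.0, 1.0, 1.0]], [[0.5, 0.5, 0.5]])
result = composite_front_to_back([back, front])
print(result)  # [[1.0, 1.0, 0.5]]
```

In the overlapping middle pixel the nearer instance fully determines the output, so the intersection region is no longer ambiguous. Swapping the two `z_order` values flips which instance wins there.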
