Referring Layer Decomposition
Fangyi Chen, Yaojie Shen, Lu Xu, Ye Yuan, Shu Zhang, Yulei Niu, Longyin Wen

TL;DR
This paper introduces Referring Layer Decomposition (RLD), a new task for extracting object-aware RGBA layers from images based on user prompts, supported by a large dataset and a baseline model for controllable image editing.
Contribution
The paper presents the RLD task, a large-scale dataset called RefLade, and a baseline model called RefLayer, enabling structured, controllable image editing through prompt-conditioned layer decomposition.
Findings
RefLade contains 1.11 million image-layer-prompt triplets.
RefLayer achieves high visual fidelity and semantic alignment.
The approach generalizes well in zero-shot scenarios.
Abstract
Precise, object-aware control over visual content is essential for advanced image editing and compositional generation. Yet, most existing approaches operate on entire images holistically, limiting the ability to isolate and manipulate individual scene elements. In contrast, layered representations, where scenes are explicitly separated into objects, environmental context, and visual effects, provide a more intuitive and structured framework for interpreting and editing visual content. To bridge this gap and enable both compositional understanding and controllable editing, we introduce the Referring Layer Decomposition (RLD) task, which predicts complete RGBA layers from a single RGB image, conditioned on flexible user prompts, such as spatial inputs (e.g., points, boxes, masks), natural language descriptions, or combinations thereof. At the core is the RefLade, a large-scale dataset…
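To illustrate why an RGBA layer is a convenient unit for editing, the following minimal sketch composites one layer over a new background with the standard "over" operator. It assumes straight (non-premultiplied) alpha and channel values in [0, 1]. `composite_over` is a hypothetical helper for illustration only, not part of the paper's model or data pipeline.

```python
def composite_over(layer_rgba, background_rgb):
    """Blend an RGBA layer over an RGB background, pixel by pixel.

    Assumes straight (non-premultiplied) alpha and values in [0, 1].
    Both inputs are lists of rows; each row is a list of pixel tuples.
    """
    result = []
    for layer_row, bg_row in zip(layer_rgba, background_rgb):
        out_row = []
        for (r, g, b, a), (bg_r, bg_g, bg_b) in zip(layer_row, bg_row):
            # Standard "over" operator: alpha * foreground + (1 - alpha) * background.
            out_row.append((
                a * r + (1 - a) * bg_r,
                a * g + (1 - a) * bg_g,
                a * b + (1 - a) * bg_b,
            ))
        result.append(out_row)
    return result


# A 1x3 red layer: opaque, half-transparent, and fully transparent pixels.
layer = [[(1.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 0.5), (0.0, 0.0, 0.0, 0.0)]]
# A 1x3 solid blue background.
background = [[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]]

composited = composite_over(layer, background)
```

Because the layer carries its own alpha channel, the same object can be placed over any background without re-segmenting the source image.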
Peer Reviews
Decision: ICLR 2026 Poster
- The proposed dataset constitutes a significant improvement over existing resources for this problem domain, both in scale and in the level of curation. The combination of automated and manual verification enhances its overall quality and reliability.
- The paper provides thorough evaluations, including analyses of design choices, as well as assessments of the alignment between the proposed metrics and human judgments.
- The paper is clearly written, with a well-motivated problem statement and
The paper is good overall, and I didn't find major weaknesses. One minor point: for the completion metric, what is the rationale for defining it as the difference between CLIP embeddings, $f(g_\text{rgb}) - f(g_\text{rgb} * g_v)$, rather than directly using the CLIP embedding of the non-visible region, $f(g_\text{rgb} * (1 - g_v))$? The motivation for this specific formulation should be clarified.
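The reviewer's question can be made concrete with a toy sketch. Assuming $g_v$ is a binary visibility mask and treating $f$ as an arbitrary encoder, the two formulations coincide when $f$ is linear and generally differ when $f$ is nonlinear, as a CLIP encoder is. The `linear_embed` and `nonlinear_embed` functions below are hypothetical stand-ins, not CLIP and not the paper's implementation, and the sketch takes no position on which formulation is preferable.

```python
import math


def apply_mask(pixels, mask):
    """Element-wise product of a flat pixel list and a same-length mask."""
    return [p * m for p, m in zip(pixels, mask)]


def linear_embed(pixels):
    """Toy linear encoder (assumption): two fixed weighted sums of the pixels."""
    w1 = [0.5, -1.0, 2.0, 0.25]
    w2 = [1.0, 0.5, -0.5, 1.5]
    return [
        sum(w * p for w, p in zip(w1, pixels)),
        sum(w * p for w, p in zip(w2, pixels)),
    ]


def nonlinear_embed(pixels):
    """Toy nonlinear encoder (assumption): tanh of the linear features."""
    return [math.tanh(x) for x in linear_embed(pixels)]


def difference_form(embed, g_rgb, g_v):
    """f(g_rgb) - f(g_rgb * g_v): full embedding minus visible-part embedding."""
    full = embed(g_rgb)
    visible = embed(apply_mask(g_rgb, g_v))
    return [a - b for a, b in zip(full, visible)]


def direct_form(embed, g_rgb, g_v):
    """f(g_rgb * (1 - g_v)): embedding of the non-visible region alone."""
    hidden_mask = [1 - m for m in g_v]
    return embed(apply_mask(g_rgb, hidden_mask))


# Four "pixels"; the first two are visible, the last two are occluded.
g_rgb = [0.8, 0.2, 0.6, 0.4]
g_v = [1, 1, 0, 0]

linear_gap = [
    abs(a - b)
    for a, b in zip(
        difference_form(linear_embed, g_rgb, g_v),
        direct_form(linear_embed, g_rgb, g_v),
    )
]
nonlinear_gap = [
    abs(a - b)
    for a, b in zip(
        difference_form(nonlinear_embed, g_rgb, g_v),
        direct_form(nonlinear_embed, g_rgb, g_v),
    )
]
```

With the linear encoder the two forms agree up to floating-point error, because $g_\text{rgb} = g_\text{rgb} * g_v + g_\text{rgb} * (1 - g_v)$. With the nonlinear encoder they diverge, so the choice between them is a real design decision.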
- The paper is well-written and easy to follow. The figures and charts help understand the pipeline and the statistics of the dataset very well.
- We all know that data is key to the advance of GenAI. This paper proposes a large-scale, real-world dataset for an interesting task. The data generation pipeline involves several SOTA models which guarantee the high quality of the dataset.
- The analysis on data scale, data quality (different subsets), and pre-trained models is comprehensive.
- I love
I do not see any major weaknesses in the paper. As a paper that defines a new task and proposes a new dataset, every aspect of it is executed very well. One possible concern is the number of baselines: the paper proposes its own simple baseline, which is great, but would it be possible to adapt some prior work's models to this task and benchmark them against your proposed method? Another weakness: can you show some downstream application of the dataset, similar to Sec. 4.4 of the MULAN paper? For exampl
1. This paper introduces Referring Layer Decomposition (RLD), a pioneering task that explores layer decomposition guided by multi-modal referring inputs.
2. The authors introduce RefLade, a large-scale dataset of 1.11 million image-layer-prompt triplets built using a scalable data engine. With its human-curated splits for tuning and testing and a well-defined evaluation protocol, RefLade paves the way for future RLD studies. RefLayer is also designed as a simple baseline.
1. More details should be given on how the correctness of the image-layer-prompt triplets is ensured. In scene understanding, the available models for object detection and instance segmentation cannot perform well in all situations, especially for small or occluded objects. How do the authors deal with these cases?
2. It would have been better to show the image distribution with respect to styles (e.g., real images, cartoons, posters) and to discuss the model's performance across different styles.
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Cell Image Analysis Techniques
