EC-Diffuser: Multi-Object Manipulation via Entity-Centric Behavior Generation
Carl Qi, Dan Haramati, Tal Daniel, Aviv Tamar, Amy Zhang

TL;DR
EC-Diffuser introduces an entity-centric Transformer with diffusion-based optimization for multi-object manipulation, enabling efficient offline learning and zero-shot generalization to unseen object configurations and goals.
Contribution
The paper presents a novel behavioral cloning approach using object-centric representations and an entity-centric Transformer with diffusion models for improved multi-object manipulation.
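To make the "diffusion" part of the contribution concrete, the following is a minimal sketch of standard DDPM ancestral sampling applied to a short action vector. It is an illustration under stated assumptions, not the paper's implementation: whether EC-Diffuser uses this exact sampler or noise schedule is not stated in this summary, and `toy_noise_predictor` is a hypothetical stand-in for the learned entity-centric Transformer denoiser. The stand-in is the analytically optimal noise predictor for a dataset containing a single expert action vector, so the sampler should recover that vector.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice) with cumulative alpha products."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    return betas, alphas, alpha_bars


def toy_noise_predictor(x_t, t, alpha_bars, expert_actions):
    """Hypothetical stand-in for a learned denoiser.

    For a dataset with one expert action vector, the optimal noise estimate
    is (x_t - sqrt(alpha_bar_t) * x_0) / sqrt(1 - alpha_bar_t).
    A real model would instead condition on object-centric latents and a goal.
    """
    ab = alpha_bars[t]
    return [(x - math.sqrt(ab) * a) / math.sqrt(1.0 - ab)
            for x, a in zip(x_t, expert_actions)]


def sample_actions(expert_actions, num_steps=50, seed=0):
    """Run the DDPM reverse process from Gaussian noise down to step 0."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in expert_actions]  # start from pure noise
    for t in reversed(range(num_steps)):
        eps = toy_noise_predictor(x, t, alpha_bars, expert_actions)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t])
                for xi, ei in zip(x, eps)]
        if t > 0:
            sigma = math.sqrt(betas[t])  # fixed-variance choice sigma_t^2 = beta_t
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no noise is added at the final step
    return x


if __name__ == "__main__":
    print(sample_actions([0.5, -0.25, 1.0]))
```

In a behavioral cloning setting, the noise predictor would be trained on offline demonstrations, and the sampled vector would be interpreted as a sequence of future states and actions rather than three scalars.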
Findings
Improves performance on multi-object tasks over baseline methods, including the non-diffusion baseline VQ-BeT.
Enables zero-shot generalization to unseen object configurations.
Handles more objects at test time than were seen during training.
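One property that plausibly supports these generalization findings is that self-attention without positional encodings treats its inputs as a set. The sketch below is a simplified illustration, not the paper's architecture: it uses a single head with identity query/key/value projections (an assumption made to keep the example short; learned projections do not change the property). It shows that shuffling the entities only shuffles the outputs, and that the same function accepts any number of entities.

```python
import math


def self_attention(entities):
    """Single-head self-attention over a list of entity vectors.

    No positional encodings are used, so the result does not depend on
    the order in which entities are listed (permutation equivariance).
    """
    dim = len(entities[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in entities:
        # Scaled dot-product scores of this entity against every entity.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in entities]
        # Numerically stable softmax over the scores.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of the entity vectors (values = inputs here).
        outputs.append([
            sum(w / total * value[d] for w, value in zip(weights, entities))
            for d in range(dim)
        ])
    return outputs


if __name__ == "__main__":
    three_objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    perm = [2, 0, 1]
    shuffled = [three_objects[i] for i in perm]

    out = self_attention(three_objects)
    out_shuffled = self_attention(shuffled)
    # Row j of the shuffled output corresponds to row perm[j] of the original.
    for j, i in enumerate(perm):
        print(out_shuffled[j], out[i])

    # The same function runs unchanged on a larger set of entities.
    five_objects = three_objects + [[0.5, 0.5], [2.0, -1.0]]
    print(len(self_attention(five_objects)))
```

Because nothing in the computation is tied to a fixed number or ordering of inputs, a model built from such layers can be applied to scenes with more entities than it was trained on, although whether it performs well there is an empirical question.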
Abstract
Object manipulation is a common component of everyday tasks, but learning to manipulate objects from high-dimensional observations presents significant challenges. These challenges are heightened in multi-object environments due to the combinatorial complexity of the state space as well as of the desired behaviors. While recent approaches have utilized large-scale offline data to train models from pixel observations, achieving performance gains through scaling, these methods struggle with compositional generalization in unseen object configurations with constrained network and dataset sizes. To address these issues, we propose a novel behavioral cloning (BC) approach that leverages object-centric representations and an entity-centric Transformer with diffusion-based optimization, enabling efficient learning from offline image data. Our method first decomposes observations into an…
Peer Reviews
Decision: ICLR 2025 Poster
The proposed method leverages object-centric representations to generate action-state sequences for Behavior Cloning and shows that it generalizes to manipulating multiple objects. Experiments are conducted across three different tasks.
The novelty of the proposed method is limited, as it combines existing methods for entity-centric representation with a diffusion policy. It builds upon the existing method [1], using diffusion models instead of transformers to generate the action-state sequences for object manipulation. Diffusion models have already been shown to be useful for object manipulation tasks in prior work [2, 3]. [1] Haramati et al., Entity-Centric Reinforcement Learning for Object Manipulation from Pixels. [2] Mis
- The idea of predicting object manipulation actions through a diffusion-based architecture is interesting. By incorporating object-level information, the proposed method achieves a performance improvement over baseline methods.
- The paper also compares with the state-of-the-art non-diffusion baseline, VQ-BeT, demonstrating the effectiveness of the proposed method.
- The method depends on the capabilities of the image representation algorithm DLP, and experiments are conducted only in synthetic environments. It is unclear whether it will perform well in more complex settings. For example, the method is compared with only one non-diffusion baseline, VQ-BeT, which performs extremely poorly on PushCube and PushT (2 of the 3 test environments used in this paper). I wonder why it is only tested on these three environments. Is it possible to also test the model on ot
* __Writing.__ The paper is well-written and easy to follow. The motivation behind addressing the challenges in multi-object manipulation from high-dimensional pixel observations is clearly articulated.
* __Novelty.__ The authors present a novel integration of existing approaches by combining object-centric representations (Deep Latent Particles) with a diffusion-based behavioral cloning method. The use of an entity-centric Transformer to handle the unordered nature of latent representations i
* __Alternative object-centric encoders.__ While the authors utilize DLPv2 as a powerful object-centric encoder, the paper would benefit from experimenting with other similar approaches. Since the DLP module appears to be easily replaceable with other object-centric encoders, exploring alternatives could provide insights into the generality and robustness of the proposed method. This would also demonstrate whether the observed benefits are specific to DLPv2 or applicable across different encodin
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Artificial Intelligence in Games
Methods: Attention Is All You Need · Byte Pair Encoding · Linear Layer · Absolute Position Encodings · Dropout · Softmax · Dense Connections · Residual Connection · Diffusion · Multi-Head Attention
