SELU: Self-Learning Embodied MLLMs in Unknown Environments
Boyu Li, Haobin Jiang, Ziluo Ding, Xinrun Xu, Haoran Li, Dongbin Zhao, and Zongqing Lu

TL;DR
SELU introduces a self-learning actor-critic framework for multimodal large language models, enhancing environmental understanding and decision-making in unknown environments without external feedback, demonstrated in AI2-THOR and VirtualHome.
Contribution
The paper presents a novel self-learning paradigm for MLLMs using actor-critic methods, improving environmental comprehension and decision-making without external feedback.
Findings
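Critic improvements of approximately 28% and 30% via self-learning across the two evaluation environments, AI2-THOR and VirtualHome.
Actor improvements of about 20% and 24% across the same two environments.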
Effective self-learning in unknown environments.
Abstract
Recently, multimodal large language models (MLLMs) have demonstrated strong visual understanding and decision-making capabilities, enabling the exploration of autonomously improving MLLMs in unknown environments. However, external feedback like human or environmental feedback is not always available. To address this challenge, existing methods primarily focus on enhancing the decision-making capabilities of MLLMs through voting and scoring mechanisms, while little effort has been paid to improving the environmental comprehension of MLLMs in unknown environments. To fully unleash the self-learning potential of MLLMs, we propose a novel actor-critic self-learning paradigm, dubbed SELU, inspired by the actor-critic paradigm in reinforcement learning. The critic employs self-asking and hindsight relabeling to extract knowledge from interaction trajectories collected by the actor, thereby…
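The hindsight relabeling step mentioned in the abstract can be illustrated with a minimal Python sketch. This is the generic technique, not the authors' implementation: the `Trajectory` and `CriticExample` types, the `hindsight_relabel` function, and the two callables `judge_success` and `describe_outcome` (stand-ins for queries to the critic MLLM) are all hypothetical names introduced here. Keeping the failed trajectory as a negative example alongside the relabeled positive one is a design choice of this sketch, not something the abstract states.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Trajectory:
    """One episode collected by the actor (hypothetical structure)."""
    instruction: str                # task the actor was asked to perform
    observations: Tuple[str, ...]   # text stand-ins for image frames
    actions: Tuple[str, ...]        # illustrative action strings


@dataclass(frozen=True)
class CriticExample:
    """One training example for the critic (hypothetical structure)."""
    instruction: str
    observations: Tuple[str, ...]
    success: bool


def hindsight_relabel(
    traj: Trajectory,
    judge_success: Callable[[Trajectory], bool],
    describe_outcome: Callable[[Trajectory], str],
) -> List[CriticExample]:
    """Turn one trajectory into critic training examples.

    If the trajectory is judged successful, keep it as a positive example.
    Otherwise keep it as a negative example for the original instruction and
    add a positive example whose instruction is what was actually achieved.
    """
    if judge_success(traj):
        return [CriticExample(traj.instruction, traj.observations, True)]
    achieved = describe_outcome(traj)
    return [
        CriticExample(traj.instruction, traj.observations, False),
        CriticExample(achieved, traj.observations, True),
    ]


# Toy stand-ins for the critic's judgments (assumptions for illustration only).
def toy_judge(traj: Trajectory) -> bool:
    # Success if the target object (last word of the instruction)
    # appears in the final observation.
    return traj.instruction.split()[-1] in traj.observations[-1]


def toy_describe(traj: Trajectory) -> str:
    # Describe the achieved outcome from the final observation's last word.
    return "pick up the " + traj.observations[-1].split()[-1]


failed = Trajectory(
    instruction="pick up the mug",
    observations=("robot faces the counter", "robot is holding the apple"),
    actions=("MoveAhead", "PickupObject"),
)
examples = hindsight_relabel(failed, toy_judge, toy_describe)
for ex in examples:
    print(ex.instruction, ex.success)
```

In this toy run the actor was asked for the mug but ended up holding the apple, so the sketch yields a negative example for "pick up the mug" and a relabeled positive example for "pick up the apple". No external reward is consulted; both labels come from the stand-in critic functions.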
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The paper is clearly presented and easy to follow.
- The idea is straightforward and seems effective.
- Strong and competitive performance.
- The proposed framework is not novel. At its core it's a combination of different techniques which are originally adopted under RL context, such as the actor-critic framework and hindsight relabeling.
- Although authors claim the critic learning without any external feedback (e.g., reward) as a strength, I doubt the correctness of resulting critics. For actor-critic methods under RL context, the critic is usually updated with L2 loss between its value estimation and groundtruth. Otherwise, it c
- **Originality**:
  - As highlighted by the authors in their related work section, there has been extensive work on embodied MLLMs in recent years. The authors distinguish their work by focusing on the self-improvement of these models, instead of relying on (expensive) human feedback signals or environmental rewards. In particular, the authors propose a new architecture that takes inspiration from the actor-critic method in reinforcement learning. Actor-critic methods have been extensively expl
My main concerns with the current version of the work are the following:
- In the current version of the work, significant details regarding the evaluation are missing (see "Questions" for details). There is no description of the evaluation methodology of SELU and the baselines, which would fit quite naturally as another subsection of Section 5.1. As such, it is hard to understand what the evaluation metrics are and how they were computed, and thus the full significance of the results.
- The pro
1. The paper is clear and well-presented.
2. The experiments are thorough, and the ablation studies are informative. The authors also provide helpful analysis of the limitations and error modes.
3. The proposed method to couple comprehension and decision-making in an unfamiliar environment is interesting.
Just a few questions below.
The paper is well written. It is also a good idea to combine topics like LLM-based agents and actor-critic style updates.
The proposed method seems simple and offers limited technical contribution. The experimental environments seem limited, and some settings remain unclear to me (see below).
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
Methods: Focus · Self-Learning
