EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM
Shuang Ao, Flora D. Salim, Simon Khan

TL;DR
EMAC+ is a novel embodied multimodal agent that enhances LLM-based planning in robotics by integrating visual feedback through a VLM, enabling dynamic, environment-aware decision-making and improved task performance.
Contribution
This work introduces EMAC+, a bidirectional training framework that allows LLMs to learn from visual interactions, addressing key limitations of prior multimodal agents in robotics.
Findings
Achieves superior performance on the ALFWorld and RT-1 benchmarks.
Demonstrates robustness to noisy visual observations.
Enables LLMs to internalize environment dynamics through interaction.
Abstract
Although LLMs demonstrate proficiency in several text-based reasoning and planning tasks, their implementation in robotics control is constrained by significant deficiencies: (1) LLM agents are designed to work mainly with textual inputs rather than visual conditions; (2) current multimodal agents treat LLMs as static planners, which separates their reasoning from environment dynamics, resulting in actions that do not take domain-specific knowledge into account; and (3) LLMs are not designed to learn from visual interactions, which makes it harder for them to make better policies for specific domains. In this paper, we introduce EMAC+, an Embodied Multimodal Agent that collaboratively integrates LLM and VLM via a bidirectional training paradigm. Unlike existing methods, EMAC+ dynamically refines high-level textual plans generated by an LLM using real-time feedback from a VLM executing…
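The plan-refinement idea described in the abstract (an LLM proposes a high-level textual plan, and feedback from execution is used to revise it) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `refine_plan_loop`, `toy_planner`, and `make_toy_executor` are hypothetical names, the planner and executor are hand-written stubs rather than real LLM or VLM calls, and the sketch does not cover the bidirectional training that the paper describes.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces, not taken from the EMAC+ paper.
Planner = Callable[[str, List[str]], List[str]]   # (task, feedback so far) -> plan steps
Executor = Callable[[str], Tuple[bool, str]]      # step -> (succeeded, textual feedback)


def refine_plan_loop(
    task: str, planner: Planner, executor: Executor, max_rounds: int = 3
) -> Tuple[List[str], List[str]]:
    """Re-plan whenever a step fails, passing the failure feedback to the planner."""
    feedback: List[str] = []
    plan: List[str] = []
    for _ in range(max_rounds):
        plan = planner(task, feedback)
        failed = False
        for step in plan:
            ok, note = executor(step)
            if not ok:
                feedback.append(note)  # keep the failure so the next plan can use it
                failed = True
                break
        if not failed:
            break  # every step succeeded; stop refining
    return plan, feedback


def toy_planner(task: str, feedback: List[str]) -> List[str]:
    """Stub standing in for an LLM planner (assumption)."""
    plan = ["take key from drawer"]
    if "drawer is closed" in feedback:
        plan.insert(0, "open drawer")
    return plan


def make_toy_executor() -> Executor:
    """Stub standing in for a VLM-driven executor with a tiny world state (assumption)."""
    state = {"drawer_open": False}

    def execute(step: str) -> Tuple[bool, str]:
        if step == "open drawer":
            state["drawer_open"] = True
            return True, "drawer is open"
        if step == "take key from drawer" and not state["drawer_open"]:
            return False, "drawer is closed"
        return True, "ok"

    return execute


if __name__ == "__main__":
    final_plan, notes = refine_plan_loop("get the key", toy_planner, make_toy_executor())
    print(final_plan)
    print(notes)
```

In this toy setting the first plan fails because the drawer is closed, and the second plan adds an opening step. A real system would replace the stubs with model calls and would need a richer feedback format than a single string.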
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper pdf is blank.
Taxonomy
Topics: Multi-Agent Systems and Negotiation · AI-based Problem Solving and Planning · Semantic Web and Ontologies
