FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making
Yucen Wang, Rui Yu, Shenghua Wan, Le Gan, De-Chuan Zhan

TL;DR
FOUNDER integrates foundation models with world models to enable open-ended, reward-free embodied decision making, effectively grounding external observations in the agent's internal state for multi-task control.
Contribution
The paper introduces a framework that grounds foundation model representations in the world model's state space, so that goal-conditioned policies can be learned through imagination in embodied environments.
Findings
Outperforms existing methods on multi-task offline visual control benchmarks.
Effectively captures deep semantics of tasks from text or videos.
Validates the consistency of the learned reward with ground-truth rewards.
Abstract
Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. In this work, we propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs to enable open-ended task solving in embodied environments in a reward-free manner. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent's physical states in the world simulator from external observations. This mapping enables the learning of a goal-conditioned policy through imagination during behavior learning, with the mapped task serving as the goal state. Our method leverages the predicted temporal distance to the goal state as an informative reward signal. FOUNDER demonstrates superior performance on various multi-task offline visual control…
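The reward idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear `map_to_wm_state` function, the `toy_distance` predictor (a plain L1 distance standing in for a learned temporal-distance model), and all numeric values are assumptions made for illustration only. The sketch shows a task embedding being mapped into a world-model state that serves as the goal, and imagined states being scored by the negative predicted distance to that goal.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def map_to_wm_state(fm_embedding: Sequence[float],
                    weights: Sequence[Sequence[float]]) -> Vector:
    """Hypothetical linear mapping from an FM embedding to a WM state.

    Assumption: the real mapping is learned; a fixed matrix is used here
    purely to keep the example self-contained.
    """
    return [sum(w * x for w, x in zip(row, fm_embedding)) for row in weights]


def toy_distance(state: Sequence[float], goal: Sequence[float]) -> float:
    """Stand-in for a learned temporal-distance predictor (assumption).

    Uses L1 distance between state and goal instead of a trained model.
    """
    return sum(abs(s - g) for s, g in zip(state, goal))


def temporal_distance_reward(
    state: Sequence[float],
    goal: Sequence[float],
    predict_distance: Callable[[Sequence[float], Sequence[float]], float],
) -> float:
    """Reward is the negative predicted distance to the goal state."""
    return -predict_distance(state, goal)


# Hypothetical 3-d task embedding projected into a 2-d world-model state.
weights = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]
goal = map_to_wm_state([0.5, -0.5, 2.0], weights)

# Imagined trajectory of world-model states approaching the goal.
imagined = [[0.0, 0.0], [0.25, -0.25], [0.5, -0.5]]
rewards = [temporal_distance_reward(s, goal, toy_distance) for s in imagined]
print(rewards)
```

In this toy setup the reward increases as the imagined states approach the mapped goal, which is the property a goal-conditioned policy trained in imagination would exploit.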
Taxonomy
Topics: Complex Systems and Decision Making
