BFM-Zero: A Promptable Behavioral Foundation Model for Humanoid Control Using Unsupervised Reinforcement Learning

Yitang Li; Zhengyi Luo; Tonghe Zhang; Cunxi Dai; Anssi Kanervisto; Andrea Tirinzoni; Haoyang Weng; Kris Kitani; Mateusz Guzek; Ahmed Touati; Alessandro Lazaric; Matteo Pirotta; Guanya Shi

arXiv:2511.04131 · cs.RO · November 7, 2025

TL;DR

BFM-Zero introduces a promptable behavioral foundation model for humanoid robots that learns a shared latent space enabling versatile, zero-shot, and few-shot control across multiple tasks in real-world settings.

Contribution

It presents BFM-Zero, a novel framework that unifies diverse humanoid control tasks into a single, promptable policy using unsupervised reinforcement learning and a structured latent space.

Findings

01. Enables zero-shot motion tracking, goal reaching, and reward optimization on a real humanoid.

02. Demonstrates robustness and versatility of the learned policy in real-world experiments.

03. Quantitative ablations show effectiveness of design choices in bridging the sim-to-real gap.

Abstract

Building Behavioral Foundation Models (BFMs) for humanoid robots has the potential to unify diverse control tasks under a single, promptable generalist policy. However, existing approaches are either exclusively deployed on simulated humanoid characters, or specialized to specific tasks such as tracking. We propose BFM-Zero, a framework that learns an effective shared latent representation that embeds motions, goals, and rewards into a common space, enabling a single policy to be prompted for multiple downstream tasks without retraining. This well-structured latent space in BFM-Zero enables versatile and robust whole-body skills on a Unitree G1 humanoid in the real world, via diverse inference methods, including zero-shot motion tracking, goal reaching, and reward optimization, and few-shot optimization-based adaptation. Unlike prior on-policy reinforcement learning (RL) frameworks,…
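The idea of prompting one policy through a shared latent space can be illustrated with a small sketch. This page gives no formulas, so everything below is an assumption based on the standard forward-backward (FB) representation conventions that the reviews allude to ("FB backbone"): a backward map B embeds states into the latent space, a goal prompt is B(goal), a reward prompt is the reward-weighted average of B over sampled states, and a tracking prompt averages B over a short window of upcoming reference states. The names `project_to_sphere`, `prompt_from_goal`, `prompt_from_reward`, `prompt_from_motion`, and `toy_backward_map` are hypothetical, and the random linear map stands in for a learned network; this is not the authors' implementation.

```python
import numpy as np

def project_to_sphere(z):
    """Rescale z to norm sqrt(d); a common FB convention, assumed here."""
    d = z.shape[-1]
    return np.sqrt(d) * z / (np.linalg.norm(z) + 1e-8)

def prompt_from_goal(backward_map, goal_state):
    """Goal reaching: the prompt is the embedding of the goal state."""
    return project_to_sphere(backward_map(goal_state))

def prompt_from_reward(backward_map, states, rewards):
    """Reward optimization: sample estimate of E[r(s) * B(s)]."""
    feats = np.stack([backward_map(s) for s in states])   # shape (n, d)
    weighted = np.asarray(rewards)[:, None] * feats       # reward-weighted features
    return project_to_sphere(weighted.mean(axis=0))

def prompt_from_motion(backward_map, future_states):
    """Motion tracking: average embedding over a window of reference states."""
    feats = np.stack([backward_map(s) for s in future_states])
    return project_to_sphere(feats.mean(axis=0))

# Toy stand-in for a learned backward map (hypothetical; a real B is a trained network).
rng = np.random.default_rng(0)
STATE_DIM, LATENT_DIM = 6, 4
W = rng.normal(size=(LATENT_DIM, STATE_DIM))

def toy_backward_map(state):
    return W @ np.asarray(state, dtype=float)

states = rng.normal(size=(5, STATE_DIM))
z_goal = prompt_from_goal(toy_backward_map, states[2])
# A reward that is 1 only on states[2] should yield the same prompt as the goal states[2].
z_reward = prompt_from_reward(toy_backward_map, states, rewards=np.eye(5)[2])
z_motion = prompt_from_motion(toy_backward_map, states[:3])
```

Under these assumptions, all three task types reduce to a vector `z` in the same space, which a single latent-conditioned policy could consume without retraining. In the toy example, an indicator reward on one state and a goal prompt for that state produce the same vector, which is the sense in which goals and rewards share one space.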

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

- The method contrasts with the dominant approach of running PPO over a motion dataset and demonstrates that it works. This may present a path forward to locomotion models trained with less reward engineering.
- A smooth and structured latent space is likely helpful for multimodal prompting or other downstream tasks.
- Pretraining seems scalable, subject to the limits of the simulation model.
- Real-world sim-to-real deployment of a model trained off-policy.

Weaknesses

- "Foundation model" may itself be overclaiming, since the model is essentially a low-level controller without rich vision or touch sensing.
- Prompting occurs in a human-uninterpretable latent space rather than in something like language.
- More detailed comparisons of computational cost and representation quality against typical on-policy methods would more convincingly validate the usefulness of the approach.
- Policy quality is still upper-bounded by the simulation environment.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- Real World Demonstrations: Impressive zero-shot performance on the Unitree G1, including balance maintenance, push recovery from large perturbations, and handling a 4 kg payload, showcases practical sim-to-real transfer.
- Clear Description of Method: The FB backbone, critics, reward shaping, and auxiliary components are explained adequately, though familiarity with prior FB-CPR work is helpful for full understanding.
- Thorough Ablations: Quantitative evaluations of privileged v

Weaknesses

- Lack of Competitive Baselines: The paper does not compare against state-of-the-art methods like Ex-Body 2, OmniH2O, or Puppeteer on identical tasks and metrics, making it hard to gauge relative advancements.
- Under-Quantified Latent Space Claims: The t-SNE plot of latents looks interpretable, but assertions about the "promptable" and semantic nature of the latent space are not backed by quantitative measures, weakening the foundation model claims.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The authors show promising results for zero-shot RL on humanoids. The video presents target pose tracking and continuous dancing, as well as few-shot adaptation in the real world on a challenging hopping example.
- The results are interesting. The natural recovery, the few-shot adaptation, and the motions with whole-body contact are especially impressive.

Weaknesses

In general, the paper seems rushed and contains many sections with obvious errors (see below for minor issues caught during the read). This is also true for the evaluation, which raises more questions than it answers (see the question section).
- **Major:** The main contribution of this work is the presentation of a method that works on a real humanoid. However, beyond the few-shot adaptation task, there is little quantitative evaluation in the real world. Furthermore, the results presented on

Code & Models

Models

🤗 LeCAR-Lab/BFM-Zero · model · ♡ 3

Taxonomy

Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Robotic Locomotion and Control