Towards Bridging the Gap between Large-Scale Pretraining and Efficient Finetuning for Humanoid Control

Weidong Huang; Zhehan Li; Hangxin Liu; Biao Hou; Yao Su; Jingwen Zhang

arXiv:2601.21363·cs.RO·February 24, 2026

Open Access · 3 Reviews

TL;DR

This paper shows that large-scale pretraining with off-policy RL, followed by model-based finetuning, narrows the gap between pretraining and efficient adaptation for humanoid control, enabling zero-shot deployment on real robots and safer adaptation to new environments.

Contribution

It introduces a method combining large-batch off-policy RL pretraining with model-based fine-tuning for humanoids, improving sample efficiency and deployment safety.

Findings

01

SAC with high UTD ratio supports large-scale pretraining.

02

Pretrained policies enable zero-shot deployment on real robots.

03

Model-based fine-tuning improves adaptation to new environments.

Abstract

Reinforcement learning (RL) is widely used for humanoid control, with on-policy methods such as Proximal Policy Optimization (PPO) enabling robust training via large-scale parallel simulation and, in some cases, zero-shot deployment to real robots. However, the low sample efficiency of on-policy algorithms limits safe adaptation to new environments. Although off-policy RL and model-based RL have shown improved sample efficiency, a gap remains between large-scale pretraining and efficient finetuning on humanoids. In this paper, we find that off-policy Soft Actor-Critic (SAC), with large-batch updates and a high Update-To-Data (UTD) ratio, reliably supports large-scale pretraining of humanoid locomotion policies, achieving zero-shot deployment on real robots. For adaptation, we demonstrate that these SAC-pretrained policies can be finetuned in new environments and…
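To make the UTD ratio mentioned in the abstract concrete, the following is a minimal sketch of an off-policy training schedule, not the authors' implementation. It assumes the UTD ratio means the number of gradient updates performed per collection step across all parallel environments; other definitions exist. The names `ReplayBuffer` and `run_schedule`, the placeholder transitions, and all numeric values are hypothetical, and the SAC actor and critic updates are replaced by a counter.

```python
import random


class ReplayBuffer:
    """Fixed-capacity ring buffer of transitions (hypothetical helper)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []
        self.next_index = 0

    def add(self, transition):
        # Append until full, then overwrite the oldest entry.
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.next_index] = transition
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size, rng):
        # Sample with replacement so large batches work on a small buffer.
        return rng.choices(self.storage, k=batch_size)


def run_schedule(num_envs, collect_steps, utd_ratio, batch_size, seed=0):
    """Count environment transitions and gradient updates for a schedule.

    Assumption: `utd_ratio` gradient updates follow every collection step,
    where one collection step advances all `num_envs` environments once.
    """
    rng = random.Random(seed)
    buffer = ReplayBuffer(capacity=10_000)
    env_transitions = 0
    gradient_updates = 0
    for step in range(collect_steps):
        # Placeholder transitions; a real loop would store
        # (state, action, reward, next_state, done) from the simulator.
        for env_id in range(num_envs):
            buffer.add((step, env_id))
        env_transitions += num_envs
        for _ in range(utd_ratio):
            batch = buffer.sample(batch_size, rng)
            assert len(batch) == batch_size
            # Placeholder for one SAC critic/actor update on `batch`.
            gradient_updates += 1
    return env_transitions, gradient_updates


if __name__ == "__main__":
    transitions, updates = run_schedule(
        num_envs=4, collect_steps=10, utd_ratio=8, batch_size=32
    )
    print(transitions, updates)
```

Under this assumed definition, raising `utd_ratio` reuses each collected transition in more gradient updates, which is the usual route to higher sample efficiency in off-policy methods, at the cost of more computation per collection step.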

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

* The paper provides a valuable finding by demonstrating that SAC, an off-policy algorithm, can be successfully scaled for massively parallel pre-training of humanoid controllers. The open-sourcing of a JAX-based SAC implementation for this purpose is a welcome contribution.
* The proposed hybrid physics-informed world model works and the target environment adaptation works well for humanoid simulation. The ablation study in Appendix A.3 (Fig. 6) shows that a standard "black-box" MBPO-style mod

Weaknesses

* Lack of Real-World Fine-Tuning (Sim-to-Sim): The paper's primary weakness is that its core claim—safe, efficient fine-tuning—is only validated in a "sim-to-sim" setting (MuJoCo to Brax). While the paper shows zero-shot *pre-training* on a real robot (Appendix A.6), it does not "close the loop" and test the *fine-tuning* procedure in the real world. The claim of "safety" is significantly weaker without this, as collecting even "deterministic" data in a new real-world environment carries risks (

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The proposed pipeline is systematically motivated, decomposing the challenge of bridging data-hungry RL pretraining with safety and efficiency in real-world adaptation.
- The integration of Lagrangian dynamics with learned residuals for improved world model rollouts is novel in the humanoid domain and addresses both sample efficiency and stability concerns.
- The work provides results for two complex humanoid platforms (Booster T1 and Unitree G1), encompassing flat and rough terrain, various

Weaknesses

- **Task Diversity**: The evaluation focuses on forward locomotion tasks (varying speed targets on flat or rough terrain). This is a narrow slice of humanoid skills. The significance would be further bolstered by testing, say, different locomotion gaits or disturbances, or tasks like turning, obstacle avoidance, etc. It’s unclear how readily the approach extends to non-locomotion tasks or more complex objectives.
- **Use of Privileged Information**: The method relies on a privileged state (inclu

Reviewer 03 · Rating 6 · Confidence 3

Strengths

> The authors provide a fast, scalable implementation of SAC, which opens up new avenues for research on the limitations of online reinforcement learning.

> The authors conduct extensive studies to verify the efficacy of their algorithm. I appreciated their use of 8 seeds, which is much better than the typical practice in other deep RL research (as few as three seeds in most works I have reviewed). The inclusion of zero-shot transfer in real robotics was also a promising finding, although no expe

Weaknesses

Our biggest concern is the quality of organization and writing in the current submission. A good example of this is the exclusion of results to answer Q3 in the experiment section. If the paper had reduced content in other sections, it would have been easier to include them in the main paper. For example, comments on design choices in the method section can be moved to the appendix, such as those about the experimental evidence used to justify decisions, which should be explained in the Q3 resul

Taxonomy

Topics: Robotic Locomotion and Control · Human Motion and Animation · Reinforcement Learning in Robotics