Residual-MPPI: Online Policy Customization for Continuous Control
Pengcheng Wang, Chenran Li, Catherine Weaver, Kenta Kawamoto, Masayoshi Tomizuka, Chen Tang, Wei Zhan

TL;DR
Residual-MPPI is an online planning algorithm that enables real-time customization of continuous control policies without retraining, effectively adapting to new metrics and scenarios such as high-level racing tasks.
Contribution
The paper introduces Residual-MPPI, a novel online planning method that allows zero-shot and few-shot policy customization in continuous control tasks without access to original training data.
Findings
Successfully customized a racing agent in the Gran Turismo Sport (GTS) environment
Effective in zero-shot and few-shot online policy adaptation
Code and demos available online
Abstract
Policies developed through Reinforcement Learning (RL) and Imitation Learning (IL) have shown great potential in continuous control tasks, but real-world applications often require adapting trained policies to unforeseen requirements. While fine-tuning can address such needs, it typically requires additional data and access to the original training metrics and parameters. In contrast, an online planning algorithm, if capable of meeting the additional requirements, can eliminate the necessity for extensive training phases and customize the policy without knowledge of the original training scheme or task. In this work, we propose a generic online planning algorithm for customizing continuous-control policies at the execution time, which we call Residual-MPPI. It can customize a given prior policy on new performance metrics in few-shot and even zero-shot online settings, given access to…
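To make the idea of planning around a prior policy concrete, the following is a minimal sketch of one MPPI-style planning step. It is not the authors' implementation. It assumes a known dynamics model, a Gaussian prior policy with fixed standard deviation, and a per-step objective that adds the prior's log-likelihood to an add-on reward. The names `prior_mean`, `dynamics`, `add_on_reward`, `prior_log_prob`, and `plan_action`, the toy 1-D system, and all parameter values are hypothetical.

```python
import numpy as np

def prior_mean(state):
    # Hypothetical prior policy mean: push the state toward the origin.
    return -0.5 * state

def dynamics(state, action):
    # Hypothetical known dynamics: a simple integrator.
    return state + 0.1 * action

def add_on_reward(state, action):
    # Hypothetical add-on objective: penalize large actions.
    return -np.sum(action ** 2, axis=-1)

def prior_log_prob(state, action, prior_std=0.3):
    # Gaussian log-likelihood of the action under the prior, up to a constant.
    return -0.5 * np.sum(((action - prior_mean(state)) / prior_std) ** 2, axis=-1)

def plan_action(state, horizon=10, num_samples=256, noise_std=0.2,
                temperature=1.0, prior_weight=1.0, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    dim = state.shape[0]

    # Nominal action sequence: roll out the prior policy mean.
    nominal = np.zeros((horizon, dim))
    s = state.copy()
    for t in range(horizon):
        nominal[t] = prior_mean(s)
        s = dynamics(s, nominal[t])

    # Sample perturbed action sequences around the nominal one.
    noise = rng.normal(0.0, noise_std, size=(num_samples, horizon, dim))
    actions = nominal[None] + noise

    # Score each sequence with add-on reward plus weighted prior log-likelihood.
    returns = np.zeros(num_samples)
    states = np.repeat(state[None], num_samples, axis=0)
    for t in range(horizon):
        a = actions[:, t]
        returns += add_on_reward(states, a) + prior_weight * prior_log_prob(states, a)
        states = dynamics(states, a)

    # Softmax weights over sampled sequences (shifted for numerical stability).
    weights = np.exp((returns - returns.max()) / temperature)
    weights /= weights.sum()

    # Weighted average of the sequences; execute only the first action.
    planned = np.einsum("k,khd->hd", weights, actions)
    return planned[0], weights
```

Calling `plan_action(np.array([1.0]))` returns the first planned action and the sample weights. In this sketch, `prior_weight` trades off staying close to the prior against the add-on objective.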
Peer Reviews
Decision: ICLR 2025 Poster
- The paper is well-written and easy to follow.
- The methodology is novel and well-structured.
- The approach is validated in complex environments, including MuJoCo and GTS.
- The paper lacks SOTA baselines for policy customization. For instance, methods such as those in [1, 2] demonstrate few-shot adaptability to new environments without an additional parameter-training phase. A comparison with such methods would strengthen the evaluation.
[1] Xu, Mengdi, et al. "Prompting decision transformer for few-shot policy generalization." *International Conference on Machine Learning*. PMLR, 2022.
[2] Liu, Jinxin, et al. "Ceil: Generalized contextual imitation learning." *Adv
* [Originality] The paper attempts to integrate residual Q-learning into Model Predictive Path Integral (MPPI) control and shows some promising results.
* [Quality and clarity] The paper is well written, and the structure is good in general. It is good to see that the empirical evaluation has been conducted on multiple domains and that relatively thorough results are reported.
* [Significance] The proposed method considers an important problem of adapting a policy to new setti
### [Empirical evaluation appears to be weak]
1. The empirical performance of the proposed method, Residual-MPPI, is quite similar to the heuristic modification of MPPI, i.e., Greedy-MPPI. On the main metrics, Full-task and Add-on Task, the reported performance of Residual-MPPI is approximately the same as that of Greedy-MPPI. The paper pointed out one difference on Ant Full Task. But it is unclear how the total reward was computed here and why this total reward should matter more than the add-
- The work is well written, well structured, and easy to follow. The motivation and the proposed method are also clear.
- The evaluation setup is interesting: it covers MuJoCo benchmarks (which, although toy problems, are not trivial for continuous control) as well as GTS (which is a more complex setup for control). Therefore, the presented results are grounded in solid benchmarks.
- The discussion in Section 4.2 is very illuminating, and convincingly justifies the failures of the baselines in c
While my concerns are not major, there are some suggestions that could improve the clarity of the results in the paper:
- In Section 5.1, the paper argues that the metrics to evaluate on GTS are lap time and off-course steps since the rewards are complex. Nonetheless, I believe the work should provide both the prior task reward and the add-on reward for each method, as in the MuJoCo case, to understand how the method influences the returns that are optimized. Perhaps the considered small change
Taxonomy
Topics: Security and Verification in Computing · Simulation Techniques and Applications
