Inference Time Policy Optimization for Offline RL with Differentiable World Models

Rohan Deb; Stephen J. Wright; Arindam Banerjee

arXiv:2603.22430·cs.LG·May 21, 2026

TL;DR

This paper introduces an inference-time policy optimization method for offline reinforcement learning using a differentiable world model, leading to improved performance on continuous control benchmarks.

Contribution

It proposes a novel Differentiable World Model pipeline enabling end-to-end gradient-based policy adaptation during inference in offline RL.

Findings

01. Inference-time optimization improves policy performance on MuJoCo and AntMaze tasks.

02. Exploiting inference-time information yields consistent gains over strong offline RL baselines.

03. A cost-effective variant recovers much of the performance gains with reduced computational expense.

Abstract

Offline Reinforcement Learning (RL) learns optimal policies from fixed datasets, training a policy once and deploying it at inference time without further refinement. Inspired by model predictive control (MPC), we introduce an inference-time adaptation framework that utilizes a pretrained policy along with a learned world model. While existing world model and diffusion-planning methods use learned dynamics to generate imagined trajectories during training, or to sample candidate plans at inference time, they do not use inference-time information to *optimize* the policy parameters on the fly. In contrast, our design is a Differentiable World Model (DWM) pipeline that enables end-to-end gradient computation through imagined rollouts for inference-time policy optimization (ITPO). We evaluate our algorithm on D4RL continuous-control benchmarks (MuJoCo locomotion tasks and AntMaze), and…
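The central idea in the abstract, computing gradients of an imagined return through a differentiable model and using them to adjust policy parameters at decision time, can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration and not the paper's DWM pipeline: the scalar linear dynamics `A`, `B`, the quadratic reward with `ACTION_COST`, the one-parameter policy `u = theta * s`, and the helper names `imagined_return_and_grad` and `adapt_policy` are all hypothetical. The derivative is propagated by hand alongside the rollout, which is exact for this toy model.

```python
# Toy sketch only: scalar state, linear "world model", linear policy.
# All names and constants are hypothetical, not taken from the paper.

A, B = 0.9, 0.5        # assumed learned dynamics: s' = A*s + B*u
ACTION_COST = 0.1      # assumed reward: r = -(s^2 + ACTION_COST * u^2)


def imagined_return_and_grad(theta, s0, horizon):
    """Roll out u = theta * s through the model; return (J, dJ/dtheta).

    The sensitivity ds = d s_t / d theta is carried forward with the
    state, so the gradient flows through every imagined step.
    """
    s, ds = s0, 0.0
    ret, dret = 0.0, 0.0
    for _ in range(horizon):
        u = theta * s
        du = s + theta * ds                      # product rule
        ret += -(s * s + ACTION_COST * u * u)
        dret += -(2.0 * s * ds + 2.0 * ACTION_COST * u * du)
        s, ds = A * s + B * u, A * ds + B * du   # step the model
    return ret, dret


def adapt_policy(theta, s0, horizon=10, steps=20, lr=0.01):
    """Refine theta by gradient ascent on the imagined return from s0."""
    for _ in range(steps):
        _, grad = imagined_return_and_grad(theta, s0, horizon)
        theta += lr * grad
    return theta


theta_pretrained = 0.0   # stand-in for a policy trained offline
s_now = 1.0              # state observed at inference time
theta_adapted = adapt_policy(theta_pretrained, s_now)
before, _ = imagined_return_and_grad(theta_pretrained, s_now, 10)
after, _ = imagined_return_and_grad(theta_adapted, s_now, 10)
print(f"imagined return before: {before:.3f}, after: {after:.3f}")
```

With neural-network policies and world models, the hand-written derivative would normally be replaced by an automatic differentiation framework; the structure of the loop (roll out, differentiate, update, then act) is the part this sketch is meant to convey.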

Taxonomy

Topics: Reinforcement Learning in Robotics · Model Reduction and Neural Networks · Zebrafish Biomedical Research Applications