PROF: An LLM-based Reward Code Preference Optimization Framework for Offline Imitation Learning

Shengjie Sun; Jiafei Lyu; Runze Liu; Mengbei Yan; Bo Liu; Deheng Ye; Xiu Li

arXiv:2511.13765·cs.LG·November 19, 2025


Open Access · 4 Reviews

TL;DR

PROF is an LLM-based framework that generates and refines reward function code for offline imitation learning. It automates reward refinement without environment interaction and outperforms recent baselines on D4RL datasets.

Contribution

The paper presents PROF, a new framework that uses LLMs to generate and refine reward functions from natural language and a single expert trajectory, with a novel ranking strategy called RPR.

Findings

01

PROF outperforms recent baselines on D4RL datasets.

02

RPR effectively assesses reward function quality without environment interaction.

03

The approach automates reward optimization, reducing reliance on manual tuning.
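To make the second finding concrete, the following is a toy sketch of how candidate reward functions could be ranked from data alone, with no environment steps and no RL training. It is not the paper's RPR: the scoring rule (the fraction of noise-perturbed copies of the expert trajectory that the expert out-scores), the names `toy_dominance_score` and `perturb`, the noise scale, and the 1-D task are all assumptions made for illustration.

```python
import random


def trajectory_return(reward_fn, trajectory):
    """Sum a candidate reward over (observation, action) pairs."""
    return sum(reward_fn(obs, act) for obs, act in trajectory)


def perturb(trajectory, noise_scale, rng):
    """Return a copy of the trajectory with Gaussian noise on obs and act."""
    return [
        (obs + rng.gauss(0.0, noise_scale), act + rng.gauss(0.0, noise_scale))
        for obs, act in trajectory
    ]


def toy_dominance_score(reward_fn, expert_trajectory,
                        n_perturbed=200, noise_scale=0.5, seed=0):
    """Hypothetical score: how often the expert strictly beats a noisy copy."""
    rng = random.Random(seed)
    expert_return = trajectory_return(reward_fn, expert_trajectory)
    wins = sum(
        expert_return
        > trajectory_return(reward_fn, perturb(expert_trajectory, noise_scale, rng))
        for _ in range(n_perturbed)
    )
    return wins / n_perturbed


# Hypothetical 1-D task: the expert holds the observation at 0 with zero action.
expert_trajectory = [(0.0, 0.0)] * 10

# Two hand-written candidates standing in for LLM-generated reward code.
candidates = {
    "stay_near_origin": lambda obs, act: -(obs ** 2) - 0.1 * act ** 2,
    "constant": lambda obs, act: 1.0,
}

scores = {
    name: toy_dominance_score(fn, expert_trajectory)
    for name, fn in candidates.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

In this toy setup the uninformative constant reward can never rank the expert above a perturbed copy, so it scores 0, while the reward that penalizes deviation from the expert's behavior scores 1. A real ranking signal would need to be more careful, since a reward can score well on such a check and still be misaligned with true task performance.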

Abstract

Offline imitation learning (offline IL) enables training effective policies without requiring explicit reward annotations. Recent approaches attempt to estimate rewards for unlabeled datasets using a small set of expert demonstrations. However, these methods often assume that the similarity between a trajectory and an expert demonstration is positively correlated with the reward, which oversimplifies the underlying reward structure. We propose PROF, a novel framework that leverages large language models (LLMs) to generate and improve executable reward function codes from natural language descriptions and a single expert trajectory. We propose Reward Preference Ranking (RPR), a novel reward function quality assessment and ranking strategy without requiring environment interactions or RL training. RPR calculates the dominance scores of the reward functions, where higher scores indicate…
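The similarity assumption that the abstract criticizes can be sketched as follows. This is a minimal illustration, not the formula of any specific baseline: the exponential form, the `beta` parameter, the function name `similarity_reward`, and the 2-D toy data are assumptions chosen to show the idea that reward decays with distance to the nearest expert transition.

```python
import math


def similarity_reward(transition, expert_transitions, beta=1.0):
    """Reward = exp(-beta * Euclidean distance to the nearest expert transition)."""
    nearest = min(math.dist(transition, e) for e in expert_transitions)
    return math.exp(-beta * nearest)


# Toy 2-D "transitions" (hypothetical data, not taken from D4RL).
expert = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]

on_expert = similarity_reward((1.0, 1.0), expert)  # lies on the expert data
far_away = similarity_reward((5.0, 1.0), expert)   # far from every expert point

print(on_expert, far_away)
```

Under such a rule, any transition far from the demonstration receives a low reward regardless of whether it is actually good for the task, which is the oversimplification the abstract points to.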

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 3

Strengths

* **Originality:** Introduces a formal reward-ranking signal (RPR) that does not require environment interaction; conceptually novel for offline IL.
* **Empirical quality:** Evaluated on three diverse D4RL domains with multiple seeds and standard deviations reported. Competitive with or superior to recent baselines (SEABO, OTR) under standard normalized-score protocols.
* **Transparency:** Provides clear algorithmic structure (generation → ranking → optimization) and example prompts; includes iteration ablation

Weaknesses

1. **Missing component ablations:** No sensitivity study for key RPR hyperparameters $(\delta, \alpha_o, \alpha_a, H)$ or the noise-perturbation mechanism. This limits understanding of what drives performance.
2. **Reward hacking risk:** At iteration $T=3$, the LLM fabricates variables, leading to spurious but high-ranked rewards. This indicates misalignment between the ranking score and real performance.
3. **Baseline parity unclear:** Baseline results (SEABO, OTR) seem taken from prior work rather than

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The paper automates reward function generation and optimization using LLMs without requiring environment interaction.
- It achieves strong empirical performance, surpassing or matching baselines across D4RL tasks.

Weaknesses

Offline algorithms make sense when interactions with the environment are costly. Here, to prompt the LLM for the reward functions, environment information is fed into the LLM. In the examples in the paper, this seems to be just the observation and action space (according to Appendix C.2), which is fine. But I am not sure whether this amount of information would be sufficient in more realistic and complex environments. It may be necessary to input the dynamics as well (e.g., step function in the gym conve

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The authors evaluated the algorithm on multiple environments as well as using multiple LLMs.

Weaknesses

1. The authors compare their method only against non-state-of-the-art algorithms [1,2] in this field. When considering more recent work in this area [1,2], the proposed method would likely underperform. Therefore, the claim of outperforming all baselines appears to be based on a selective choice of comparisons, which gives the impression of cherry-picking baselines to favor their approach.
2. The authors mention that a single iteration is sufficient for full training. In that case, how does thi

Reviewer 04 · Rating 4 · Confidence 3

Strengths

1. Leveraging LLMs to generate and improve executable reward function code is promising, especially for the offline RL setting.
2. Experimental results show that PROF achieves similar or better performance compared with some baselines on the D4RL benchmark.
3. The integration of LLMs, preference ranking, and textual optimization to address the reward design problem is novel and effective.

Weaknesses

1. While experimental results in some cases demonstrate the effectiveness of the presented framework, theoretical guarantees for the performance improvements are lacking.
2. Only one expert trajectory is used in the paper, whereas in reality optimal behaviors may be diverse and not necessarily close to a limited set of demonstrations (as the paper itself also claims). Experiments with multiple expert demonstrations are suggested.
3. With the help of LLM, some prior kno


Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification