Translate Policy to Language: Flow Matching Generated Rewards for LLM Explanations

Xinyi Yang; Liang Zeng; Heng Dong; Chao Yu; Xiaoran Wu; Huazhong Yang; Yu Wang; Milind Tambe; Tonghan Wang

arXiv:2502.12530·cs.CL·February 13, 2026

Open Access · 3 Reviews

TL;DR

This paper presents a framework that trains explanation-generating LLMs with rewards produced by continuous normalizing flows (CNFs) trained via flow matching, improving alignment with human judgments and the accuracy of predicting agent decisions.

Contribution

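It introduces a general-purpose reinforcement learning approach that uses distributional rewards generated by CNFs. The CNFs capture the pluralistic, probabilistic nature of human judgments about explanations and, under mild assumptions, provably bound deviations from the true human reward distribution when trained on noisy proxy rewards from LLMs.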

Findings

01

The generated explanations improve the accuracy of predicting agent decisions.

02

Explanations exhibit greater logical soundness and are more actionable.

03

Explanations impose a lower cognitive load than those from baseline methods.

Abstract

As humans increasingly share environments with diverse agents powered by RL, LLMs, and beyond, the ability to explain agent policies in natural language is vital for reliable coexistence. We introduce a general-purpose framework that trains explanation-generating LLMs via reinforcement learning from AI feedback, with distributional rewards generated by generative continuous normalizing flows (CNFs). CNFs capture the pluralistic and probabilistic nature of human judgments about explanations. Moreover, under mild assumptions, CNFs provably bound deviations from true human reward distributions when trained on noisy proxy rewards from LLMs. We design a specialized CNF architecture that selectively attends to linguistic cues in the decision context and explanations when generating rewards. Human and LLM evaluators find that our method delivers explanations that enable more accurate…
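The flow-matching idea behind a generative reward model can be sketched on a toy problem. The snippet below is only an illustration under strong assumptions, not the paper's architecture: rewards are one-dimensional and unconditional (no decision context or explanation is attended to), the path between noise and reward is the straight line used by rectified flows, and the names `flow_matching_loss`, `sample_reward`, and `point_mass_velocity` are hypothetical. It shows the regression target a velocity field is trained against and how a reward is then generated by integrating that field from noise.

```python
import random

def flow_matching_loss(velocity, target_rewards, rng, n=1000):
    """Monte Carlo estimate of the straight-path flow matching loss."""
    total = 0.0
    for _ in range(n):
        x0 = rng.gauss(0.0, 1.0)          # noise sample
        x1 = rng.choice(target_rewards)   # sample from the target reward data
        t = rng.random() * 0.999          # keep t < 1 so the toy velocity stays finite
        xt = (1.0 - t) * x0 + t * x1      # point on the straight path from x0 to x1
        total += (velocity(xt, t) - (x1 - x0)) ** 2  # regress onto the path velocity
    return total / n

def sample_reward(velocity, rng, steps=50):
    """Generate one reward by Euler-integrating the velocity field from noise."""
    x = rng.gauss(0.0, 1.0)
    dt = 1.0 / steps
    for k in range(steps):
        x += dt * velocity(x, k * dt)
    return x

def point_mass_velocity(c):
    """Closed-form optimal velocity when every target reward equals c (toy case)."""
    return lambda x, t: (c - x) / (1.0 - t)

rng = random.Random(0)
v = point_mass_velocity(0.8)
print(flow_matching_loss(v, [0.8], rng))  # close to zero for the optimal field
print(sample_reward(v, rng))              # lands close to 0.8
```

In a real system the closed-form field would be replaced by a learned network conditioned on the context and explanation, and the target rewards would come from a distribution rather than a single value.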

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

* I like the formulation of the rectified flow model for modeling rewards. In general, the setup does make sense to me for training a model to generate explanations of other policies, and the pluralistic/probabilistic view of explanations is reasonable.
* Outside of the choice of the benchmarks (see weakness below), I appreciate the thoroughness of the empirical results. The ablations over different backbones and rewards (Table 7-10) help better understand the method’s performance and the genera

Weaknesses

* I think the framing is a bit confusing. The stated goal is to explain a policy’s decision, where the policy is that of an RL agent. But the only actual “policy” that is explained is for the SMAC benchmark, with MMLU and MathQA being language-based multiple choice questions where the agent’s “policy” is simply selecting a single action out of a set. Thus, for these benchmarks there is no sequential decision making aspect (unless we are referring to sequential prediction of tokens, which doesn’t

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The use of a language-conditioned rectified flow as a reward model is novel.
- The empirical section is broad, with many benchmarks and baselines.
- Human evaluation is convincing.

Weaknesses

1. The positioning of the paper is hard to grasp. Stripped of its technical machinery, the method essentially trains a model that, given only the current context or prompt, produces a reasoning trace without ever emitting the final decision/action/answer. It is therefore unclear what the concrete utility of such an “Explanation LLM” is. The title “translate policy to language” leads the reviewer to expect that the goal is to explain an existing policy’s decision process, but the actual pipeline tra

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. The topic is interesting, and to my knowledge, this is the first work that applies a diffusion model to fit the reward for LLM explanations.
2. The experiments are abundant.
3. The paper provides a thorough analysis of the proposed algorithm, including an error bound for recovering the true human reward distribution (Theorem 1), which reinforces the soundness of the approach.

Weaknesses

1. In Section 3.1, the authors approximate the likelihood of producing multiple actions by averaging token logits. This approximation may introduce significant estimation error, especially when a single abnormal token contributes disproportionately.
2. The framework does not address the issue of long-term reward accumulation, which is a key consideration in RL-based settings.
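The reviewer's concern about averaging per-token scores can be made concrete with a small numeric example. The numbers below are invented for illustration and do not come from the paper, and `average_logprob` is a hypothetical helper rather than the authors' Section 3.1 computation. It shows how one abnormal token can dominate a mean, while a robust statistic such as the median barely moves.

```python
from statistics import mean, median

def average_logprob(token_logprobs):
    """Average per-token log-probabilities into a single sequence score."""
    return mean(token_logprobs)

# Hypothetical per-token log-probabilities for a four-token action string.
typical = [-0.1, -0.1, -0.1, -0.1]
with_outlier = [-0.1, -0.1, -0.1, -12.0]  # one abnormal token

print(average_logprob(typical))       # about -0.1
print(average_logprob(with_outlier))  # about -3.075, dominated by one token
print(median(with_outlier))           # about -0.1, unaffected by the outlier
```

Whether a robust alternative would suit the method depends on details not visible in this excerpt; the example only quantifies the sensitivity the reviewer describes.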

Taxonomy

Topics: Artificial Intelligence in Law · Business Process Modeling and Analysis