Robustness in the Face of Partial Identifiability in Reward Learning

Filippo Lazzati; Alberto Maria Metelli

arXiv:2501.06376·cs.LG·September 16, 2025

Open Access · 3 Reviews

TL;DR

This paper addresses the challenge of partial reward identifiability in Reward Learning by proposing a robust framework that maximizes worst-case performance, with theoretical guarantees and a practical algorithm called Rob-ReL.

Contribution

The paper introduces a general framework for quantifying the performance loss caused by reward ambiguity and develops Rob-ReL, a robust algorithm with theoretical complexity guarantees.

Findings

01

Rob-ReL effectively handles reward ambiguity in preference-based reward learning.

02

Rob-ReL comes with theoretical guarantees on its sample and iteration complexity.

03

Proof-of-concept experiments demonstrate robustness in partial identifiability scenarios.

Abstract

In Reward Learning (ReL), we are given feedback on an unknown target reward, and the goal is to use this information to recover it in order to carry out some downstream application, e.g., planning. When the feedback is not informative enough, the target reward is only partially identifiable, i.e., there exists a set of rewards, called the feasible set, that are equally plausible candidates for the target reward. In these cases, the ReL algorithm might recover a reward function different from the target reward, possibly leading to a failure in the application. In this paper, we introduce a general ReL framework that makes it possible to quantify the drop in "performance" suffered in the considered application because of identifiability issues. Building on this, we propose a robust approach to address the identifiability problem in a principled way, by maximizing the "performance" with respect to…
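The worst-case idea described in the TL;DR and abstract can be illustrated with a minimal sketch. This is not Rob-ReL and not the paper's formulation: it assumes a finite feasible set of rewards and a finite set of choices, and every name in it (`FEASIBLE_REWARDS`, `robust_choice`, `single_reward_choice`, `worst_case_values`) is hypothetical, introduced only for illustration.

```python
import numpy as np

# Hypothetical toy setup (not from the paper): rows are candidate rewards in
# the feasible set, columns are the choices available to us (3 actions).
FEASIBLE_REWARDS = np.array([
    [1.0, 0.0, 0.6],  # candidate reward A
    [0.0, 1.0, 0.6],  # candidate reward B
])


def worst_case_values(rewards):
    """Performance of each choice under its least favourable feasible reward."""
    return np.asarray(rewards, dtype=float).min(axis=0)


def robust_choice(rewards):
    """Pick the choice that maximizes worst-case performance (maximin)."""
    worst = worst_case_values(rewards)
    best = int(worst.argmax())
    return best, float(worst[best])


def single_reward_choice(rewards, picked):
    """Non-robust baseline: commit to one feasible reward and optimize it."""
    return int(np.asarray(rewards, dtype=float)[picked].argmax())


if __name__ == "__main__":
    choice, value = robust_choice(FEASIBLE_REWARDS)
    naive = single_reward_choice(FEASIBLE_REWARDS, picked=0)
    naive_value = float(worst_case_values(FEASIBLE_REWARDS)[naive])
    print("robust choice:", choice, "worst-case value:", value)
    print("naive choice:", naive, "worst-case value:", naive_value)
```

In this toy instance, committing to candidate A selects action 0, which is worth 0 if candidate B turns out to be the target reward. The maximin rule selects action 2 instead, which is worth 0.6 under either candidate.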

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

I think the paper has a number of strengths:

1. The paper addresses an *important problem*: improving the safety of reward learning, which is a salient problem for modern AI deployments.
2. The *novel formulation* of robustness for reward learning adds upon prior work addressing this problem. The uninformativeness measure is an interesting way to quantify the difficulty of doing reward learning.
3. The formulation is *very general* and explicitly considers three important kinds of reward learni

Weaknesses

### Scalability

The proposed algorithm is polynomial in the size of the state space. This is a limited weakness, as it is common to some of the prior literature. However, some reward learning methods have been shown to work in realistic and large or continuous state spaces. For example, Christiano et al.'s (2017) reward learning has been applied to text settings (large, discrete state spaces) and physics simulations (continuous state spaces). Laidlaw et al. (2025) applied CIRL to a large game.

Reviewer 02 · Rating 8 · Confidence 5

Strengths

**Rigorous Theoretical Analysis.** Theorem 5.3 provides polynomial sample and iteration complexity bounds under reasonable assumptions (Slater's condition). The proof technique combining visitation distribution estimation errors with primal-dual subgradient convergence is sound. The use of RF-Express for minimax-optimal reward-free exploration is appropriate.

**Clear Presentation and Organization.** The paper is well-structured with motivation, framework, approach, algorithm, and theory present

Weaknesses

**Limited treatment of function approximation.** The tabular setting with explicit state-action representation limits applicability to high-dimensional problems. While the authors mention neural network parameterization in related work, Rob-ReL does not incorporate function approximation

Reviewer 03 · Rating 8 · Confidence 3

Strengths

- The formalization, explanation, and clarity of the ReL problem as a pair $(\mathcal{F},g)$ is very well written, general, relevant, and well positioned against related work. A key piece is how the ReL problem is reformulated to finding the optimal object $x\in\mathcal{X}_g$ for some application $g$, when given the uncertainty set of rewards $\mathcal{R}_{\mathcal{F}}$.
- The metric used to calculate the loss, $\mathcal{I}_{\mathcal{F},g}$, or how "uninformative" $\mathcal{F}$ is for application $g$, is ve

Weaknesses

- While the authors present a powerful, general framework, they do not address tractability concerns. The limitations section significantly downplays this fact and addresses it only in broad terms.
- The experiments, though useful in demonstrating and verifying the authors' theoretical results, do not have any baselines to compare against. For example, in section 4, existing approaches are discussed. It would be interesting to see how the proposed algorithm compares against these non-robust baselines.

Taxonomy

Topics: Fault Detection and Control Systems · Neural Networks and Applications

Methods: Sparse Evolutionary Training