Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse Reinforcement Learning

Jared Joselowitz; Ritam Majumdar; Arjun Jagota; Matthieu Bou; Nyal Patel; Satyapriya Krishna; Sonali Parbhoo

arXiv:2410.12491 · cs.CL · October 7, 2025

TL;DR

This paper applies inverse reinforcement learning to interpret large language models trained with human feedback, revealing their implicit reward functions and offering insights into their decision-making processes and alignment issues.

Contribution

It introduces a novel IRL-based method to recover LLM reward functions, providing new understanding and tools for improving model alignment and safety.

Findings

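01. Reward models extracted with IRL predict human preferences with up to 85% accuracy

02. The analysis highlights the non-identifiability of reward functions and a relationship between model size and interpretability

03. IRL-derived reward models can be used to fine-tune new LLMs, with comparable or improved performance on toxicity benchmarks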

Abstract

Large language models (LLMs) trained with Reinforcement Learning from Human Feedback (RLHF) have demonstrated remarkable capabilities, but their underlying reward functions and decision-making processes remain opaque. This paper introduces a novel approach to interpreting LLMs by applying inverse reinforcement learning (IRL) to recover their implicit reward functions. We conduct experiments on toxicity-aligned LLMs of varying sizes, extracting reward models that achieve up to 85% accuracy in predicting human preferences. Our analysis reveals key insights into the non-identifiability of reward functions, the relationship between model size and interpretability, and potential pitfalls in the RLHF process. We demonstrate that IRL-derived reward models can be used to fine-tune new LLMs, resulting in comparable or improved performance on toxicity benchmarks. This work provides a new lens for…
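To make the idea of recovering a reward function and scoring it by preference accuracy concrete, here is a minimal sketch. It is not the authors' method: it assumes a linear reward over hypothetical 3-dimensional features and fits it with a pairwise logistic (Bradley-Terry style) loss on synthetic data. The names `fit_pairwise_reward`, `pairwise_accuracy`, and `true_w` are illustrative only.

```python
import numpy as np

def fit_pairwise_reward(preferred, rejected, lr=0.1, steps=500):
    """Fit a linear reward r(x) = w . x so preferred items outscore rejected ones.

    Minimizes the mean of -log(sigmoid(r(preferred) - r(rejected)))
    with plain gradient descent. Assumed toy setup, not the paper's algorithm.
    """
    w = np.zeros(preferred.shape[1])
    diff = preferred - rejected                      # feature difference per pair
    for _ in range(steps):
        margin = diff @ w                            # reward gap per pair
        weight = 1.0 / (1.0 + np.exp(margin))        # equals 1 - sigmoid(margin)
        grad = -(diff * weight[:, None]).mean(axis=0)
        w -= lr * grad
    return w

def pairwise_accuracy(w, preferred, rejected):
    """Fraction of pairs where the learned reward ranks the preferred item higher."""
    return float(np.mean((preferred @ w) > (rejected @ w)))

# Synthetic preference data generated from a hidden "true" reward (assumption).
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
a = rng.normal(size=(400, 3))
b = rng.normal(size=(400, 3))
swap = (a @ true_w) < (b @ true_w)                   # True where b is the better item
preferred = np.where(swap[:, None], b, a)
rejected = np.where(swap[:, None], a, b)

# Fit on the first 300 pairs, evaluate on the held-out 100.
w = fit_pairwise_reward(preferred[:300], rejected[:300])
acc = pairwise_accuracy(w, preferred[300:], rejected[300:])
print(f"held-out preference accuracy: {acc:.2f}")
```

In this sketch, multiplying `w` by any positive constant leaves every pairwise ranking, and therefore the accuracy, unchanged. That is one simple way to see why preference data alone cannot pin down a unique reward function.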

Taxonomy

Topics: Artificial Intelligence in Law