Reinforcement Learning for Aligning Large Language Models Agents with Interactive Environments: Quantifying and Mitigating Prompt Overfitting

Mohamed Salim Aissi; Clement Romac; Thomas Carta; Sylvain Lamprier; Pierre-Yves Oudeyer; Olivier Sigaud; Laure Soulier; Nicolas Thome

arXiv:2410.19920 · cs.LG · September 8, 2025

TL;DR

This paper investigates how reinforcement learning fine-tuning affects large language models' robustness to prompt variations in interactive environments. It shows that performance degrades on prompt formulations not seen during training and proposes a contrastive loss to improve generalization.

Contribution

It introduces a framework for analyzing prompt sensitivity after RL training and proposes a contrastive loss method to mitigate overfitting to specific prompts.

Findings

01 LLMs' performance degrades with unseen prompt formulations

02 Prompt sensitivity is linked to internal representations and salient tokens

03 Contrastive loss improves robustness and generalization

Abstract

Reinforcement learning (RL) is a promising approach for aligning the knowledge of large language models (LLMs) with sequential decision-making tasks. However, few studies have thoroughly investigated how fine-tuning LLM agents with RL in a specific environment affects their capabilities. In this paper, we propose a novel framework to analyze the sensitivity of LLMs to prompt formulations following RL training in a textual environment. Our findings reveal that the performance of LLMs degrades when they are faced with prompt formulations different from those used during the RL training phase. In addition, we analyze the source of this sensitivity by examining the model's internal representations and salient tokens. Finally, we propose to use a contrastive loss to mitigate this sensitivity and improve the robustness and generalization capabilities of LLMs.
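
The abstract does not spell out the contrastive loss, so the following is only a minimal sketch of the general idea, assuming a triplet-style objective: representations of the same environment state written under two prompt formulations are pulled together, while the representation of a different state is pushed away. The function names, the margin value, and the toy 3-dimensional vectors are all hypothetical and are not taken from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(anchor, positive, negative, margin=1.0):
    """Triplet-style contrastive loss (an assumed form, not the paper's).

    anchor:   representation of a state under prompt formulation 0
    positive: same state under a different prompt formulation
    negative: a different state
    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical toy representations (illustrative numbers only).
state_a_prompt_0 = [1.0, 0.0, 0.0]
state_a_prompt_1 = [0.9, 0.1, 0.0]  # same state, reworded prompt
state_b_prompt_0 = [0.0, 1.0, 0.0]  # different state

# Prompt-invariant case: the two formulations of state A are already close.
loss_aligned = contrastive_loss(state_a_prompt_0, state_a_prompt_1, state_b_prompt_0)

# Prompt-sensitive case: roles swapped, so the "same state" pair is far apart.
loss_misaligned = contrastive_loss(state_a_prompt_0, state_b_prompt_0, state_a_prompt_1)
```

In this toy setup the aligned case incurs no penalty and the misaligned case incurs a positive one. In an actual RL fine-tuning run such a term would presumably be added to the policy objective with some weighting, which the abstract does not specify.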

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems