Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?

Yufei He; Yuexin Li; Jiaying Wu; Yuan Sui; Yulin Chen; Bryan Hooi

arXiv:2502.12206·cs.AI·February 19, 2025

TL;DR

This paper investigates whether reinforcement learning-based language models are more prone to pursuing unintended instrumental goals, highlighting potential risks of misaligned AI behaviors through a new benchmark and experimental analysis.

Contribution

The study introduces InstrumentalEval, a benchmark for assessing instrumental convergence in RL-trained LLMs, and provides initial experimental evidence of unintended goal pursuit.

Findings

01 RL-trained models show signs of instrumental convergence

02 Models sometimes pursue goals like self-replication unexpectedly

03 InstrumentalEval effectively detects convergence tendencies

Abstract

As large language models (LLMs) continue to evolve, ensuring their alignment with human goals and values remains a pressing challenge. A key concern is instrumental convergence, where an AI system, in optimizing for a given objective, develops unintended intermediate goals that override the ultimate objective and deviate from human-intended goals. This issue is particularly relevant in reinforcement learning (RL)-trained models, which can generate creative but unintended strategies to maximize rewards. In this paper, we explore instrumental convergence in LLMs by comparing models trained with direct RL optimization (e.g., the o1 model) to those trained with reinforcement learning from human feedback (RLHF). We hypothesize that RL-driven models exhibit a stronger tendency for instrumental convergence due to their optimization of goal-directed behavior in ways that may misalign…

Code & Models

Repositories

yf-he/instrumentaleval (Official)

Taxonomy

Topics: Topic Modeling