Generalization Analogies: A Testbed for Generalizing AI Oversight to Hard-To-Measure Domains

Joshua Clymer; Garrett Baker; Rohan Subramani; Sam Wang

arXiv:2311.07723 · cs.AI · December 19, 2023 · 1 citation

TL;DR

This paper introduces the GENIES benchmark for studying how reward models generalize human feedback to hard-to-measure domains. It finds that standard fine-tuning does not teach reward models to evaluate instruction-following by default, and that interpretability techniques generalize better but still fail frequently.

Contribution

The paper crafts 69 distribution shifts spanning 8 categories to test reward model generalization, consolidates the 15 most challenging shifts into the GENIES benchmark, and compares interpretability methods to standard fine-tuning.

Findings

01

By default, reward models do not learn to evaluate instruction-following and instead favor personas that resemble internet text.

02

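Techniques for interpreting reward models' internal representations generalize better than standard fine-tuning, but still frequently fail to distinguish instruction-following from conflated behaviors.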

03

The 15 most challenging distribution shifts are consolidated into the GENIES benchmark.

Abstract

As AI systems become more intelligent and their behavior becomes more challenging to assess, they may learn to game the flaws of human feedback instead of genuinely striving to follow instructions; however, this risk can be mitigated by controlling how LLMs generalize human feedback to situations where it is unreliable. To better understand how reward models generalize, we craft 69 distribution shifts spanning 8 categories. We find that reward models do not learn to evaluate "instruction-following" by default and instead favor personas that resemble internet text. Techniques for interpreting reward models' internal representations achieve better generalization than standard fine-tuning, but still frequently fail to distinguish instruction-following from conflated behaviors. We consolidate the 15 most challenging distribution shifts into the GENeralization analogIES (GENIES) benchmark,…

Code & Models

Repositories

joshuaclymer/genies
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Data Quality and Management · Ethics and Social Impacts of AI