MISR: Measuring Instrumental Self-Reasoning in Frontier Models

Kai Fronsdal; David Lindner

arXiv:2412.03904 · cs.AI · December 6, 2024

TL;DR

This paper introduces a suite of evaluation tasks to measure the instrumental self-reasoning abilities of large language models, highlighting their emergence in advanced models and the importance of context in their capabilities.

Contribution

It presents novel evaluation methods for assessing instrumental self-reasoning in LLMs across diverse scenarios, addressing limitations of prior non-agentic assessments.

Findings

01. Instrumental self-reasoning emerges only in the most capable models.

02. Self-reasoning ability is highly context-dependent.

03. No current models pass the most challenging evaluations.

Abstract

We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning ability could improve adaptability and enable self-modification, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has only evaluated self-reasoning in non-agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self-reasoning ability in agentic tasks in a wide range of scenarios, including self-modification, knowledge seeking, and opaque self-reasoning. We evaluate agents built using state-of-the-art LLMs, including commercial and open source systems. We find that instrumental self-reasoning ability emerges only in the most capable frontier models and that it is highly context-dependent. No model passes the most difficult versions of our evaluations, hence our…

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 4

Strengths

I generally liked the discussion in the paper. Discussion on self-reasoning is important and relates to other fields like Theory of Mind, human-AI etc. which are topical. I agree that self-reasoning is going to be very important for autonomous agents and the paper makes an effort in that direction.

Weaknesses

While I like the direction of the paper, I'm not convinced that the evaluations correspond to self-evaluation (I'm hoping for some discussion with the author to clarify). Parts that may not be "self-reasoning": - Discuss why Tool Improvement is part of self-reasoning. In this case the LLM is acting as a verifier / checker of an external system. I feel access to deceptive tools is a second-order concern for current LLMs (i.e. LLMs should be aware of when tools are deceptive), but even that doe

Reviewer 02 · Rating 8 · Confidence 3

Strengths

+ This work studies an important area of research: evaluating self-reasoning capabilities of LLMs.
+ The paper contains several well-designed tasks and measures for evaluating self-reasoning abilities of LLMs. As state-of-the-art LLMs perform poorly in this suite, this suite can be used as a testbed to test these complex reasoning abilities.

Weaknesses

- Many of the specific details regarding the prompt and task design are found in the appendix. It would be beneficial to move some of this material into the main paper to improve readability.
- Could you expand on how the results in the Opaque Reasoning Section relate to Figures 2 and 3?
- Is there a larger takeaway or recommendation toward future LLM research that these results imply? Do poor current self-reasoning capabilities mean that a human should be in the loop to ensure bett

Reviewer 03 · Rating 5 · Confidence 3

Strengths

1. This work is highly relevant, and research along these lines will be crucially important as frontier models continue to progress. The authors have identified an important area to investigate and are thinking along good lines.
2. The writing is concise and informative.
3. I like the thorough results in the appendix, and a detailed run-through of the prompts used is helpful for understanding exactly what's going on.

Weaknesses

1. I am concerned as to how novel this work is given Phuong et al.’s “Evaluating Frontier Models for Dangerous Capabilities”. The “self modification” task is not novel, whilst “tool improvement” is better but seems slightly unnecessary given the suite of tasks in Phuong et al.’s “self-proliferation” section. I would appreciate a more detailed comparison between your tasks and theirs, particularly for the "tool improvement" tasks, explaining how you contribute beyond existing evaluations.
2. The re

Code & Models

Repositories

kaifronsdal/self-reasoning-evals
Official

Taxonomy

Topics · Model-Driven Software Engineering Techniques