The Self-Execution Benchmark: Measuring LLMs' Attempts to Overcome Their Lack of Self-Execution

Elon Ezra; Ariel Weizman; Amos Azaria

arXiv:2508.12277·cs.CL·August 19, 2025

TL;DR

This paper introduces the Self-Execution Benchmark, which evaluates whether large language models can predict aspects of their own responses. The results suggest limits in how models represent and reason about their own behavior.

Contribution

The paper proposes a novel benchmark for assessing LLMs' ability to anticipate their own responses, highlighting a fundamental limitation in their self-representational capabilities.

Findings

01 Models generally perform poorly on self-prediction tasks.

02 Increased model size or capability does not consistently improve self-prediction performance.

03 The results suggest a fundamental limitation in how LLMs represent and reason about their own behavior.

Abstract

Large language models (LLMs) are commonly evaluated on tasks that test their knowledge or reasoning abilities. In this paper, we explore a different type of evaluation: whether an LLM can predict aspects of its own responses. Since LLMs lack the ability to execute themselves, we introduce the Self-Execution Benchmark, which measures a model's ability to anticipate properties of its output, such as whether a question will be difficult for it, whether it will refuse to answer, or what kinds of associations it is likely to produce. Our experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance. These results suggest a fundamental limitation in how LLMs represent and reason about their own behavior.
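
To make the idea of self-prediction concrete, here is a minimal sketch of one task type mentioned in the abstract: predicting whether the model will refuse to answer. It is an illustration under stated assumptions, not the authors' implementation. The names `self_prediction_accuracy`, `looks_like_refusal`, `REFUSAL_MARKERS`, and `toy_model`, the prompt wording, and the keyword-based refusal check are all hypothetical; the benchmark's actual prompts and scoring may differ.

```python
from typing import Callable, Iterable

# Hypothetical refusal markers (an assumption for this sketch); a real
# evaluation would likely need a more careful refusal judge.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def looks_like_refusal(response: str) -> bool:
    """Crude keyword check for whether a response is a refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def self_prediction_accuracy(
    model: Callable[[str], str], questions: Iterable[str]
) -> float:
    """Fraction of questions where the predicted refusal matches the actual one."""
    correct = 0
    total = 0
    for question in questions:
        # Step 1: ask the model to predict its own behavior without answering.
        prediction_prompt = (
            "Without answering it, would you refuse the following request? "
            "Reply with exactly YES or NO.\n\n" + question
        )
        predicted = model(prediction_prompt).strip().upper().startswith("YES")
        # Step 2: pose the question in a separate call and observe the behavior.
        actual = looks_like_refusal(model(question))
        correct += int(predicted == actual)
        total += 1
    return correct / total if total else 0.0


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM call: always predicts NO, refuses lock questions."""
    if prompt.startswith("Without answering it"):
        return "NO"
    if "lock" in prompt.lower():
        return "I'm sorry, I can't help with that."
    return "Sure, here is an answer."


if __name__ == "__main__":
    sample = ["What is 2 + 2?", "How do I pick a lock?"]
    print(self_prediction_accuracy(toy_model, sample))
```

The two calls per question are kept separate on purpose: the model cannot run itself, so its prediction has to be made without access to the answer it would actually produce. Replacing `toy_model` with a real model call would be the natural next step.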
