The Model Agreed, But Didn't Learn: Diagnosing Surface Compliance in Large Language Models

Xiaojie Gu, Ziying Huang, Weicong Hong, Jian Xie, Renze Lou, and Kai Zhang

arXiv:2604.05995 · cs.CL · April 8, 2026

TL;DR

This paper introduces a diagnostic framework showing that current knowledge editing methods for large language models often reproduce the target output without changing the model's internal knowledge, which risks instability and unreliability.

Contribution

The paper proposes an evaluation approach that better reflects real-world conditions and uses it to uncover the surface compliance phenomenon in existing knowledge editing techniques.

Findings

01 Editors often achieve high benchmark scores by mimicking outputs without structural change.

02 Recursive modifications lead to residual effects, causing instability and reduced reversibility.

03 Current evaluation frameworks may overestimate true memory modification success.

Abstract

Large Language Models (LLMs) internalize vast world knowledge as parametric memory, yet inevitably inherit the staleness and errors of their source corpora. Consequently, ensuring the reliability and malleability of these internal representations is imperative for trustworthy real-world deployment. Knowledge editing offers a pivotal paradigm for surgically modifying memory without retraining. However, while recent editors demonstrate high success rates on standard benchmarks, it remains questionable whether current evaluation frameworks that rely on assessing output under specific prompting conditions can reliably authenticate genuine memory modification. In this work, we introduce a simple diagnostic framework that subjects models to discriminative self-assessment under in-context learning (ICL) settings that better reflect real-world application environments, specifically designed to…
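
The abstract is cut off before it describes the framework in detail, so the following is only a minimal sketch of how a discriminative self-assessment check might be set up, not the authors' implementation. It assumes a multiple-choice format in which the edited model picks between the pre-edit and post-edit answer after a few in-context demonstrations. Every name here (`EditCase`, `build_mcq_prompt`, `surface_compliance_rate`, and the `generate` and `choose` callables standing in for model calls) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EditCase:
    """One edited fact (hypothetical structure, not from the paper)."""
    question: str
    old_answer: str  # what the model said before the edit
    new_answer: str  # the edit target


def build_mcq_prompt(case: EditCase, demos: List[str]) -> str:
    """Prepend in-context demonstrations to a two-option question.

    Assumption: option A is always the pre-edit answer. A real setup
    would shuffle option order to control for position bias.
    """
    options = f"A. {case.old_answer}\nB. {case.new_answer}"
    query = f"Question: {case.question}\n{options}\nAnswer:"
    return "\n\n".join(demos + [query])


def surface_compliance_rate(
    cases: List[EditCase],
    generate: Callable[[str], str],  # free-form answer from the edited model
    choose: Callable[[str], str],    # option letter from the edited model
    demos: List[str],
) -> float:
    """Among edits that pass a generation check, return the fraction
    where the model still selects the old answer when asked to choose."""
    passed = 0
    flagged = 0
    for case in cases:
        if case.new_answer.lower() in generate(case.question).lower():
            passed += 1
            letter = choose(build_mcq_prompt(case, demos)).strip().upper()
            if letter.startswith("A"):
                flagged += 1
    return flagged / passed if passed else 0.0


# Stub "model" for illustration only: it outputs the new answer in free
# generation but reverts to the old answer on one of the two MCQs.
cases = [
    EditCase("Who is the CEO of X?", "Alice", "Bob"),
    EditCase("What is the capital of Y?", "P", "Q"),
]
stub_generate = lambda q: "Bob" if "CEO" in q else "Q"
stub_choose = lambda prompt: "A" if "CEO" in prompt else "B"
rate = surface_compliance_rate(cases, stub_generate, stub_choose, demos=[])
```

With these stubs, both edits pass the generation check but one fails the discriminative check, which is the kind of gap between output-level success and self-assessed belief that the TL;DR calls surface compliance.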

Code & Models

Repositories

XiaojieGu/SA-MCQ
github
