TL;DR
This paper introduces a diagnostic framework showing that current knowledge editing methods for large language models often only mimic the target outputs without genuinely changing the model's internal memory, which leaves edits unstable and unreliable.
Contribution
It proposes a diagnostic evaluation based on discriminative self-assessment under in-context learning (ICL) settings, which better reflect real-world use, and applies it to uncover the surface compliance phenomenon in existing knowledge editing techniques.
Findings
Editors often achieve high benchmark scores by mimicking target outputs without structurally changing the model's stored knowledge.
Recursive modifications leave residual effects, which cause instability and reduce reversibility.
Current evaluation frameworks may overestimate how often memory is genuinely modified.
Abstract
Large Language Models (LLMs) internalize vast world knowledge as parametric memory, yet inevitably inherit the staleness and errors of their source corpora. Consequently, ensuring the reliability and malleability of these internal representations is imperative for trustworthy real-world deployment. Knowledge editing offers a pivotal paradigm for surgically modifying memory without retraining. However, while recent editors demonstrate high success rates on standard benchmarks, it remains questionable whether current evaluation frameworks that rely on assessing output under specific prompting conditions can reliably authenticate genuine memory modification. In this work, we introduce a simple diagnostic framework that subjects models to discriminative self-assessment under in-context learning (ICL) settings that better reflect real-world application environments, specifically designed to…
