TL;DR
CLaRE is a lightweight, efficient technique that quantifies fact entanglement in LLMs to predict and analyze ripple effects of model edits, improving post-edit evaluation and safety.
Contribution
Introduces CLaRE, a novel, fast, and resource-efficient method to identify potential ripple effects in LLMs using representation-level entanglement analysis.
Findings
CLaRE's entanglement scores correlate with ripple effects 62.2% better than baseline methods.
CLaRE is 2.74 times faster and uses 2.85 times less GPU memory than previous methods.
The approach enables scalable analysis and improved preservation of factual knowledge in LLMs.
Abstract
The static knowledge representations of large language models (LLMs) inevitably become outdated or incorrect over time. While model-editing techniques offer a promising solution by modifying a model's factual associations, they often produce unpredictable ripple effects, which are unintended behavioral changes that propagate even to the hidden space. In this work, we introduce CLaRE, a lightweight representation-level technique to identify where these ripple effects may occur. Unlike prior gradient-based methods, CLaRE quantifies entanglement between facts using forward activations from a single intermediate layer, avoiding costly backward passes. To enable systematic study, we prepare and analyse a corpus of 11,427 facts drawn from three existing datasets. Using CLaRE, we compute large-scale entanglement graphs of this corpus for multiple models, capturing how local edits propagate…
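The idea of scoring entanglement from forward activations alone can be illustrated with a minimal sketch. The abstract does not spell out the exact scoring function, so this example assumes cosine similarity between per-fact activation vectors. The names `entanglement_matrix` and `entanglement_graph`, the threshold value, and the toy activations are all hypothetical and are not taken from the paper. In practice each row would be a hidden state read from one intermediate layer during a forward pass over a fact's prompt, with no backward pass required.

```python
import numpy as np


def entanglement_matrix(activations: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between per-fact activation vectors.

    `activations` has shape (num_facts, hidden_dim). Cosine similarity is
    an assumed stand-in for the paper's entanglement score.
    """
    norms = np.linalg.norm(activations, axis=1, keepdims=True)
    unit = activations / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def entanglement_graph(scores: np.ndarray, threshold: float) -> dict:
    """Adjacency list linking each fact to others scoring at or above `threshold`."""
    n = scores.shape[0]
    return {
        i: [j for j in range(n) if j != i and scores[i, j] >= threshold]
        for i in range(n)
    }


# Toy activations standing in for single-layer hidden states (hypothetical data).
acts = np.array([
    [1.0, 0.0, 0.0],  # fact 0
    [0.9, 0.1, 0.0],  # fact 1, nearly parallel to fact 0
    [0.0, 1.0, 0.0],  # fact 2, orthogonal to fact 0
])

scores = entanglement_matrix(acts)
graph = entanglement_graph(scores, threshold=0.8)  # threshold chosen arbitrarily
print(graph)
```

Under these assumptions, an edit to fact 0 would be flagged as likely to affect fact 1 but not fact 2. Because only forward activations are needed, the score matrix for a whole corpus is a single matrix product once the activations are cached.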
