Massive Editing for Large Language Models via Meta Learning

Chenmien Tan; Ge Zhang; Jie Fu

arXiv:2311.04661 · cs.CL · January 26, 2024 · 5 cites

TL;DR

This paper introduces MALMEN, a scalable meta-learning approach for large language model editing that efficiently updates knowledge across various architectures and tasks, significantly surpassing existing methods in capacity and performance.

Contribution

MALMEN formulates the aggregation of parameter shifts as a least squares problem, enabling large-scale, multi-fact editing with limited memory and improving on prior hyper-network-based methods.

Findings

01. Capable of editing thousands of facts across different LMs.

02. Outperforms existing editors in editing capacity and accuracy.

03. Effective on multiple knowledge-intensive NLP tasks.

Abstract

Although large language models (LLMs) learn knowledge from pre-training corpora, that knowledge may be incorrect from the start or become outdated over time, so the language model (LM) must be corrected after training. A promising approach employs a hyper-network to generate parameter shifts, but existing hyper-networks scale poorly with the number of edits applied simultaneously. To mitigate this problem, we propose the MAssive Language Model Editing Network (MALMEN), which formulates parameter shift aggregation as a least squares problem and then updates the LM parameters using the normal equation. To edit multiple facts simultaneously within a limited memory budget, we separate the computation on the hyper-network from that on the LM, enabling arbitrary batch sizes on both networks. Our method is…
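The least squares aggregation mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes a single linear layer, a matrix `keys` whose columns are the layer inputs for each edited fact, and a matrix `targets` whose columns are the desired changes in the layer output (in the paper these would come from the hyper-network; here they are synthetic). The function name `aggregate_shift` and the ridge term `lam` are hypothetical choices for illustration. The summed rank-one baseline is likewise only a stand-in for a simple sum-based aggregation.

```python
import numpy as np


def aggregate_shift(keys, targets, lam=1e-2):
    """Solve min_S ||S @ keys - targets||_F^2 + lam * ||S||_F^2.

    keys:    (d_in, n)  one column per edited fact (assumed layer inputs)
    targets: (d_out, n) one column per edited fact (assumed output changes)
    Returns the parameter shift S with shape (d_out, d_in).
    """
    d_in = keys.shape[0]
    # Normal equation: S @ (K K^T + lam I) = T K^T
    gram = keys @ keys.T + lam * np.eye(d_in)  # symmetric, positive definite for lam > 0
    rhs = targets @ keys.T                     # shape (d_out, d_in)
    # gram is symmetric, so solve gram @ S^T = rhs^T and transpose back.
    return np.linalg.solve(gram, rhs.T).T


# Synthetic data: 100 "facts" for a layer mapping 8 inputs to 4 outputs.
rng = np.random.default_rng(0)
d_in, d_out, n = 8, 4, 100
keys = rng.normal(size=(d_in, n))
true_shift = rng.normal(size=(d_out, d_in))
targets = true_shift @ keys

# Least squares aggregation of all edits at once.
shift = aggregate_shift(keys, targets, lam=1e-8)

# Baseline for comparison: plain sum of per-fact rank-one updates d_i u_i^T.
naive_sum = targets @ keys.T

ls_residual = np.linalg.norm(shift @ keys - targets)
sum_residual = np.linalg.norm(naive_sum @ keys - targets)
print(f"least squares residual: {ls_residual:.3e}")
print(f"summed updates residual: {sum_residual:.3e}")
```

In this toy setting the targets are exactly realizable by one shift, so the least squares solution recovers it, while summing the per-fact updates lets them interfere with one another. Solving the linear system with `np.linalg.solve` avoids forming an explicit matrix inverse.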

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

- This work is tackling a well-motivated problem, scaling up knowledge editing approaches.
- The motivation behind the proposed approach (adjusting FFN weights, decomposing the optimization process) is clearly explained, and the solutions presented are reasonable.

Weaknesses

- The scope of the problem (scalability of MEND) could be narrow, and the proposed approach is only applicable for a specific knowledge editing approach.
- Based on the experimental results, it is difficult to assert that this approach is significantly better than all other knowledge editing approaches in terms of scalability (not only MEND).
- The poor LS score with GPT-J (6B) shows that this approach still edits unrelated facts.
- Qualitative analysis is not provided. It’s hard to see when/why

Reviewer 02 · Rating 10 (strong accept, should be highlighted at the conference) · Confidence 4

Strengths

- The paper provides plenty of technical details, and is fairly clear (though somewhat dense).
- The method is straightforward and intuitive. I am unclear about the broader applicability of memory editing, but the technical details and performance are sufficiently convincing to me that this is a meaningful contribution.

Weaknesses

- The paper requires quite a bit of background on MEND. This is not inherently a bad thing since the paper is basically a direct modification of MEND, and the paper already spends a good deal of space building the background, but I think providing higher-level intuition in the exposition could help.
- Section 4.2 wasn't very clear to me (in particular "truncating the back-propagation at the end of linear layers"). Figure 2 was significantly clearer, and I wonder if the authors could revisit the

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- The proposed combination of the least squares method and loss-based updating for massive editing is quite interesting and novel.
- The truncated backprop algorithm is solidly designed to improve efficiency, which is also quite interesting.
- The experimental results show that the proposed method improves on MEND or MEMIT under various settings.

Weaknesses

- The least squares solution is not compared against simple sum-based aggregation. To demonstrate the effect of the proposed method, this simplified aggregation needs to be compared.
- The description in Section 4.2 is very dense, making the details hard to follow. In particular, Figure 2 provides the overall backprop flow, but why is the training algorithm using truncated backprop not explicitly and clearly provided?
- In GPT-J (6B), the proposed method doesn’t improve MEMIT, in terms o

Code & Models

Repositories

chenmientan/malmen
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Discriminative Fine-Tuning · Dense Connections · Adam · Layer Normalization · Residual Connection · Linear Warmup With Cosine Annealing