Correcting Large Language Model Behavior via Influence Function
Han Zhang, Zhuo Zhang, Yi Zhang, Yuanzhao Zhai, Hanyang Peng, Yu Lei, Yue Yu, Hui Wang, Bin Liang, Lin Gui, Ruifeng Xu

TL;DR
This paper introduces LANCET, a method that corrects undesirable behaviors in large language models without human involvement by using influence functions to identify and adjust the training data that most affects those behaviors.
Contribution
LANCET is the first approach to use influence functions for large language model behavior correction without human data collection or manual intervention.
Findings
LANCET effectively corrects undesirable model behaviors.
It outperforms methods relying on human preference data.
It improves interpretability of model preference learning.
Abstract
Recent advancements in AI alignment techniques have significantly improved the alignment of large language models (LLMs) with static human preferences. However, the dynamic nature of human preferences can render some prior training data outdated or even erroneous, ultimately causing LLMs to deviate from contemporary human preferences and societal norms. Existing methodologies, whether they involve the curation of new data for continual alignment or the manual correction of outdated data for re-alignment, demand costly human resources. To address this challenge, we propose a novel approach, Large Language Model Behavior Correction with Influence Function Recall and Post-Training (LANCET), which requires no human involvement. LANCET consists of two phases: (1) using influence functions to identify the training data that significantly impact undesirable model outputs, and (2) applying an…
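The first phase described above, recalling the training data that most influences an undesirable output, can be illustrated with the classical influence-function estimate: the influence of upweighting a training example on a query loss is approximated by the negative product of the query-loss gradient, the inverse Hessian of the training objective, and the training example's gradient. The sketch below is a toy illustration under stated assumptions, not LANCET's implementation: it uses a one-parameter ridge regression so that the Hessian is a scalar, and the function names (`fit_ridge_1d`, `influence_scores`, `recall_top_k`) and the data are hypothetical. The second phase is not sketched because the abstract is truncated before describing it.

```python
def fit_ridge_1d(xs, ys, l2):
    """Closed-form minimizer of mean(0.5 * (x * theta - y)**2) + 0.5 * l2 * theta**2."""
    n = len(xs)
    # Second derivative of the regularized objective (a scalar Hessian here).
    hessian = sum(x * x for x in xs) / n + l2
    theta = (sum(x * y for x, y in zip(xs, ys)) / n) / hessian
    return theta, hessian


def influence_scores(xs, ys, x_query, y_query, l2=0.01):
    """Estimate how upweighting each training example changes the query loss.

    score_i = -grad_query * grad_train_i / hessian (scalar-parameter case).
    A positive score means upweighting example i is estimated to raise the
    query loss, so the example is a candidate for recall.
    """
    theta, hessian = fit_ridge_1d(xs, ys, l2)
    query_grad = (x_query * theta - y_query) * x_query
    scores = []
    for x, y in zip(xs, ys):
        train_grad = (x * theta - y) * x
        scores.append(-query_grad * train_grad / hessian)
    return scores


def recall_top_k(scores, k):
    """Indices of the k training examples with the largest influence scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical toy data following y = 2x, except index 3, whose label is wrong.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, -8.0, 10.0]
scores = influence_scores(xs, ys, x_query=1.0, y_query=2.0)
print(recall_top_k(scores, k=1))  # prints [3]
```

In this toy case the mislabeled example is the only one with a positive score, so it is the one recalled. For a large language model the Hessian cannot be formed or inverted directly, so practical systems rely on approximations of the inverse-Hessian-vector product; which approximation LANCET uses is not stated in this excerpt.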
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
