Correcting Large Language Model Behavior via Influence Function

Han Zhang; Zhuo Zhang; Yi Zhang; Yuanzhao Zhai; Hanyang Peng; Yu Lei; Yue Yu; Hui Wang; Bin Liang; Lin Gui; Ruifeng Xu

arXiv:2412.16451 · cs.LG · December 24, 2024

TL;DR

This paper introduces LANCET, a novel, human-involvement-free method for correcting undesirable behaviors in large language models by leveraging influence functions to identify and adjust impactful training data.

Contribution

LANCET is the first approach to use influence functions for large language model behavior correction without human data collection or manual intervention.

Findings

1. LANCET effectively corrects undesirable model behaviors.
2. It outperforms methods relying on human preference data.
3. It improves interpretability of model preference learning.

Abstract

Recent advancements in AI alignment techniques have significantly improved the alignment of large language models (LLMs) with static human preferences. However, the dynamic nature of human preferences can render some prior training data outdated or even erroneous, ultimately causing LLMs to deviate from contemporary human preferences and societal norms. Existing methodologies, whether they involve the curation of new data for continual alignment or the manual correction of outdated data for re-alignment, demand costly human resources. To address this challenge, we propose a novel approach, Large Language Model Behavior Correction with Influence Function Recall and Post-Training (LANCET), which requires no human involvement. LANCET consists of two phases: (1) using influence functions to identify the training data that significantly impact undesirable model outputs, and (2) applying an…
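To make phase (1) concrete, here is a minimal sketch of the classic influence-function score, which estimates how up-weighting one training example would change the loss on a chosen query. It uses a small ridge regression so the Hessian can be formed and inverted exactly. This is an illustration only and not LANCET's actual procedure: at LLM scale the inverse-Hessian product has to be approximated, and the names `fit_ridge` and `influence_on_query`, the toy data, and the query point are all assumptions introduced for this example.

```python
import numpy as np


def fit_ridge(X, y, lam, weights=None):
    """Minimise sum_i w_i * 0.5 * (x_i @ theta - y_i)**2 + 0.5 * lam * ||theta||^2."""
    n, d = X.shape
    w = np.full(n, 1.0 / n) if weights is None else weights
    H = (X * w[:, None]).T @ X + lam * np.eye(d)  # exact Hessian of the objective
    theta = np.linalg.solve(H, (X * w[:, None]).T @ y)
    return theta, H


def influence_on_query(X, y, x_q, y_q, lam):
    """Score each training example i by -grad_q^T H^{-1} grad_i.

    A negative score means up-weighting example i would lower the query
    loss; a positive score means it would raise it.
    """
    theta, H = fit_ridge(X, y, lam)
    grad_train = (X @ theta - y)[:, None] * X  # per-example gradients, shape (n, d)
    grad_query = (x_q @ theta - y_q) * x_q     # gradient of the query loss, shape (d,)
    return -grad_train @ np.linalg.solve(H, grad_query)


# Toy data (hypothetical): 50 examples, 3 features, one query of interest.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
x_q, y_q = rng.normal(size=3), 0.0

scores = influence_on_query(X, y, x_q, y_q, lam=0.1)
ranked = np.argsort(-np.abs(scores))  # training examples ordered by influence magnitude
```

Ranking by magnitude surfaces the examples that matter most for the query. Which sign counts as "responsible" for an undesirable output depends on how the query loss is defined, so that choice is left open here.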

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques