Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads

Avelina Asada Hadji-Kyriacou; Ognjen Arandjelovic

arXiv:2405.20053·cs.CL·May 31, 2024

TL;DR

This paper introduces Direct Preference Heads (DPH), a fine-tuning framework in which an auxiliary reward head learns human preference signals without directly altering the output distribution of the language modeling head. Models with DPH outperform SFT and DPO on GLUE, RACE, and GPT4All.

Contribution

The paper proposes a novel fine-tuning framework using auxiliary reward heads to better align language models with human preferences while preserving reasoning capabilities.

Findings

01 Models with DPH outperform Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) on GLUE, RACE, and GPT4All.

02 Theoretical analysis links DPH to Conservative Direct Preference Optimization.

03 DPH maintains reasoning abilities while improving alignment with human preferences.

Abstract

Pre-trained Language Models (LMs) exhibit strong zero-shot and in-context learning capabilities; however, their behaviors are often difficult to control. By utilizing Reinforcement Learning from Human Feedback (RLHF), it is possible to fine-tune unsupervised LMs to follow instructions and produce outputs that reflect human preferences. Despite its benefits, RLHF has been shown to potentially harm a language model's reasoning capabilities and introduce artifacts such as hallucinations where the model may fabricate facts. To address this issue we introduce Direct Preference Heads (DPH), a fine-tuning framework that enables LMs to learn human preference signals through an auxiliary reward head without directly affecting the output distribution of the language modeling head. We perform a theoretical analysis of our objective function and find strong ties to Conservative Direct Preference…
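
The idea of a reward head trained with a conservative preference objective can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's actual objective or code: the linear `reward_head`, the `eps` smoothing parameter, and all function names are hypothetical, and the loss shown is the generic label-smoothed Bradley-Terry form commonly associated with Conservative DPO.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_head(hidden, weights, bias=0.0):
    """Hypothetical linear reward head: one pooled hidden state -> scalar reward.

    It only reads the hidden state. The language modeling head's logits are
    computed elsewhere and are not modified by this function.
    """
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def conservative_preference_loss(r_chosen, r_rejected, eps=0.1):
    """Label-smoothed Bradley-Terry loss on a pair of scalar rewards.

    eps is the assumed probability that the preference label is flipped;
    eps = 0 recovers the plain (non-conservative) pairwise loss.
    """
    margin = r_chosen - r_rejected
    return -(1.0 - eps) * log_sigmoid(margin) - eps * log_sigmoid(-margin)


if __name__ == "__main__":
    # Toy hidden states for a preferred and a dispreferred completion.
    w = [0.5, -0.25]
    r_good = reward_head([1.0, 2.0], w, bias=0.1)
    r_bad = reward_head([-1.0, 2.0], w, bias=0.1)
    print(conservative_preference_loss(r_good, r_bad))
```

With `eps > 0` the loss is minimized at a finite margin of log((1 - eps) / eps), so the rewards are not pushed apart without bound; with `eps = 0` a larger margin always lowers the loss.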

Code & Models

Repositories

Avelina9X/direct-preference-heads (PyTorch, official)

Models

Avelina/lovelace-medium-alpha1 (Hugging Face model · 13 downloads · 1 like)

Videos

Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads (SlidesLive)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods