Aligning Large Language Models from Self-Reference AI Feedback with one General Principle

Rong Bao; Rui Zheng; Shihan Dou; Xiao Wang; Enyu Zhou; Bo Wang; Qi Zhang; Liang Ding; Dacheng Tao

arXiv:2406.11190 · cs.CL · June 18, 2024 · 1 citation

Open Access · 1 Repo

TL;DR

This paper introduces a self-reference AI feedback framework for aligning large language models, enabling them to generate high-quality preference feedback using simple principles without human input.

Contribution

The paper proposes a self-reference feedback method that reduces position bias and improves LLM alignment while relying only on a simple, general principle such as "best for humanity."

Findings

01. Enhanced feedback quality from 13B and 70B Llama2-Chat models.
02. Significant improvements in benchmark performance after reinforcement learning.
03. Reduced position bias through self-consistency and semantic perplexity measures.
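Position bias means a judge tends to favor an answer because of where it appears in the prompt. One generic way to check a pairwise judgment for consistency is to ask in both orders and keep the verdict only when the two agree. The sketch below illustrates that idea only; it is not taken from the paper, it does not cover the semantic perplexity measure, and `judge` and `consistent_preference` are hypothetical names.

```python
from typing import Callable, Optional

# Hypothetical judge signature: (instruction, first_answer, second_answer)
# returns "first" or "second" for whichever position it prefers.
Judge = Callable[[str, str, str], str]


def consistent_preference(
    judge: Judge, instruction: str, answer_1: str, answer_2: str
) -> Optional[int]:
    """Return 1 or 2 if the judge picks the same answer in both orders, else None."""
    forward = judge(instruction, answer_1, answer_2)   # answer_1 shown first
    backward = judge(instruction, answer_2, answer_1)  # answer_2 shown first

    # Map each positional verdict back to the underlying answer index.
    forward_pick = 1 if forward == "first" else 2
    backward_pick = 2 if backward == "first" else 1

    # Disagreement across orders suggests the verdict depended on position.
    return forward_pick if forward_pick == backward_pick else None
```

A judge that always answers "first" yields `None` here, so its position-driven verdicts would be dropped rather than used as preference labels.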

Abstract

In aligning large language models (LLMs), utilizing feedback from existing advanced AI rather than humans is an important method to scale supervisory signals. However, it is highly challenging for AI to understand human intentions and societal values, and provide accurate preference feedback based on these. Current AI feedback methods rely on powerful LLMs, carefully designed specific principles to describe human intentions, and are easily influenced by position bias. To address these issues, we propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback under simple and general principles such as "best for humanity". Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference, and finally determine which answer better fits human…
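The three steps named in the abstract (respond, criticize with the model's own response as reference, then choose) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` is a hypothetical prompt-to-text callable standing in for the feedback model, and the prompt wording and verdict parsing are invented for the example.

```python
from typing import Callable, Dict, Optional

# General principle quoted in the abstract.
PRINCIPLE = "best for humanity"


def self_reference_feedback(
    generate: Callable[[str], str],  # hypothetical: prompt in, completion out
    instruction: str,
    answer_a: str,
    answer_b: str,
    principle: str = PRINCIPLE,
) -> Dict[str, object]:
    # Step 1: the feedback model answers the user's instruction itself.
    reference = generate(f"Instruction: {instruction}\nWrite your own response.")

    # Step 2: it criticizes each candidate, using its own response as a reference.
    critiques = {}
    for label, answer in (("A", answer_a), ("B", answer_b)):
        critiques[label] = generate(
            f"Instruction: {instruction}\n"
            f"Reference response: {reference}\n"
            f"Candidate answer {label}: {answer}\n"
            "Critique the candidate answer, comparing it with the reference response."
        )

    # Step 3: it states which candidate it prefers under the general principle
    # (the prompt wording here is an assumption, not the paper's template).
    verdict = generate(
        f"Instruction: {instruction}\n"
        f"Critique of A: {critiques['A']}\n"
        f"Critique of B: {critiques['B']}\n"
        f"Which answer is {principle}? Reply with A or B."
    ).strip().upper()

    # Keep only a clean "A" or "B"; anything else is treated as no verdict.
    preferred: Optional[str] = verdict[:1] if verdict[:1] in ("A", "B") else None
    return {"reference": reference, "critiques": critiques, "preferred": preferred}
```

In practice `generate` would wrap a call to a chat model such as the 13B Llama2-Chat mentioned above, and the returned `preferred` labels would serve as preference data for later training.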

Code & Models

Repositories

rbao2018/self_ref_feedback
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques