Aligning Large Language Models from Self-Reference AI Feedback with one General Principle
Rong Bao, Rui Zheng, Shihan Dou, Xiao Wang, Enyu Zhou, Bo Wang, Qi Zhang, Liang Ding, Dacheng Tao

TL;DR
This paper introduces a self-reference AI feedback framework for aligning large language models, enabling them to generate high-quality preference feedback using simple principles without human input.
Contribution
It proposes a novel self-reference feedback method that reduces bias and improves alignment of LLMs using general principles like "best for humanity."
Findings
Enhanced feedback quality from 13B and 70B Llama2-Chat models.
Significant improvements in benchmark performance after reinforcement learning.
Reduced position bias through self-consistency and semantic perplexity measures.
Abstract
In aligning large language models (LLMs), utilizing feedback from existing advanced AI rather than humans is an important method to scale supervisory signals. However, it is highly challenging for AI to understand human intentions and societal values, and provide accurate preference feedback based on these. Current AI feedback methods rely on powerful LLMs, carefully designed specific principles to describe human intentions, and are easily influenced by position bias. To address these issues, we propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback under simple and general principles such as "best for humanity". Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference, and finally determine which answer better fits human…
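The three steps named in the abstract (respond, criticize against the model's own response, choose) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `generate` callable, the prompt wording, and the digit-based verdict parsing are all hypothetical, and only the principle string comes from the abstract.

```python
from typing import Callable, Sequence

# Hypothetical type: any function mapping a prompt string to generated text.
Generate = Callable[[str], str]

PRINCIPLE = "best for humanity"  # general principle quoted in the abstract


def self_reference_feedback(generate: Generate, instruction: str,
                            answers: Sequence[str]) -> int:
    """Return the index of the preferred answer (illustrative prompts only)."""
    # Step 1: the model answers the instruction itself to obtain a reference.
    reference = generate(f"Instruction: {instruction}\nAnswer:")

    # Step 2: the model criticizes each candidate against its own reference.
    critiques = [
        generate(
            f"Instruction: {instruction}\nReference answer: {reference}\n"
            f"Candidate answer: {answer}\nCritique the candidate:"
        )
        for answer in answers
    ]

    # Step 3: the model picks the answer that better fits the principle.
    listing = "\n".join(
        f"[{i}] {answer}\nCritique: {critique}"
        for i, (answer, critique) in enumerate(zip(answers, critiques))
    )
    verdict = generate(
        f"Principle: {PRINCIPLE}\nInstruction: {instruction}\n{listing}\n"
        "Reply with the index of the better answer:"
    )

    # Assumed parsing rule: take the first digit; fall back to 0 otherwise.
    digits = [ch for ch in verdict if ch.isdigit()]
    choice = int(digits[0]) if digits else 0
    return choice if choice < len(answers) else 0
```

Any text-generation backend can be passed in as `generate`; the returned index could then label a preference pair for reward-model training.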
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
