ZYN: Zero-Shot Reward Models with Yes-No Questions for RLAIF
Victor Gallego

TL;DR
This paper introduces ZYN, a zero-shot reward modeling approach that prompts an instruction-tuned language model with Yes-No questions, so that text generation can be aligned with human preferences without labeled data.
Contribution
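The paper presents a zero-shot reward model framework in which an instruction-tuned critic language model, prompted with a Yes-No question that encodes the user's preference, supplies the reward signal for steering a base language model. The framework applies across a range of text generation tasks.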
Findings
Effective in detoxification and sentiment optimization
Compatible with quality-diversity search methods
Enables personalized prompt generation for text-to-image tasks
Abstract
In this work, we address the problem of directing the text generation of a language model (LM) towards a desired behavior, aligning the generated text with the preferences of the human operator. We propose using another, instruction-tuned language model as a critic reward model in a zero-shot way: it is prompted with a Yes-No question that represents the user preferences, without requiring further labeled data. This zero-shot reward model provides the learning signal to further fine-tune the base LM using Reinforcement Learning from AI Feedback (RLAIF); our approach is also compatible with other contexts such as quality-diversity search. Extensive evidence of the capabilities of the proposed ZYN framework is provided through experiments in different domains related to text generation, including detoxification; optimizing sentiment of movie reviews, or any other attribute; steering…
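The Yes-No reward idea in the abstract can be illustrated with a minimal sketch. It assumes the critic model exposes next-token logits for the answers "Yes" and "No"; the prompt template, the function names, and the choice of renormalizing over the two answer tokens are illustrative assumptions, not details confirmed by this summary.

```python
import math


def build_prompt(text: str, question: str) -> str:
    """Format a Yes-No question about a generated text.

    The template is hypothetical; any instruction-style prompt that ends
    by asking for a Yes or No answer would play the same role.
    """
    return f"Text: {text}\nQuestion: {question}\nAnswer (Yes or No):"


def yes_no_reward(logit_yes: float, logit_no: float) -> float:
    """Turn the critic's logits for "Yes" and "No" into a reward in [0, 1].

    Assumption: the reward is the probability of "Yes" renormalized over
    the two answer tokens (a two-way softmax). Subtracting the maximum
    keeps the exponentials numerically stable.
    """
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


# The logits below are made-up placeholders standing in for a critic's output.
prompt = build_prompt("What a wonderful film.", "Is this movie review positive?")
reward = yes_no_reward(logit_yes=2.0, logit_no=0.0)
```

In an RLAIF setup, a scalar like `reward` would be the signal passed to the reinforcement learning step that fine-tunes the base LM; changing the question changes the preference being optimized, with no labeled data involved.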
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Balanced Selection
