BLEUBERI: BLEU is a surprisingly effective reward for instruction following
Yapei Chang, Yekyung Kim, Michael Krumdick, Amir Zadeh, Chuan Li, Chris Tanner, Mohit Iyyer

TL;DR
This paper demonstrates that BLEU, a simple string-matching metric, can effectively replace complex reward models for aligning large language models with human preferences, reducing training costs while maintaining performance.
Contribution
The authors introduce BLEUBERI, a novel method that uses BLEU as a reward in RL-based alignment, showing competitive results with reward model-guided approaches across multiple benchmarks.
Findings
BLEU matches strong reward models in agreement with human preferences on general instruction-following datasets.
BLEUBERI achieves comparable performance to reward model-based methods.
BLEUBERI outputs are more factually grounded than competing methods.
Abstract
Reward models are central to aligning LLMs with human preferences, but they are costly to train, requiring large-scale human-labeled preference data and powerful pretrained LLM backbones. Meanwhile, the increasing availability of high-quality synthetic instruction-following datasets raises the question: can simpler, reference-based metrics serve as viable alternatives to reward models during RL-based alignment? In this paper, we show first that BLEU, a basic string-matching metric, surprisingly matches strong reward models in agreement with human preferences on general instruction-following datasets. Based on this insight, we develop BLEUBERI, a method that first identifies challenging instructions and then applies Group Relative Policy Optimization (GRPO) using BLEU directly as the reward function. We demonstrate that BLEUBERI-trained models are competitive with models trained via…
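The idea described in the abstract, scoring each sampled response with BLEU against a reference and feeding that score to GRPO, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the whitespace tokenization, the add-one smoothing, and the helper names `bleu_reward` and `group_relative_advantages` are hypothetical and not taken from the paper, and the advantage normalization follows the commonly described GRPO formulation (reward minus group mean, divided by group standard deviation) rather than the authors' training code.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_reward(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference.

    Assumptions (not from the paper): whitespace tokenization and
    add-one smoothing on every n-gram precision.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group: three sampled responses to one instruction, one reference answer.
reference = "the cat sat on the mat"
group = ["the cat sat on the mat", "a cat is on the mat", "dogs bark loudly"]
rewards = [bleu_reward(c, reference) for c in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the exact match receives the highest reward and a positive advantage, while the unrelated response receives the lowest reward and a negative advantage. A real setup would more likely rely on an established BLEU implementation and an RL training library than on hand-written helpers like these.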
Code & Models
- 🤗 yapeichang/Qwen2.5-7B-BLEUBERI · model · 4 dl · ♡ 1
- 🤗 yapeichang/Qwen2.5-7B-RM8B · model · 5 dl
- 🤗 yapeichang/Llama-3.1-8B · model · 9 dl
- 🤗 yapeichang/Llama-3.1-8B-SFT · model · 3 dl
- 🤗 yapeichang/Llama-3.1-8B-BLEUBERI · model · 7 dl · ♡ 1
- 🤗 yapeichang/Llama-3.1-8B-RM8B · model · 1 dl
- 🤗 yapeichang/Qwen2.5-3B-BLEUBERI · model · 1 dl
- 🤗 yapeichang/Qwen2.5-3B-RM8B · model · 86 dl
- 🤗 yapeichang/Qwen2.5-3B-SFT · model · 2 dl
- 🤗 yapeichang/Qwen2.5-7B-SFT · model · 2 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
Methods: Balanced Selection
