BLEUBERI: BLEU is a surprisingly effective reward for instruction following

Yapei Chang; Yekyung Kim; Michael Krumdick; Amir Zadeh; Chuan Li; Chris Tanner; Mohit Iyyer

arXiv:2505.11080 · cs.CL · October 27, 2025

TL;DR

This paper shows that BLEU, a simple string-matching metric, can stand in for a learned reward model when aligning large language models with human preferences, avoiding the cost of training a reward model while remaining competitive in performance.

Contribution

The authors introduce BLEUBERI, a method that first identifies challenging instructions and then applies Group Relative Policy Optimization (GRPO) with BLEU as the reward, achieving results competitive with reward model-guided approaches across multiple benchmarks.

Findings

01. BLEU matches strong reward models in agreement with human preferences on general instruction-following datasets.

02. BLEUBERI-trained models perform comparably to models trained with reward model-guided methods.

03. BLEUBERI outputs are more factually grounded than those of competing methods.
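
To make the first finding concrete, here is a minimal sketch of one common way to measure a metric's agreement with human preferences: count how often the metric scores the human-preferred response above the rejected one. This is not necessarily the paper's exact protocol. The function names, the toy pairs, and the `unigram_precision` scorer (a simplified stand-in for BLEU) are all hypothetical.

```python
def pairwise_agreement(pairs, score):
    """Fraction of pairs where `score` ranks the human-preferred response higher.

    Each pair is (reference, chosen, rejected); ties count as disagreement.
    """
    hits = sum(
        score(chosen, ref) > score(rejected, ref)
        for ref, chosen, rejected in pairs
    )
    return hits / len(pairs)


def unigram_precision(candidate, reference):
    """Toy stand-in for BLEU: share of candidate tokens found in the reference."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    return sum(tok in ref for tok in cand) / len(cand) if cand else 0.0


# Hypothetical toy data: (reference, human-preferred, human-rejected).
pairs = [
    ("Water boils at 100 degrees Celsius at sea level .",
     "At sea level water boils at 100 degrees Celsius .",
     "Water boils whenever it gets hot enough ."),
    ("The mitochondria produce most of the cell's energy .",
     "Mitochondria generate most of the energy in a cell .",
     "The nucleus stores the cell's genetic material ."),
    # A case where surface overlap favors the rejected response.
    ("Use a list comprehension .",
     "A comprehension is the idiomatic choice here .",
     "Use a list ."),
]

agreement = pairwise_agreement(pairs, unigram_precision)
```

Passing the scorer as an argument means the same loop could compare a string-matching metric and a reward model on identical pairs.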

Abstract

Reward models are central to aligning LLMs with human preferences, but they are costly to train, requiring large-scale human-labeled preference data and powerful pretrained LLM backbones. Meanwhile, the increasing availability of high-quality synthetic instruction-following datasets raises the question: can simpler, reference-based metrics serve as viable alternatives to reward models during RL-based alignment? In this paper, we show first that BLEU, a basic string-matching metric, surprisingly matches strong reward models in agreement with human preferences on general instruction-following datasets. Based on this insight, we develop BLEUBERI, a method that first identifies challenging instructions and then applies Group Relative Policy Optimization (GRPO) using BLEU directly as the reward function. We demonstrate that BLEUBERI-trained models are competitive with models trained via…
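
The core idea in the abstract, BLEU against a reference used directly as the reward inside GRPO, can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the whitespace tokenization, add-one smoothing, population standard deviation, and all function names are assumptions, and the policy update itself is omitted.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with add-one smoothing and a brevity penalty.

    Whitespace tokenization and the smoothing scheme are assumptions.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: a candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty: candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical example: three sampled responses to one instruction.
reference = "Paris is the capital of France ."
samples = [
    "Paris is the capital of France .",
    "The capital of France is Paris .",
    "I am not sure .",
]
rewards = [bleu(s, reference) for s in samples]
advantages = group_relative_advantages(rewards)
```

Because the advantages are computed relative to the other samples for the same instruction, only the ranking and spread of BLEU scores within a group matter, not their absolute scale.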

Code & Models

Repositories

lilakk/bleuberi · PyTorch · Official

Datasets

yapeichang/BLEUBERI-Tulu3-50k
dataset · 22 downloads

Videos

BLEUBERI: BLEU is a surprisingly effective reward for instruction following · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research

Methods: Balanced Selection