From Lists to Emojis: How Format Bias Affects Model Alignment

Xuanchang Zhang; Wei Xiong; Lichang Chen; Tianyi Zhou; Heng Huang; Tong Zhang

arXiv:2409.11704·cs.CL·May 26, 2025

TL;DR

This paper investigates how format biases in preference models, including human evaluators and LLMs, influence model rankings and alignment. It shows that a small amount of biased data can inject significant bias into reward models, and that alignment algorithms exploit format biases more easily than they improve response quality.

Contribution

The study provides a comprehensive analysis of format biases beyond verbosity in preference learning and demonstrates their impact on model evaluation and alignment.

Findings

01. Preference models exhibit strong biases towards specific formats like lists and emojis.

02. Small biased datasets can inject significant bias into reward models.

03. Format biases can be exploited by alignment algorithms more easily than response quality improvements.
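
One way to make the first finding concrete is to count, over a set of preference pairs, how often the preferred response is the one that uses a given format. The sketch below is only an illustration under stated assumptions: the regex detectors, the `format_win_rate` helper, and the toy `pairs` are all hypothetical and are not taken from the paper, whose actual measurement procedure is not described in this summary.

```python
import re

# Hypothetical detectors for the format patterns named in the summary.
# These regexes are illustrative assumptions, not the paper's definitions.
FORMAT_PATTERNS = {
    "list": re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+", re.MULTILINE),
    "bold": re.compile(r"\*\*[^*\n]+\*\*"),
    "link": re.compile(r"\[[^\]]+\]\(https?://[^)\s]+\)"),
    "emoji": re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]"),
}


def has_format(text, name):
    """Return True if `text` contains the named format pattern."""
    return FORMAT_PATTERNS[name].search(text) is not None


def format_win_rate(pairs, name):
    """Fraction of informative pairs in which the preferred response uses the format.

    `pairs` is an iterable of (chosen, rejected) response strings. Only pairs
    where exactly one response uses the format are counted. Returns None when
    no such pair exists.
    """
    wins = total = 0
    for chosen, rejected in pairs:
        c, r = has_format(chosen, name), has_format(rejected, name)
        if c == r:
            continue  # both or neither use the format: no signal
        total += 1
        wins += int(c)
    return wins / total if total else None


# Toy (chosen, rejected) pairs, invented purely for illustration.
pairs = [
    ("- step one\n- step two", "Do step one, then step two."),
    ("Use **pip** to install.", "Use pip to install."),
    ("Plain answer.", "1. first\n2. second"),
    ("Plain.", "Also plain."),
]

for name in FORMAT_PATTERNS:
    print(name, format_win_rate(pairs, name))
```

A rate well above 0.5 on real preference data would suggest that the labeler favors the format, although it would not by itself separate format from content quality; controlling for that would need pairs that differ only in formatting.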

Abstract

In this paper, we study format biases in reinforcement learning from human feedback (RLHF). We observe that many widely-used preference models, including human evaluators, GPT-4, and top-ranking models on the RewardBench benchmark, exhibit strong biases towards specific format patterns, such as lists, links, bold text, and emojis. Furthermore, large language models (LLMs) can exploit these biases to achieve higher rankings on popular benchmarks like AlpacaEval and LMSYS Chatbot Arena. One notable example of this is verbosity bias, where current preference models favor longer responses that appear more comprehensive, even when their quality is equal to or lower than shorter, competing responses. However, format biases beyond verbosity remain largely underexplored in the literature. In this work, we extend the study of biases in preference learning beyond the commonly recognized length…

Taxonomy

Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Text Readability and Simplification

Methods: Attention Is All You Need · Direct Preference Optimization · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Dropout · Dense Connections