From Lists to Emojis: How Format Bias Affects Model Alignment
Xuanchang Zhang, Wei Xiong, Lichang Chen, Tianyi Zhou, Heng Huang, Tong Zhang

TL;DR
This paper investigates how format biases in preference models, including human evaluators and LLMs, influence model rankings and alignment. It finds that a small amount of biased data can significantly skew reward models, and that alignment algorithms exploit format patterns more easily than they improve response quality.
Contribution
The study provides a comprehensive analysis of format biases beyond verbosity in preference learning and demonstrates their impact on model evaluation and alignment.
Findings
Preference models exhibit strong biases towards specific formats like lists and emojis.
Small biased datasets can inject significant bias into reward models.
Format biases can be exploited by alignment algorithms more easily than response quality improvements.
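One rough way to probe for the format preferences listed in these findings is to count how often the preferred response in a pair carries a format pattern that the rejected one lacks. The Python sketch below is illustrative only and is not the authors' method: the regex detectors, the `format_win_rates` helper, and the (chosen, rejected) input shape are all assumptions made for this example. A rate well above 0.5 would hint at a preference for that pattern, though it does not control for response quality.

```python
import re

# Hypothetical detectors for the format patterns named in the summary.
# These regexes are rough heuristics chosen for illustration.
PATTERNS = {
    "list": re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", re.MULTILINE),
    "link": re.compile(r"https?://\S+|\[[^\]]+\]\([^)]+\)"),
    "bold": re.compile(r"\*\*[^*\n]+\*\*"),
    "emoji": re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]"),
}


def has_pattern(text, name):
    """Return True if `text` contains the named format pattern."""
    return PATTERNS[name].search(text) is not None


def format_win_rates(pairs):
    """Estimate how often the preferred response is the formatted one.

    `pairs` is an iterable of (chosen, rejected) response strings.
    For each pattern, only pairs where exactly one response contains the
    pattern are counted; the result is the fraction of those pairs in
    which the chosen response is the one containing it (None if no such
    pair exists).
    """
    pairs = list(pairs)  # allow generators, since we iterate once per pattern
    rates = {}
    for name in PATTERNS:
        wins = total = 0
        for chosen, rejected in pairs:
            c = has_pattern(chosen, name)
            r = has_pattern(rejected, name)
            if c != r:
                total += 1
                wins += c
        rates[name] = wins / total if total else None
    return rates


if __name__ == "__main__":
    toy_pairs = [
        ("- a\n- b", "a and b"),
        ("plain", "1. x"),
        ("Great 🎉", "Great"),
        ("**yes**", "yes"),
    ]
    print(format_win_rates(toy_pairs))
```

The toy pairs here are made up; in practice the input would be the labelled pairs of whichever preference dataset or judge is being audited.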
Abstract
In this paper, we study format biases in reinforcement learning from human feedback (RLHF). We observe that many widely-used preference models, including human evaluators, GPT-4, and top-ranking models on the RewardBench benchmark, exhibit strong biases towards specific format patterns, such as lists, links, bold text, and emojis. Furthermore, large language models (LLMs) can exploit these biases to achieve higher rankings on popular benchmarks like AlpacaEval and LMSYS Chatbot Arena. One notable example of this is verbosity bias, where current preference models favor longer responses that appear more comprehensive, even when their quality is equal to or lower than shorter, competing responses. However, format biases beyond verbosity remain largely underexplored in the literature. In this work, we extend the study of biases in preference learning beyond the commonly recognized length…
Taxonomy
Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Text Readability and Simplification
Methods: Attention Is All You Need · Direct Preference Optimization · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Dropout · Dense Connections
