SP^2DPO: An LLM-assisted Semantic Per-Pair DPO Generalization
Chaoyue He, Xin Zhou, Di Wang, Hong Xu, Wei Liu, Chunyan Miao

TL;DR
SP2DPO replaces DPO's single global temperature with a per-pair temperature beta_i, set offline from semantic annotations of each preference pair, and adds no training overhead.
Contribution
The paper introduces an instance-specific beta schedule for DPO that uses semantic annotations (category, magnitude, confidence) to better handle heterogeneous preference data.
Findings
SP2DPO performs competitively with global-beta DPO baselines.
It improves length-controlled win rate on some models.
The method incurs zero training overhead.
Abstract
Direct Preference Optimization (DPO) controls the trade-off between fitting preference labels and staying close to a reference model using a single global temperature beta, implicitly treating all preference pairs as equally informative. Real-world preference corpora are heterogeneous: they mix high-signal, objective failures (for example, safety, factuality, instruction violations) with low-signal or subjective distinctions (for example, style), and also include label noise. We introduce our method, SP2DPO (Semantic Per-Pair DPO), a generalization that replaces the global temperature with an instance-specific schedule beta_i pre-decided offline from structured semantic-gap annotations (category, magnitude, confidence) produced by teacher language models. We instantiate this procedure on the UltraFeedback preference corpus (59,960 pairs), enabling large-scale construction of an…
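As a minimal sketch of the idea in the abstract, the code below computes the standard DPO loss with a per-pair temperature beta_i in place of the global beta. It assumes the beta_i values have already been decided offline; the excerpt does not specify how the annotations (category, magnitude, confidence) map to beta_i, so that step is left out. The function and argument names are illustrative, not taken from the paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def per_pair_dpo_loss(policy_chosen, policy_rejected,
                      ref_chosen, ref_rejected, betas):
    """Mean DPO loss where pair i uses its own temperature betas[i].

    The first four arguments are sequence log-probabilities (one float per
    pair) under the policy and the reference model. `betas` is assumed to be
    precomputed offline, one value per pair (hypothetical input format).
    """
    n = len(betas)
    if not (len(policy_chosen) == len(policy_rejected)
            == len(ref_chosen) == len(ref_rejected) == n):
        raise ValueError("all inputs must have the same length")
    if n == 0:
        raise ValueError("need at least one preference pair")

    total = 0.0
    for pc, pr, rc, rr, beta in zip(policy_chosen, policy_rejected,
                                    ref_chosen, ref_rejected, betas):
        # Implicit reward margin: chosen log-ratio minus rejected log-ratio.
        margin = (pc - rc) - (pr - rr)
        total += -log_sigmoid(beta * margin)
    return total / n


def global_dpo_loss(policy_chosen, policy_rejected,
                    ref_chosen, ref_rejected, beta):
    """Global-beta DPO, recovered by giving every pair the same beta."""
    betas = [beta] * len(policy_chosen)
    return per_pair_dpo_loss(policy_chosen, policy_rejected,
                             ref_chosen, ref_rejected, betas)
```

Setting every beta_i to the same constant recovers global-beta DPO, which is why the per-pair form is a generalization. Because the schedule is fixed before training, the training loop only reads one extra scalar per pair.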
Taxonomy
Topics: Constraint Satisfaction and Optimization · Machine Learning and Data Classification · Multi-Criteria Decision Making
