Not All Preference Pairs Are Created Equal: A Recipe for Annotation-Efficient Iterative Preference Learning

Sen Yang; Leyang Cui; Deng Cai; Xinting Huang; Shuming Shi; Wai Lam

arXiv:2406.17312·cs.CL·October 14, 2024

Open Access

TL;DR

This paper proposes selecting response pairs with small implicit reward margins for annotation in iterative preference learning. The strategy lowers annotation cost while matching or exceeding the random selection baseline.

Contribution

The paper introduces a margin-based method for choosing which response pairs to annotate in preference learning. Building on assumptions about uncertainty and distribution shift, it ranks pairs by the implicit reward margin predicted by DPO, reducing annotation cost while maintaining performance.

Findings

01. Selecting response pairs with small reward margins improves learning efficiency.

02. Allocating more annotation budget to earlier iterations yields better results than allocating it to later ones.

03. Margin-based selection outperforms random sampling in experiments.
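The second finding can be illustrated with a small budget-splitting sketch. The paper reports only that earlier iterations deserve more of the budget; it does not prescribe a schedule. The geometric decay used here, along with the function name `front_loaded_budgets` and the `decay` parameter, is a hypothetical choice made purely for illustration.

```python
def front_loaded_budgets(total, iterations, decay=0.5):
    """Split `total` annotations across iterations, giving earlier ones more.

    Assumption: geometric decay is an illustrative schedule, not the paper's.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")

    # Weight of iteration i is decay**i, so weights never increase over time.
    weights = [decay ** i for i in range(iterations)]
    scale = total / sum(weights)

    # Round down, then hand the rounding remainder to the first iteration.
    budgets = [int(w * scale) for w in weights]
    budgets[0] += total - sum(budgets)
    return budgets


if __name__ == "__main__":
    print(front_loaded_budgets(total=700, iterations=3, decay=0.5))
```

Setting `decay=1.0` recovers a uniform split, which gives a baseline to compare against a front-loaded allocation.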

Abstract

Iterative preference learning, though it yields superior performance, requires preference labels annotated online. In this work, we study strategies for selecting response pairs worth annotating, so that annotation is cost-efficient while performance remains competitive with, or even better than, the random selection baseline for iterative preference learning. Building on assumptions regarding uncertainty and distribution shifts, we propose a comparative view that ranks the implicit reward margins predicted by DPO to select the response pairs that yield more benefit. Through extensive experiments, we show that annotating response pairs with small margins is generally better than annotating pairs with large margins or randomly selected pairs, under both single- and multi-iteration scenarios. In addition, our empirical results suggest allocating more of the annotation budget to earlier iterations rather than later ones when training spans multiple iterations.
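The selection rule described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard DPO implicit reward, beta * (log pi_theta(y|x) - log pi_ref(y|x)), takes the margin of a pair as the absolute difference between the two responses' rewards, and keeps the pairs with the smallest margins up to a budget. The function names, the `beta` default, and the use of precomputed sequence log-probabilities as inputs are all assumptions.

```python
import numpy as np


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    return beta * (logp_policy - logp_ref)


def select_small_margin_pairs(logp_policy_a, logp_ref_a,
                              logp_policy_b, logp_ref_b,
                              budget, beta=0.1):
    """Return indices of the `budget` pairs with the smallest reward margins.

    Each input holds one sequence log-probability per pair, for response A
    or response B, under the policy or the reference model (assumed inputs).
    """
    reward_a = implicit_reward(logp_policy_a, logp_ref_a, beta)
    reward_b = implicit_reward(logp_policy_b, logp_ref_b, beta)

    # A small margin means the model is uncertain which response is better.
    margins = np.abs(reward_a - reward_b)

    # Stable sort keeps the original order among pairs with equal margins.
    order = np.argsort(margins, kind="stable")
    return order[:budget], margins


if __name__ == "__main__":
    # Toy log-probabilities for four response pairs (made-up numbers).
    selected, margins = select_small_margin_pairs(
        logp_policy_a=[-10.0, -12.0, -8.0, -15.0],
        logp_ref_a=[-11.0, -12.0, -9.0, -14.0],
        logp_policy_b=[-9.0, -12.5, -20.0, -15.0],
        logp_ref_b=[-9.0, -12.0, -10.0, -14.2],
        budget=2,
    )
    print("send to annotators:", selected.tolist())
    print("margins:", margins.round(3).tolist())
```

Only the selected pairs would be sent for annotation. Replacing the ascending sort with a descending one, or with a random permutation, gives the large-margin and random baselines the abstract compares against.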

Taxonomy

Topics: Data Management and Algorithms · Rough Sets and Fuzzy Logic

Methods: Direct Preference Optimization