Annotation-Efficient Language Model Alignment via Diverse and Representative Response Texts
Yuu Jinnai, Ukyo Honda

TL;DR
This paper introduces AEPO, a method that efficiently selects diverse and representative response subsets for preference annotation, improving language model alignment with limited annotation resources.
Contribution
AEPO is a novel approach that optimizes preference dataset quality by strategic response selection, reducing annotation costs while maintaining effectiveness.
Findings
AEPO outperforms baselines with the same annotation budget.
The responses AEPO selects increase the diversity and representativeness of the preference dataset.
AEPO improves language model alignment with fewer annotations.
Abstract
Preference optimization is a standard approach to fine-tuning large language models to align with human preferences. The quantity, diversity, and representativeness of the preference dataset are critical to the effectiveness of preference optimization. However, obtaining a large amount of preference annotations is difficult in many applications. This raises the question of how to use the limited annotation budget to create an effective preference dataset. To this end, we propose Annotation-Efficient Preference Optimization (AEPO). Instead of exhaustively annotating preference over all available response texts, AEPO selects a subset of responses that maximizes diversity and representativeness from the available responses and then annotates preference over the selected ones. In this way, AEPO focuses the annotation budget on labeling preferences over a smaller but informative subset of…
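The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the token-overlap similarity `jaccard`, the trade-off weight `lam`, the function name `select_subset`, and the greedy search are all assumptions made for illustration. The abstract states only that the chosen subset should maximize diversity and representativeness.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed stand-in for any text similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def select_subset(responses, k, lam=1.0, sim=jaccard):
    """Greedily pick k responses that are representative yet mutually diverse.

    Hypothetical objective (an assumption, not taken from the paper):
      representativeness = mean similarity of a response to all candidates
      redundancy         = summed similarity to the responses already picked
      score              = representativeness - lam * redundancy
    """
    n = len(responses)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= len(responses)")

    # Pairwise similarity matrix over all candidate responses.
    s = [[sim(responses[i], responses[j]) for j in range(n)] for i in range(n)]
    rep = [sum(row) / n for row in s]

    selected = []
    remaining = set(range(n))
    while len(selected) < k:
        def score(i):
            redundancy = sum(s[i][j] for j in selected)
            return rep[i] - lam * redundancy

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return [responses[i] for i in selected]


if __name__ == "__main__":
    candidates = ["a b c", "a b c", "x y z"]
    # With the diversity penalty, the duplicate is skipped.
    print(select_subset(candidates, k=2, lam=1.0))
    # Without it, the two most representative (identical) responses are kept.
    print(select_subset(candidates, k=2, lam=0.0))
```

Under this sketch, only the `k` selected responses would be sent for preference annotation; setting `lam=0` reduces the selection to picking the most representative responses with no regard for redundancy.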
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The authors identify an important problem, which is the need to make the best possible use of human preference annotation effort. The paper offers a broad range of evaluations and a comprehensive discussion of the recent preference optimization literature. The method is described clearly and would be straightforward to implement, although the authors also make code available. For me, the most persuasive result is AEPO at N>2 vs AEPO at N=2, which shows that the proposed method of example selecti
From a conceptual perspective, the idea of selecting diverse *and* representative training examples is not original, and the application to preference learning does not change the fundamentals in a significant way. See, e.g., (Wei et al. 2006 "Submodularity in Data Subset Selection and Active Learning"; Bıyık et al. 2019 "Batch active learning using determinantal point processes"), which propose more principled approaches to the same goal. In fact, the situation is significantly simplified in the s
1. Annotation efficiency is an important research problem in preference learning. 2. AEPO shows some empirical improvements over DPO and DPO variants.
1. The AEPO approach appears to be a straightforward extension of Diverse Minimum Bayes Risk (DMBR) decoding (https://arxiv.org/pdf/2401.05054) applied to preference annotations. The objective function remains largely identical to that in DMBR decoding. 2. Lack of Comparative Baseline with DMBR: Since the DMBR paper already claims to improve data quality in generated responses, it’s essential for the authors to evaluate AEPO directly against DMBR decoding. This would clarify whether AEPO actual
AEPO seems like a valuable technique targeting an important area, and the results show that being smart about which preference pairs to use can be more effective than simply having a lot of data, at least for the configurations studied in the paper. The experiments seem to be carefully done and comprehensively reported, and the gains seem significant.
1. The opening of the paper emphasizes how important it is to use human annotators, so it is surprising to find, in section 4, that no human annotation was used for the experiments. 2. The above is part of a larger tweak to the narrative that I would suggest: there is evidently no reason to focus on cost in particular, since the expensive settings seem to do very poorly. 3. The choice of DPO as the objective is significant, I believe. In Figure 3, the blue line goes up a bit and then goes down
The motivation behind this paper is highly relevant to the research community. In scenarios where human annotations are costly, reducing the number of annotations can substantially cut down expenses. Leveraging information gain as a criterion for sample selection in the context of alignment presents a promising approach.
**Methodology:** There is a type mismatch between the definition of information gain (which applies to sets of responses) and its application to preference pairs (which are sets of response pairs). Additionally, the paper introduces information gain broadly but quickly shifts to the two heuristics without a formal mathematical connection between them. A constructive addition would be to use toy examples to measure information gain and demonstrate how the two heuristic metrics correlate with it.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Direct Preference Optimization · ALIGN
