ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment

Xiaoqiang Lin; Arun Verma; Zhongxiang Dai; Daniela Rus; See-Kiong Ng; Bryan Kian Hsiang Low

arXiv:2505.19241·cs.LG·May 18, 2026

TL;DR

ActiveDPO introduces a theoretically grounded, LLM-aware active data selection method for preference-based alignment, significantly reducing data collection costs and improving alignment quality.

Contribution

It presents a novel active data selection algorithm that accounts for the LLM's influence, outperforming existing methods in sample-efficient alignment.

Findings

01

ActiveDPO outperforms existing methods across multiple models.

02

It effectively reduces the amount of human preference data needed.

03

The method demonstrates superior performance on real-world datasets.

Abstract

The recent success in using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks, such as question answering, mathematical reasoning, and code generation. However, achieving effective LLM alignment depends on high-quality datasets of human preferences. Collecting these datasets requires human preference annotation, which is costly and resource-intensive, necessitating efficient active data selection methods. Existing methods either lack a strong theoretical foundation or depend on restrictive assumptions about the reward function, such as linear latent reward functions. To this end, we propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions while directly leveraging the LLM itself to parameterize the reward model used for active data selection.…
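The reviews on this page describe the selection step in terms of gradient differences and a matrix V_t that is "presumably an outer product of the gradients", made tractable by projecting to a lower dimension. As a rough illustration of that general idea, and not the authors' implementation, the sketch below greedily picks the candidates whose feature vector x has the largest score x^T V^{-1} x, then updates V with the outer product of the chosen vector. The names `random_projection` and `select_batch`, the `ridge` initialisation of V, and the use of a Gaussian projection are all assumptions made for this example. The rows of `features` stand in for per-candidate gradient differences, which the sketch does not compute.

```python
import numpy as np


def random_projection(dim_in, dim_out, seed=0):
    """Gaussian projection matrix (an assumed way to shrink gradient vectors)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((dim_in, dim_out)) / np.sqrt(dim_out)


def select_batch(features, batch_size, ridge=1.0):
    """Greedily pick rows maximising x^T V^{-1} x.

    V starts as ridge * I and grows by the outer product x x^T of each
    chosen row. V^{-1} is maintained with the Sherman-Morrison identity,
    so no matrix is inverted inside the loop.
    """
    n, d = features.shape
    v_inv = np.eye(d) / ridge
    available = np.ones(n, dtype=bool)
    chosen = []
    for _ in range(batch_size):
        # Score of row i is features[i] @ v_inv @ features[i].
        scores = np.einsum("ij,jk,ik->i", features, v_inv, features)
        scores[~available] = -np.inf  # never pick the same row twice
        idx = int(np.argmax(scores))
        chosen.append(idx)
        available[idx] = False
        x = features[idx]
        vx = v_inv @ x
        # Rank-one update: (V + x x^T)^{-1} = V^{-1} - (V^{-1}x)(V^{-1}x)^T / (1 + x^T V^{-1} x)
        v_inv -= np.outer(vx, vx) / (1.0 + x @ vx)
    return chosen


# Usage: project hypothetical 1000-dimensional vectors to 16 dimensions, then select 4.
rng = np.random.default_rng(1)
raw = rng.standard_normal((50, 1000))
picked = select_batch(raw @ random_projection(1000, 16), batch_size=4)
```

Once a direction has been selected, the rank-one update lowers the score of every other candidate pointing the same way, so later picks favour directions not yet covered.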

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

- First theoretical and algorithmic formulation of active learning for DPO.
- Provides algorithms for both online and offline settings, enabling flexible application.
- Validates the theoretical foundation with reasonable empirical results.

Weaknesses

- Reliance on a log-linear policy approximation may lead to shallow alignment, easy reward hacking, and neglected task complexity.
- The paper assumes (Assumption 1) that *all policies* are log-linear in the last-layer features. This assumption is what enables the D-optimal design analysis, but it also restricts the model's expressive capacity.
- In realistic settings, such as aligning LLMs, the relation between prompt/response pairs and human judgements is likely

Reviewer 02·Rating 6·Confidence 4

Strengths

Active preference learning is a very important research topic. I like the method, which resembles the use of influence functions for active learning. The writing is clear and the scale of the experiments is OK.

Weaknesses

**1. Discussion on Comparison with Active Preference Learning for Large Language Models**

A detailed comparison with Active Preference Learning for Large Language Models (arXiv:2402.08114) would strengthen the paper. In particular, while the APL paper focuses on reward difference, which only reflects the immediate step before updates, the current work's use of gradient difference captures the potential improvement after updates. This distinction highlights a more forward-looking and theoreticall

Reviewer 03·Rating 6·Confidence 4

Strengths

+ The motivation of the Active DPO method is theoretically grounded and principled.
+ The implementation details which allow the method to become tractable are quite clever and dramatically improve the tractability of the method.
+ The ablations are quite in-depth and demonstrate effectively which parts of the algorithm are important, and show that their design is robust to various different models and datasets.
+ The Active DPO method appears to outperform all other Active DPO approaches in the

Weaknesses

+ The description of the algorithm is slightly unclear at times, particularly around the description of the matrix V_t. This is presumably an outer product of the gradients, but the authors don't comment on the fact that this is obviously intractable to store for modern LLMs, let alone invert. I appreciate that the matrix is tractable when projected to 8192 dimensions with the approximations made later on, but the authors should highlight this difficulty earlier on.
+ The computational requireme

Videos

ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment· slideslive

Taxonomy

Topics: Advanced Data Compression Techniques · Advanced Image and Video Retrieval Techniques · Algorithms and Data Compression

Methods: ALIGN