SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks

Fenia Christopoulou; Ronald Cardenas; Gerasimos Lampouras; Haitham Bou-Ammar; Jun Wang

arXiv:2410.05102 · cs.CL · November 3, 2025

TL;DR

SparsePO introduces a novel token-weighting approach for preference alignment in language models, learning to sparsify token contributions to improve alignment with human preferences without sacrificing response quality.

Contribution

It proposes a flexible, learnable token masking method that dynamically balances reward and divergence during preference optimization, enhancing alignment performance.

Findings

01. Achieves a +10% win rate gain in summarization tasks

02. Achieves a +3% win rate gain in dialogue tasks

03. Maintains reasoning, relevance, and faithfulness of responses

Abstract

Direct alignment algorithms have proven an effective step for aligning language models to human-desired behaviors. Current variants of the Direct Preference Optimization objective have focused on a strict setting where all tokens contribute signals of KL divergence and rewards to the loss function. However, human preference is not affected equally by each word in a sequence but often depends on specific words or phrases, e.g., the presence of toxic terms leads to non-preferred responses. Based on this observation, we argue that not all tokens should be weighted equally during preference optimization (PO) and propose a flexible objective, termed SparsePO, that aims to automatically learn to weight the KL divergence and reward corresponding to each token during PO training. We propose two different variants of weight-masks that can either be derived from the reference model itself or learned on the fly.…
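
To make the token-weighting idea in the abstract concrete, here is a minimal sketch, not the paper's exact objective. It assumes a DPO-style logistic loss in which each token's policy-to-reference log-ratio is scaled by a mask weight in [0, 1] before summing. The function names, the dict layout, and the example numbers are hypothetical, and the separate token-level KL divergence term that the abstract also mentions is left out for brevity.

```python
import math

def masked_log_ratio(logp_policy, logp_ref, mask):
    """Sum per-token log-ratios, each scaled by its mask weight (assumed in [0, 1])."""
    return sum(m * (lp - lr) for lp, lr, m in zip(logp_policy, logp_ref, mask))

def token_weighted_po_loss(chosen, rejected, beta=0.1):
    """DPO-style logistic loss over mask-weighted token log-ratios.

    `chosen` and `rejected` are dicts with per-token lists under the
    (hypothetical) keys 'policy', 'ref', and 'mask'.
    """
    u_w = masked_log_ratio(chosen["policy"], chosen["ref"], chosen["mask"])
    u_l = masked_log_ratio(rejected["policy"], rejected["ref"], rejected["mask"])
    z = beta * (u_w - u_l)
    # -log(sigmoid(z)) written in a numerically stable form for either sign of z
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Illustrative values only: the second chosen token is masked out entirely.
chosen = {"policy": [-1.0, -2.0], "ref": [-1.5, -2.0], "mask": [1.0, 0.0]}
rejected = {"policy": [-1.0, -1.0], "ref": [-1.0, -0.5], "mask": [1.0, 1.0]}
print(token_weighted_po_loss(chosen, rejected, beta=0.1))
```

With all mask weights set to 1 this reduces to the usual sequence-level log-ratio margin, and with all weights set to 0 the loss is log 2 regardless of the inputs, which is one way to see how the mask controls how much each token contributes.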

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper is clear and well-written.
2. The method seems novel, has a clear and well-established motivation, and is mathematically rigorous.
3. The paper performs experiments on a varied set of tasks.
4. The sentiment control experiments show good trade-offs between KL divergence and reward. The "Sparsity and Token-level KL divergence" experiment is insightful.
5. Nice improvements on IFEVAL and BBH with H&H training.

Weaknesses

1. The TL;DR dataset can be unfaithful and a small set of 120 prompts can hinder results further. Hence I tend to suspect the results. This is a more experimental design problem than a method problem. For faithfulness, AFAIK there are better methods like Q^2, True, GPM, and more.
2. Although the H&H shows some nice results, the size of the model combined with the difficulty of the benchmarks (OpenLLM-2 is designed to be much harder than 1) limit the ability to properly assess the method capabil

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper presents a technically sound approach. The motivation for proposing the objective function of SparsePO lies in the classic problem of token contribution allocation in reinforcement learning. The transformation is well-motivated and follows a logical progression.
2. The use of masks to control the contribution of each token is a valid approach. The two proposed mask computation strategies, MAPO and SPARSEPO, are clearly described and seem feasible. The technical details provided in

Weaknesses

1. Learned sparse masks do not necessarily match human preferences: In the learnable sparse mask, the author only illustrated in the paper how to adjust parameters to ensure the learned mask is sparse. However, it cannot be guaranteed that the crucial tokens are learned correctly. For example, in Figure 9(a), SparsePO-common rewards assigns almost equal rewards to all tokens.
2. Inconsistent performance across metrics: Table 2 shows that SparsePO gains over pass@100 but has a slight decay in th

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. SparsePO introduces dynamic token weighting, enhancing model adaptability and generation diversity across different preference criteria.
2. The method is evaluated across multiple datasets.

Weaknesses

1. The core motivation - that human preferences depend on specific words rather than equally on all tokens - lacks empirical and theoretical validation.
2. The introduction of m(y_t) in Equation 3 does not guarantee optimization equivalence with previous work (Zeng et al., 2024).
3. The learnable sparse mask implementation using a single feed-forward network requires more theoretical justification. Additionally, the method's sensitivity to model architecture and data distribution needs further

Taxonomy

Topics: Machine Learning and Data Classification · Natural Language Processing Techniques · Video Analysis and Summarization

Methods: Parrot optimizer: Algorithm and applications to medical problems