Quality-constrained Entropy Maximization Policy Optimization for LLM Diversity

Haihui Pan; Yuzhong Hong; Shaoke Lv; Junwei Bao; Hongfei Jiang; Yang Song

arXiv:2602.15894·cs.CL·February 19, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces QEMPO, a method that enhances large language model output diversity by maximizing entropy under quality constraints, maintaining or improving performance compared to existing alignment techniques.

Contribution

The paper presents a novel theoretical decomposition of alignment into quality and diversity, and proposes QEMPO, a new policy optimization method that balances diversity and quality in LLM outputs.

Findings

1. QEMPO improves output diversity without sacrificing quality.
2. QEMPO achieves comparable or better performance than RLHF.
3. Theoretical analysis decomposes alignment into quality and diversity components.

Abstract

Recent research indicates that while alignment methods significantly improve the quality of large language model (LLM) outputs, they simultaneously reduce the diversity of those outputs. Although some methods have been proposed to enhance LLM output diversity, they often come at the cost of reduced performance. In this work, we first theoretically demonstrate that the alignment task can be decomposed into two distributions: quality and diversity. To enhance the diversity of LLM outputs while ensuring quality, we propose Quality-constrained Entropy Maximization Policy Optimization (QEMPO). QEMPO aims to maximize the output entropy of the policy while ensuring output quality. By adding different constraints to QEMPO, we obtain different policies. To optimize these policies, we propose both online and offline training methods. Experiments validate that QEMPO achieves performance…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- **Simple and practical modification of RLHF:** The approach extends existing RLHF pipelines with minimal implementation overhead.
- **Code availability:** Implementations for both learning modes are clearly described, supporting reproducibility.

Weaknesses

- **No related works at all:** The paper omits nearly all prior work on *quality–diversity trade-offs*, both in evaluation and training, leaving the contribution insufficiently contextualized. This includes several theoretical and empirical studies on balancing diversity and reward, entropy-based evaluation, and diversity-aware optimization methods. These works include dozens of papers, among which I highlight a few important ones and recent contributions below:
  - **Sajjadi, M. S. M. et al

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- Clearly motivated by the observed diversity reduction problem in LLM alignment.
- Proposes both offline and online optimization variants (QEMPO, QEMPO-KL) with corresponding closed-form solutions.
- Empirical results on multiple model scales (1B–8B) show consistent diversity improvements across lexical, syntactic, and semantic dimensions.

Weaknesses

The theoretical analysis appears problematic in several parts (although the proofs in the Appendix were not fully verified).
- Proposition 2 seems to restate the alignment objective as depending on both “quality” and “diversity,” but the result largely follows from how π* is constructed: π* is assumed to be uniform over acceptable responses and negligible elsewhere. Under this assumption, the KL term naturally decomposes into an entropy term and a monotone function of the probability mass on Y_w. Whil
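
The decomposition this review describes can be checked numerically on a toy example. The sketch below is an illustration under stated assumptions, not the paper's construction: it assumes a finite response set, takes the KL term to be KL(π ‖ π*), and models "negligible elsewhere" as a small total mass `eps` spread uniformly over the unacceptable responses. The function names, `eps`, and the toy policy are all hypothetical. Under those assumptions, KL(π ‖ π*) equals −H(π) plus a term that is linear in the mass m = π(Y_w).

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution given as a list."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def reference_policy(n_w, n_l, eps):
    """Assumed pi*: uniform over n_w acceptable responses (total mass 1 - eps)
    and uniform over n_l unacceptable responses (total mass eps)."""
    return [(1 - eps) / n_w] * n_w + [eps / n_l] * n_l

def quality_term(m, n_w, n_l, eps):
    """Part of KL(pi || pi*) that depends on pi only through m = pi(Y_w)."""
    return -m * math.log((1 - eps) / n_w) - (1 - m) * math.log(eps / n_l)

def decomposed_kl(p, n_w, n_l, eps):
    """KL(pi || pi*) rewritten as -H(pi) + quality_term(pi(Y_w))."""
    m = sum(p[:n_w])  # the first n_w entries are the acceptable responses
    return -entropy(p) + quality_term(m, n_w, n_l, eps)

# Toy policy over 3 acceptable and 2 unacceptable responses (made-up numbers).
n_w, n_l, eps = 3, 2, 1e-3
pi = [0.5, 0.2, 0.1, 0.15, 0.05]
pi_star = reference_policy(n_w, n_l, eps)

direct = kl(pi, pi_star)
split = decomposed_kl(pi, n_w, n_l, eps)
print(abs(direct - split) < 1e-9)
```

In this toy setting the quality term decreases as m grows whenever (1 − eps)/n_w > eps/n_l, which is the sense in which it is a monotone function of the probability mass on Y_w; the remaining term is the negative entropy of π.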

Reviewer 03 · Rating 2 · Confidence 5

Strengths

The paper is studying an interesting question.

Weaknesses

1. If I understand correctly, $Y_l$ and $Y_w$ are not well-defined.
   - It seems that $Y_l$ and $Y_w$ represent "bad" and "good" responses when deriving the theory, where "good" and "bad" are treated as absolute concepts. However, in equation (7) and in section 4.1, $y_w$ and $y_l$ are instead relatively good and bad responses, since the evaluation is based on pairwise comparisons. There seems to be a conceptual gap between these two interpretations, and the paper does not addres

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education