Semantic-aware Wasserstein Policy Regularization for Large Language Model Alignment
Byeonghu Na, Hyungho Na, Yeongmin Kim, Suhyeon Jo, HeeSun Bae, Mina Kang, Il-Chul Moon

TL;DR
This paper introduces Wasserstein Policy Regularization, a semantic-aware method for aligning large language models with human preferences, outperforming traditional divergence-based approaches by capturing the geometry of token space.
Contribution
The paper proposes a novel Wasserstein-based regularization for RLHF that incorporates semantic information, improving alignment of large language models.
Findings
WPR outperforms KL- and f-divergence-based methods in experiments.
Semantic-aware regularization improves model alignment.
The method is compatible with standard RL algorithms.
Abstract
Large language models (LLMs) are commonly aligned with human preferences using reinforcement learning from human feedback (RLHF). In this method, LLM policies are generally optimized through reward maximization with Kullback-Leibler (KL) divergence regularization of the reference policy. However, KL and its f-divergence variants only compare token probabilities at identical indices, failing to capture semantic similarity. We propose Wasserstein Policy Regularization (WPR), a semantic-aware regularization for the RLHF framework based on the entropy-regularized Wasserstein distance, which incorporates the geometry of the token space. The dual formulation of the distance expresses the regularization as penalty terms applied to the reward via optimal dual variables, which yield a tractable objective compatible with standard RL algorithms. Empirically, our method outperforms KL- and…
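The contrast drawn in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the three-token vocabulary, the 2-D embeddings, the L2 cost, the value of `eps`, and the helper names `kl` and `sinkhorn` are all assumptions chosen for illustration. The sketch compares a policy that moves probability mass from "cat" to "kitten" with one that moves it to "table". KL scores both shifts the same, while an entropy-regularized Wasserstein cost computed with plain Sinkhorn iterations does not.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

def sinkhorn(p, q, C, eps=0.5, n_iters=500):
    """Entropy-regularized optimal transport between p (rows) and q (columns).

    Returns the transport cost <P, C> and the dual potentials (phi, psi).
    Assumes p and q are strictly positive so that the logs are finite.
    """
    K = np.exp(-C / eps)              # Gibbs kernel
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iters):
        u = p / (K @ v)               # match row marginals
        v = q / (K.T @ u)             # match column marginals
    P = u[:, None] * K * v[None, :]   # transport plan
    phi = eps * np.log(u)             # dual potential on the p side
    psi = eps * np.log(v)             # dual potential on the q side
    return float(np.sum(P * C)), phi, psi

# Hypothetical 2-D embeddings for the tokens: cat, kitten, table.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
# Pairwise L2 distances between embeddings serve as the cost matrix.
C = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)

ref = np.array([0.90, 0.05, 0.05])        # reference policy favors "cat"
to_kitten = np.array([0.05, 0.90, 0.05])  # policy shifts mass to "kitten"
to_table = np.array([0.05, 0.05, 0.90])   # policy shifts mass to "table"

kl_kitten, kl_table = kl(to_kitten, ref), kl(to_table, ref)
ot_kitten, phi_kitten, _ = sinkhorn(to_kitten, ref, C)
ot_table, phi_table, _ = sinkhorn(to_table, ref, C)

print("KL:", kl_kitten, kl_table)
print("Entropic OT cost:", ot_kitten, ot_table)
```

In this toy setting the two KL values coincide because KL compares probabilities index by index, whereas the transport cost is smaller for the shift to the nearby "kitten" embedding. The potentials `phi_kitten` and `phi_table` are the kind of dual variables the abstract refers to. How the paper turns them into reward penalties is not reproduced here.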
Peer Reviews
Decision·ICLR 2026 Poster
Clear, Compelling Motivation: The paper's greatest strength is its crystal-clear motivation. The "cat/kitten/table" example in Figure 1 immediately and intuitively communicates the flaw in existing methods and the rationale for the new one.
Elegant and Tractable Formulation: The core technical contribution is the derivation of a tractable algorithm (WPR) from a complex theoretical concept (Wasserstein distance). The insight to use the dual optimal variables (ϕ*) as a direct reward penalty (The
Cost Matrix C as a "Black Box": The entire semantic-awareness of the method hinges on the cost matrix C, which is built from the token embeddings of the reference SFT model. This is a reasonable choice, but its impact is not deeply explored. The ablation study only compares L2 vs. Cosine distance, but not the source or quality of the embeddings. The paper doesn't answer: what if the SFT model's embeddings are of poor quality? Would using embeddings from a more powerful, external model improve re
- By replacing KL with a Wasserstein-based penalty that reflects token embedding geometry, the method tries to capture semantic similarity (e.g., “cat” vs. “kitten”), which could lead to more natural, aligned outputs rather than over-penalizing harmless semantic variations. But there are some concerns, as detailed in the weaknesses.
- The paper derives a dual formulation enabling efficient Sinkhorn-based computation, making the method compatible with standard RLHF pipelines (e.g., PPO) and achievi
- Why does token-space geometry matter in practice, and what is lost with KL? This rationale should be strengthened. The authors show win rates but don’t link them explicitly to semantic alignment behaviour.
- The phrase “incorporates the geometry of the token space” is unclear without further explanation. The method relies on embedding-space distances, not true semantic grounding. This raises questions: Which embedding space? Frozen SFT model embeddings? Reference model embeddings? Is the embedding
1. The paper is well written and easy to read.
2. Both the theoretical formulation and the practical implementation of the regularization are introduced in detail and analyzed from a complexity perspective.
3. The comprehensive experiments show the superior performance of the new regularization and the effect of each component in the framework.
1. The only base model in the experiments is the pre-trained Gemma-2B. Results on models from other LLM families would further validate the effectiveness of the regularization.
2. The authors present the neglect of semantic similarity in previous regularization methods as the main motivation and highlight that the proposed WPR is a semantic-aware regularization. However, an analysis of semantic awareness is missing from the description of the regularization method. In the experiments, this is n
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Natural Language Processing Techniques
