Semantic Smoothing for Language Models via Distribution Estimation and Embeddings

Haricharan Balasundaram; Swathi Shree Narashiman; Pranay Mathur; Andrew Thangaraj

arXiv:2605.07994 · cs.IT · May 11, 2026

TL;DR

This paper introduces semantic smoothing, a smoothing method for language models that uses embeddings to share statistical observations across semantically similar contexts, with the aim of reducing perplexity.

Contribution

It formulates semantic smoothing as distribution estimation under KL loss with KL-proximity side information, and provides worst-case risk bounds and empirical validation.

Findings

01

Semantic smoothing reduces test perplexity on synthetic and real data.

02

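For $n$ samples on a $d$-symbol alphabet with side information at KL distance $\Delta$, the interpolation estimator has worst-case KL risk $O(\min\{\Delta, d/n\})$, and the paper proves a lower bound of matching order.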

03

Experiments show consistent improvements over traditional smoothing techniques.

Abstract

We propose semantic smoothing, a smoothing method for language models that uses embeddings to share statistical observations across semantically similar contexts. The starting point is a decomposition of log-perplexity that motivates smoothing as a collection of distribution-estimation problems under Kullback-Leibler (KL) loss. We then show that, under a Lipschitz-logit model for embedding-based language generation, proximity of context embeddings implies proximity of the corresponding next-word distributions in KL divergence. Combining these observations, we formulate semantic smoothing as distribution estimation in KL loss with KL-proximity side information. For $n$ samples on a $d$-symbol alphabet with a side-information distribution at KL distance $\Delta$, we give an interpolation estimator with worst-case KL risk $O(\min\{\Delta, d/n\})$, and prove a matching-order lower bound for…
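To make the estimation setup concrete, here is a minimal Python sketch of one way to interpolate between a count-based estimate and a side-information distribution. It is not the paper's estimator: the function names, the add-one (Laplace) base estimate, the mixing weight `lam = (d/n) / (delta + d/n)`, and the reading of "KL distance $\Delta$" as $\mathrm{KL}(p \,\|\, q) \le \Delta$ are all assumptions made for illustration.

```python
import math


def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def add_one(counts):
    """Add-one (Laplace) estimate from per-symbol counts."""
    n, d = sum(counts), len(counts)
    return [(c + 1) / (n + d) for c in counts]


def interpolate(counts, side, delta):
    """Mix the add-one estimate with the side-information distribution.

    Hypothetical weight (an assumption of this sketch, not the paper's rule):
    lam = (d/n) / (delta + d/n), so the estimate leans on `side` when
    delta is small relative to d/n and on the data when delta is large.
    """
    n, d = sum(counts), len(counts)
    if n == 0:
        # No observations: fall back to the side information entirely.
        return list(side)
    a = d / n
    lam = a / (delta + a)
    p_hat = add_one(counts)
    return [(1 - lam) * ph + lam * s for ph, s in zip(p_hat, side)]


# Illustrative inputs (made up for this example).
counts = [3, 1, 0, 0]            # n = 4 samples on a d = 4 symbol alphabet
side = [0.5, 0.3, 0.1, 0.1]      # side-information distribution q
est = interpolate(counts, side, delta=0.05)
```

Because KL divergence is convex in its second argument, the KL loss of this mixture is at most the `lam`-weighted average of the losses of its two components. With the weight above, that average stays within a factor of two of $\min\{\Delta, d/n\}$ whenever the add-one term is itself at most $d/n$. This is an observation about the sketch under the stated assumptions, not a restatement of the paper's proof.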
