Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation

Alexander Shabalin; Viacheslav Meshchaninov; Dmitry Vetrov

arXiv:2505.18853·cs.CL·May 18, 2026

TL;DR

Smoothie introduces a novel diffusion method that smooths token embeddings based on semantic similarity, improving text generation quality by combining continuous and discrete diffusion advantages.

Contribution

It proposes a new diffusion approach on token embeddings that enhances text generation by balancing semantic structure and discreteness.

Findings

01

Outperforms existing diffusion models in generation quality.

02

Smoothing diffusion on token embeddings yields better performance than diffusion on standard embeddings or on the categorical simplex.

03

Code is publicly available at https://github.com/ashaba1in/smoothie.

Abstract

Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respects discreteness but disregards semantic relations between tokens. In this paper, we propose Smoothing Diffusion on Token Embeddings (Smoothie), a novel diffusion method that combines the strengths of both approaches by progressively smoothing token embeddings based on semantic similarity. This technique enables gradual information removal while maintaining a natural decoding process. Experimental results on several sequence-to-sequence and unconditional generation tasks demonstrate that Smoothie outperforms existing…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 5

Strengths

- The proposed diffusion space, which perturbs representations based on semantic similarity, is a contribution to the field.
- The authors provide empirical validation across multiple text generation tasks. The reported results suggest that the proposed method may offer a performance improvement over other diffusion-based baselines, and the inclusion of ablation studies helps to substantiate the specific design choices made in the SMOOTHIE framework.

Weaknesses

- The proposed method's reliance on a pre-trained word embedding model (in this case, BERT) may limit its scalability and applicability. This dependency raises questions about the framework's potential to scale effectively with larger models or different architectures, as it is tied to the properties and constraints of the initial embedding space.
- The experimental evaluation is missing a common and important conditional text generation task: machine translation. Including results from machine

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Unifies prior lines: maps each token to a vector of negative squared distances to all vocab embeddings, then diffuses and feeds softmax(D_t) to the model; enables natural argmax decoding while preserving semantics and discreteness. Clear training/sampling pseudocode.
2. The distance-based latent generalizes simplex diffusion (simplex emerges under a trivial metric), giving a clean conceptual frame.
3. Practical guidance on schedules/self-conditioning; moderate steps (~100–200) are sufficien
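The distance-based latent described in point 1 can be illustrated with a minimal sketch. This is not the authors' implementation: the toy embeddings, the function names, and the `sigma` scaling used to mimic smoothing are all assumptions made for illustration, and the paper's actual noise schedule is not reproduced here. The sketch only shows how a token becomes a vector of negative squared distances, how a softmax over that vector spreads probability toward nearby tokens first, and how argmax recovers the token.

```python
import math

def neg_sq_distances(embeddings, token_id):
    """Negative squared Euclidean distance from one token's embedding
    to every embedding in the vocabulary."""
    target = embeddings[token_id]
    return [-sum((a - b) ** 2 for a, b in zip(target, row)) for row in embeddings]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def smoothed_probs(embeddings, token_id, sigma):
    """Hypothetical smoothing knob (an assumption, not the paper's schedule):
    divide the latent by sigma^2 before the softmax, so a larger sigma
    moves probability mass to semantically close tokens first."""
    d0 = neg_sq_distances(embeddings, token_id)
    return softmax([d / sigma ** 2 for d in d0])

def decode(latent):
    """Argmax decoding: index of the largest entry."""
    return max(range(len(latent)), key=latent.__getitem__)

# Made-up 2-D vocabulary: tokens 0 and 1 are neighbours, token 2 is far away.
TOY_EMBEDDINGS = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]

d0 = neg_sq_distances(TOY_EMBEDDINGS, 0)               # squared distances 0, 1, 50, negated
sharp = smoothed_probs(TOY_EMBEDDINGS, 0, sigma=0.1)   # nearly one-hot on token 0
smooth = smoothed_probs(TOY_EMBEDDINGS, 0, sigma=2.0)  # mass leaks to token 1, barely to token 2
```

With one-hot embeddings every other token is equally far from the target, so the latent carries no semantic ordering; this is one way to read the remark in point 2 that the simplex case emerges under a trivial metric.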

Weaknesses

1. Fixed pre-trained embeddings (E) cap expressivity; authors acknowledge end-to-end training would likely help but leave it to future work.
2. Fixed sequence length forces substantial padding; variable length is emulated by truncating after EOS, which is inefficient; prior early-truncation is ad hoc.
3. Every step computes softmax over the full vocabulary V (and final argmax), which scales poorly for large V and long m; no top-k/approximation is provided.
4. Relies on the Euclidean semantic

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The core contribution of defining the diffusion space using semantic distances (Euclidean proximity) in the embedding space is highly intuitive and well-justified. It elegantly addresses the major trade-off in existing work: retaining semantic structure (like Gaussian diffusion) while enabling natural decoding from discrete representations (like Simplex diffusion).

Weaknesses

SMOOTHIE (like most text diffusion models) runs over fixed-length sequences. In practice, the authors set a dataset-specific maximum length and pad shorter sequences with a special padding token that the model learns to predict. The generation process is therefore bounded by the preset maximum: it can emit different effective lengths up to that cap, but it doesn’t truly sample variable length the way an autoregressive model does.
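The fixed-length behaviour the reviewer describes can be sketched as follows. The helper names, the padding id `0`, and the maximum length of 6 are hypothetical choices for illustration and are not taken from the paper or its repository.

```python
def pad_to_length(token_ids, max_len, pad_id):
    """Right-pad (or clip) a sequence so it has exactly max_len tokens."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))

def effective_sequence(token_ids, pad_id):
    """Recover the effective output by cutting at the first padding token."""
    if pad_id in token_ids:
        return list(token_ids[:token_ids.index(pad_id)])
    return list(token_ids)

PAD_ID = 0    # assumed padding id
MAX_LEN = 6   # assumed dataset-specific cap

padded = pad_to_length([5, 6, 7], MAX_LEN, PAD_ID)
recovered = effective_sequence(padded, PAD_ID)
clipped = pad_to_length(list(range(1, 10)), MAX_LEN, PAD_ID)
```

Every sequence occupies `MAX_LEN` positions regardless of its effective length, and anything longer than the cap is clipped, which is the sense in which the output length is bounded rather than sampled.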

Code & Models

Repositories

ashaba1in/smoothie
github

Models

🤗
yasserrmd/smoothie-diffusion-qqp
model · 39 dl · ♡ 2

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Diffusion