TL;DR
Smoothie introduces a novel diffusion method that smooths token embeddings based on semantic similarity, improving text generation quality by combining continuous and discrete diffusion advantages.
Contribution
It proposes a new diffusion approach on token embeddings that enhances text generation by balancing semantic structure and discreteness.
Findings
Outperforms existing diffusion models in generation quality.
Smoothing diffusion on token embeddings yields better performance than diffusion in the standard embedding space or on the categorical simplex.
Code is publicly available at https://github.com/ashaba1in/smoothie.
Abstract
Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respects discreteness but disregards semantic relations between tokens. In this paper, we propose Smoothing Diffusion on Token Embeddings (Smoothie), a novel diffusion method that combines the strengths of both approaches by progressively smoothing token embeddings based on semantic similarity. This technique enables gradual information removal while maintaining a natural decoding process. Experimental results on several sequence-to-sequence and unconditional generation tasks demonstrate that Smoothie outperforms existing…
Peer Reviews
Decision: Submitted to ICLR 2026
- The proposed diffusion space, which perturbs representations based on semantic similarity, is a contribution to the field.
- The authors provide empirical validation across multiple text generation tasks. The reported results suggest that the proposed method may offer a performance improvement over other diffusion-based baselines, and the inclusion of ablation studies helps to substantiate the specific design choices made in the SMOOTHIE framework.

- The proposed method's reliance on a pre-trained word embedding model (in this case, BERT) may limit its scalability and applicability. This dependency raises questions about the framework's potential to scale effectively with larger models or different architectures, as it is tied to the properties and constraints of the initial embedding space.
- The experimental evaluation is missing a common and important conditional text generation task: machine translation. Including results from machine

1. Unifies prior lines: maps each token to a vector of negative squared distances to all vocab embeddings, then diffuses and feeds softmax(D_t) to the model; enables natural argmax decoding while preserving semantics and discreteness. Clear training/sampling pseudocode.
2. The distance-based latent generalizes simplex diffusion (simplex emerges under a trivial metric), giving a clean conceptual frame.
3. Practical guidance on schedules/self-conditioning; moderate steps (~100–200) are sufficien

1. Fixed pre-trained embeddings (E) cap expressivity; authors acknowledge end-to-end training would likely help but leave it to future work.
2. Fixed sequence length forces substantial padding; variable length is emulated by truncating after EOS, which is inefficient; prior early-truncation is ad hoc.
3. Every step computes softmax over the full vocabulary V (and final argmax), which scales poorly for large V and long m; no top-k/approximation is provided.
4. Relies on the Euclidean semantic
The core contribution of defining the diffusion space using semantic distances (Euclidean proximity) in the embedding space is highly intuitive and well-justified. It elegantly addresses the major trade-off in existing work: retaining semantic structure (like Gaussian diffusion) while enabling natural decoding from discrete representations (like Simplex diffusion).
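The distance-based latent that the reviews describe can be illustrated with a toy sketch. This is not the authors' implementation: the embedding matrix `E`, the helper names, and in particular the `smooth` step (dividing by a hypothetical `sigma ** 2`) are assumptions made for illustration, since the reviews only state that the negative squared distances are diffused and that softmax(D_t) is fed to the model.

```python
import numpy as np

def neg_sq_distances(token_ids, E):
    """Map each token to its negative squared Euclidean distances to all vocab embeddings."""
    x = E[token_ids]                        # (m, d) embeddings of the sequence tokens
    diff = x[:, None, :] - E[None, :, :]    # (m, V, d) pairwise differences
    return -np.sum(diff ** 2, axis=-1)      # (m, V); the entry at the true token is 0

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def smooth(D0, sigma):
    """ASSUMPTION: model 'smoothing' as temperature scaling by sigma^2 (illustrative only)."""
    return D0 / sigma ** 2

rng = np.random.default_rng(0)
E = rng.normal(size=(6, 4))                 # toy vocabulary: 6 tokens, 4-dim embeddings
token_ids = np.array([2, 5, 0])

D0 = neg_sq_distances(token_ids, E)
p_sharp = softmax(smooth(D0, 0.1))          # small sigma: mass concentrates on the true token
p_flat = softmax(smooth(D0, 100.0))         # large sigma: close to uniform over the vocabulary
decoded = D0.argmax(axis=-1)                # argmax decoding recovers the original tokens
```

Under this assumed scaling, semantically close tokens (small distances) keep more probability mass than distant ones at intermediate `sigma`, which is one way to read "smoothing based on semantic similarity".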
SMOOTHIE (like most text diffusion models) runs over fixed-length sequences. In practice, the authors set a dataset-specific max length and pad shorter sequences with a special padding token that the model learns to predict. The generation process is bounded by the preset max: it can emit different effective lengths up to that cap, but it doesn’t truly sample variable length the way an autoregressive model does.
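The fixed-length handling described above can be sketched as follows. The token strings `<pad>` and `<eos>` and both helper functions are hypothetical names chosen for illustration; the reviews only say that shorter sequences are padded to a preset maximum and that variable length is emulated by truncating after EOS.

```python
PAD, EOS = "<pad>", "<eos>"   # hypothetical special tokens

def pad_to_length(tokens, max_len):
    """Append EOS, then pad (or clip) so the result has exactly max_len tokens."""
    seq = list(tokens)[: max_len - 1] + [EOS]
    return seq + [PAD] * (max_len - len(seq))

def truncate_after_eos(tokens):
    """Recover the effective sequence: keep tokens before the first EOS or PAD."""
    out = []
    for tok in tokens:
        if tok in (EOS, PAD):
            break
        out.append(tok)
    return out

padded = pad_to_length(["a", "b"], 5)
recovered = truncate_after_eos(padded)
```

Every position up to `max_len` is still processed at each diffusion step, so the cost is set by the cap rather than by the effective length.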
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Diffusion
