Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
Zhengyu Fang, Zhimeng Jiang, Huiyuan Chen, Xiaoge Zhang, Tianyi Li, Kaiyu Tang, Xiao Li, Jing Li

TL;DR
This paper reveals that dataset-level metrics hide the true extent of non-determinism in diffusion language models, and proposes a fine-grained, factor-aware evaluation method to better understand model variability.
Contribution
The paper introduces a fine-grained evaluation framework and Factor Variance Attribution (FVA) to analyze non-determinism in diffusion language models and attribute it to its sources.
Findings
Dataset-level metrics attenuate non-determinism: aggregating over samples masks run-to-run variability on individual inputs.
Non-determinism is pervasive and varies with model factors like guidance scale and diffusion steps.
Code generation tasks are more sensitive to non-determinism than question answering.
Abstract
Diffusion language models (DLMs) have emerged as a promising paradigm for large language models (LLMs), yet the non-deterministic behavior of DLMs remains poorly understood. Existing non-determinism evaluations for LLMs predominantly rely on dataset-level metrics under fixed inference configurations, providing limited insight into how model behavior varies across runs and evaluation conditions. In this work, we show that dataset-level metrics systematically attenuate non-determinism in diffusion language models by aggregating sample-level prediction quality across different runs. As a result, configurations with similar aggregate performance can exhibit substantially different behaviors on individual inputs, leaving fine-grained instability and distinct error patterns uncharacterized. To address this limitation, we conduct a fine-grained evaluation of non-determinism based on…
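The attenuation effect described in the abstract can be illustrated with a toy example. The sketch below is not the paper's evaluation code: the per-sample outcomes are made up, and `accuracy` and `flip_rate` are hypothetical helper names chosen for illustration. It shows two runs whose dataset-level accuracy is identical even though most individual predictions change between them.

```python
# Hypothetical per-sample correctness (1 = correct, 0 = wrong) for the same
# six inputs under two runs. The values are invented for illustration only.
run_a = [1, 1, 1, 0, 0, 0]
run_b = [0, 0, 1, 1, 1, 0]


def accuracy(run):
    """Dataset-level metric: fraction of inputs answered correctly."""
    return sum(run) / len(run)


def flip_rate(a, b):
    """Sample-level metric: fraction of inputs whose outcome differs between runs."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


acc_gap = abs(accuracy(run_a) - accuracy(run_b))
flips = flip_rate(run_a, run_b)

print(f"accuracy gap between runs: {acc_gap:.2f}")
print(f"per-sample flip rate:      {flips:.2f}")
```

Here the aggregate metric reports no difference between the runs, while the sample-level comparison shows that four of the six inputs changed outcome. This is the sense in which averaging over samples can hide instability on individual inputs.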
