Diffusion Large Language Models for Black-Box Optimization
Ye Yuan, Can (Sam) Chen, Zipeng Sun, Dinghuai Zhang, Christopher Pal, Xue Liu

TL;DR
This paper introduces dLLM, a diffusion large language model approach for black-box optimization that leverages bidirectional modeling and iterative denoising, achieving state-of-the-art results in few-shot design tasks.
Contribution
The paper proposes a novel diffusion LLM framework with in-context denoising and masked diffusion tree search for improved black-box optimization performance.
Findings
Achieves state-of-the-art results on Design-Bench in few-shot settings.
Effectively captures bidirectional dependencies in design generation.
Demonstrates the advantage of diffusion LLMs over autoregressive models.
Abstract
Offline black-box optimization (BBO) aims to find optimal designs based solely on an offline dataset of designs and their labels. Such scenarios frequently arise in domains like DNA sequence design and robotics, where only a few labeled data points are available. Traditional methods typically rely on task-specific proxy or generative models, overlooking the in-context learning capabilities of pre-trained large language models (LLMs). Recent efforts have adapted autoregressive LLMs to BBO by framing task descriptions and offline datasets as natural language prompts, enabling direct design generation. However, these designs often contain bidirectional dependencies, which left-to-right models struggle to capture. In this paper, we explore diffusion LLMs for BBO, leveraging their bidirectional modeling and iterative refinement capabilities. This motivates our in-context denoising module: we…
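As a rough illustration of the prompt framing the abstract describes (a task description plus the offline dataset rendered as text), here is a minimal sketch. Everything in it is an assumption made for illustration, not the paper's actual format: the function name `build_prompt`, the line layout, and the `[MASK]` placeholder standing in for the positions a diffusion LLM would denoise.

```python
from typing import Sequence

# Hypothetical mask placeholder; real diffusion LLMs define their own mask token.
MASK = "[MASK]"


def build_prompt(
    task: str,
    designs: Sequence[Sequence[float]],
    scores: Sequence[float],
    design_len: int,
) -> str:
    """Render a task description and offline (design, score) pairs as text,
    followed by a fully masked slot for the new design."""
    if len(designs) != len(scores):
        raise ValueError("designs and scores must have the same length")

    lines = [f"Task: {task}", "Offline examples:"]
    for design, score in zip(designs, scores):
        values = ", ".join(f"{v:.2f}" for v in design)
        lines.append(f"Design: {values} | Score: {score:.2f}")

    # The slot to be filled in: one mask placeholder per design dimension.
    lines.append("New design: " + ", ".join([MASK] * design_len))
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("maximize the score", [[0.1, 0.9], [0.4, 0.2]], [1.5, 0.3], 2))
```

An autoregressive model would fill such a slot left to right, whereas a masked diffusion model can revise any masked position at each step, which is the bidirectional property the abstract appeals to.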
Peer Reviews
Decision · Submitted to ICLR 2026
1. First of all, this work is well presented, and its way of introducing the methodology is easy to follow. The paper also provides a thorough review of related work in Section 5, which, to my understanding, is important because, since 2023, there has been a line of work applying diffusion models/LLMs to black-box optimization. Diffusion LMs are also a kind of LLM, with strong similarities and relevance; it is extremely important to situate this work with respect to prior work. 2. Re
1. First, this work extends LLMs for BBO to diffusion LMs. The stated reason for this extension is that diffusion models capture bidirectional dependencies. This does not seem to be wrong, but it is also highly nontrivial to evaluate. Have the authors come up with a mechanistic theory that formally supports this argument? 2. The originality of the paper is somewhat limited. As mentioned, the work feels like a natural and incremental extension of existing ideas. The field
The empirical gains are impressive and showcase a strong case for using pretrained diffusion models for BBO in planning problems. In this case, autoregressive LLM-based methods such as ORPO do not even perform as well as the classical Gaussian process-based methods. The algorithm design is a simple adaptation of MCTS to this case. There are extensive ablations on the effect of tree depth, branching factor, and offline dataset size.
The paper is poorly written on a technical level, and the algorithmic and experimental details have not been explained within the paper. See "Questions" for some specific queries in this regard. Given these drawbacks, I cannot recommend acceptance of this work. Since this paper uses a large diffusion model whereas previous works in this domain use simple techniques such as Gaussian processes, a comparison of the computational complexity of the various methods is important. This is not provided in the p
- The proposed method consists of a straightforward integration of several existing techniques. The use of a GP's EI as a reward signal for an MCTS-guided denoising process is intuitive.
- The paper is overall well-written and the ablation studies are well-designed to demonstrate the importance of each of the proposed components.
- This work highlights a promising new direction for LLM-based optimizers by moving from autoregressive to diffusion-based models. If the efficiency concerns can be addre
- The paper's most significant weakness is its lack of theoretical analysis. Specifically, the proposed MDTS is presented without any formal guarantees (such as regret analysis), despite being built upon the MCTS framework.
- The method's guidance relies on EI from a GP trained on only a few examples in very high dimensional spaces. However, it is well-known that GPs with standard kernels often perform poorly in such settings (curse of dimensionality), and their uncertainty estimates can be unrel
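The reviews above refer to expected improvement (EI) from a Gaussian process as the reward signal. For readers unfamiliar with it, here is a minimal sketch of the standard closed-form EI for maximization, given a GP posterior mean and standard deviation at a candidate design. The function names and the exploration margin `xi` are illustrative assumptions; this is the textbook formula, not necessarily the exact variant used in the paper.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu: float, sigma: float, best: float, xi: float = 0.0) -> float:
    """Closed-form EI for maximization.

    mu, sigma: GP posterior mean and standard deviation at the candidate.
    best: best score observed in the offline dataset.
    xi: optional exploration margin (an illustrative assumption).
    """
    improvement = mu - best - xi
    if sigma <= 0.0:
        # No predictive uncertainty: EI reduces to the plain positive part.
        return max(improvement, 0.0)
    z = improvement / sigma
    return improvement * normal_cdf(z) + sigma * normal_pdf(z)


if __name__ == "__main__":
    # Same mean, different uncertainty: the more uncertain candidate scores higher.
    print(expected_improvement(mu=1.0, sigma=0.1, best=1.0))
    print(expected_improvement(mu=1.0, sigma=0.5, best=1.0))
```

Because EI grows with `sigma`, a GP whose uncertainty estimates are poorly calibrated will also give a distorted reward, which is the concern the last review bullet raises.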
Taxonomy
Topics: Advanced Multi-Objective Optimization Algorithms · Machine Learning in Materials Science · Machine Learning and Data Classification
