Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation
Jiatong Li, Junxian Li, Weida Wang, Yunqing Liu, Changmeng Zheng, Dongzhan Zhou, Xiao-yong Wei, Qing Li

TL;DR
This paper introduces S^2-Bench, a novel benchmark for evaluating large language models in open-domain, natural language-driven molecule generation, emphasizing their ability to produce diverse and valid molecular candidates beyond single-answer retrieval.
Contribution
The paper presents S^2-Bench, the first benchmark for one-to-many molecule generation tasks, and introduces OpenMolIns, a large instruction tuning dataset that improves Llama-3.1-8B's performance on these tasks.
Findings
LLMs can generate diverse valid molecules in open-domain settings.
OpenMolIns enhances Llama-3.1-8B's molecule generation capabilities.
Benchmark reveals limitations of existing models in creative molecular design.
Abstract
Recently, Large Language Models (LLMs) have shown great potential in natural language-driven molecule discovery. However, existing datasets and benchmarks for molecule-text alignment are predominantly built on a one-to-one mapping, measuring LLMs' ability to retrieve a single, pre-defined answer, rather than their creative potential to generate diverse, yet equally valid, molecular candidates. To address this critical gap, we propose Speak-to-Structure (S^2-Bench), the first benchmark to evaluate LLMs in open-domain natural language-driven molecule generation. S^2-Bench is specifically designed for one-to-many relationships, challenging LLMs to demonstrate genuine molecular understanding and generation capabilities. Our benchmark includes three key tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom), each probing a different…
Peer Reviews
Decision: Submitted to ICLR 2026
1. The benchmark allows many valid molecules per prompt, not just one “right” answer, which is closer to how chemists actually design.
2. Success is checked automatically, and everything rolls up into a single headline number (WSR) so models are straightforward to compare.
3. Many models are evaluated side-by-side, revealing where current methods struggle (especially de-novo constraints) and showing that instruction-tuning on the released data can meaningfully boost performance.
1. The “weighted success rate” is computed as *success × one quality term* (similarity for MolEdit/MolOpt; novelty for MolCustom), then averaged uniformly across nine subtasks. Because there are no reported thresholds or sensitivity analyses, rankings may be unstable under this multiplicative choice and the equal subtask weights.
2. Prompts are generated from fixed templates, and MolCustom’s constraints largely boil down to counts of atoms, bonds, or functional groups. Important real-world specs
The paper makes a timely and impactful contribution by redefining how natural-language-driven molecule generation should be evaluated. (1) The benchmark is conceptually original, introducing the one-to-many mapping paradigm that better reflects chemical diversity and real-world design scenarios. (2) The dataset generation pipeline is well-engineered and reproducible, integrating chemical computation (RDKit) with LLM-based linguistic diversification. (3) The evaluation design—combining MolE
The main weakness lies in the limited methodological depth. While the benchmark is well-designed, the work does not provide theoretical insight into the relationship between language semantics and chemical structure reasoning. The Weighted Success Rate metric, though practical, appears heuristic, and its weighting choices are not empirically justified. The programmatic data generation may also introduce semantic drift between the instruction and molecule, especially after paraphrasing by LLMs. A
The principle that molecule generation should follow a one-to-many formulation is important, and the authors have built a benchmark around it and evaluated multiple LLMs on it. However, the core concept and the benchmark construction method in this paper are highly similar to those of work [1].
The methodology is highly similar to that of work [1], and the study is not comprehensive. All evaluated LLMs are general-purpose models; specialized, domain-specific models are noticeably absent.
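The weighted success rate (WSR) described in the reviews above can be sketched in a few lines. This is a minimal illustration that follows only the reviewers' wording (success multiplied by one quality term, then a uniform average over subtasks); it is not the authors' implementation. It assumes the product is taken on subtask-level aggregates and that both terms lie in [0, 1], neither of which is stated in the text. The function names, subtask labels, and numbers are all hypothetical.

```python
from statistics import fmean


def weighted_success_rate(success_rate: float, quality: float) -> float:
    """Success rate times one quality term for a single subtask.

    Per the reviews, the quality term is similarity for MolEdit/MolOpt and
    novelty for MolCustom. Assumption: both values are fractions in [0, 1].
    """
    if not (0.0 <= success_rate <= 1.0 and 0.0 <= quality <= 1.0):
        raise ValueError("both terms are expected to lie in [0, 1]")
    return success_rate * quality


def headline_wsr(subtasks: dict[str, tuple[float, float]]) -> float:
    """Uniform (unweighted) mean of per-subtask WSR values."""
    if not subtasks:
        raise ValueError("need at least one subtask")
    return fmean(weighted_success_rate(sr, q) for sr, q in subtasks.values())


# Made-up numbers for illustration only; these are not results from the paper.
example = {
    "MolEdit/subtask-1": (0.60, 0.50),    # (success rate, similarity)
    "MolCustom/subtask-1": (0.20, 1.00),  # (success rate, novelty)
}
print(headline_wsr(example))
```

Because the two terms are multiplied, a model with a high success rate but low similarity or novelty can score the same as one with the opposite profile, which is the ranking sensitivity the first reviewer raises.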
Taxonomy
Topics: Wikis in Education and Collaboration · Natural Language Processing Techniques · Mathematics, Computing, and Information Processing
Methods: Residual Connection · Linear Layer · Weight Decay · Cosine Annealing · Linear Warmup With Cosine Annealing · Softmax · Attention Dropout · Attention Is All You Need
