MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses
Zonglin Yang, Wanhao Liu, Ben Gao, Tong Xie, Yuqiang Li, Wanli Ouyang, Soujanya Poria, Erik Cambria, Dongzhan Zhou

TL;DR
This paper introduces MOOSE-Chem, an LLM-based framework for autonomous discovery of novel, high-quality chemistry hypotheses by decomposing the process into retrieval, composition, and ranking tasks, validated on a new benchmark.
Contribution
The work presents a formal mathematical decomposition of hypothesis discovery and implements it through the MOOSE-Chem framework, demonstrating LLMs' ability to rediscover hypotheses without data contamination.
Findings
LLMs can effectively retrieve inspirations for hypotheses.
MOOSE-Chem successfully rediscovered core hypotheses from recent high-impact papers.
LLMs may encode latent scientific knowledge not yet recognized by humans.
Abstract
Scientific discovery plays a pivotal role in advancing human society, and recent progress in large language models (LLMs) suggests their potential to accelerate this process. However, it remains unclear whether LLMs can autonomously generate novel and valid hypotheses in chemistry. In this work, we investigate whether LLMs can discover high-quality chemistry hypotheses given only a research background (comprising a question and/or a survey), without restriction on the domain of the question. We begin with the observation that hypothesis discovery is a seemingly intractable task. To address this, we propose a formal mathematical decomposition grounded in a fundamental assumption: that most chemistry hypotheses can be composed from a research background and a set of inspirations. This decomposition leads to three practical subtasks: retrieving inspirations, composing hypotheses with…
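The three subtasks named in the abstract (retrieve inspirations, compose hypotheses, rank them) can be pictured as a small pipeline. The sketch below is only an illustration of that control flow, not the MOOSE-Chem implementation: every function name is hypothetical, and simple word overlap and a caller-supplied scoring function stand in for the LLM calls the paper relies on.

```python
from typing import Callable, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens; a crude stand-in for semantic matching."""
    return set(text.lower().split())


def retrieve(background: str, corpus: List[str], k: int) -> List[str]:
    """Assumed retrieval step: keep the k corpus entries sharing the most words
    with the background. The paper uses an LLM here; this is a placeholder."""
    bg = tokens(background)
    ranked = sorted(corpus, key=lambda doc: len(tokens(doc) & bg), reverse=True)
    return ranked[:k]


def compose(background: str, inspiration: str) -> str:
    """Assumed composition step: combine background and one inspiration into a
    candidate hypothesis. A real system would prompt an LLM instead."""
    return f"To address '{background}', build on '{inspiration}'."


def rank(hypotheses: List[str], score: Callable[[str], float]) -> List[str]:
    """Assumed ranking step: order candidates by a caller-supplied score."""
    return sorted(hypotheses, key=score, reverse=True)


def discover(background: str, corpus: List[str], k: int,
             score: Callable[[str], float]) -> List[str]:
    """Chain the three subtasks: retrieve, then compose, then rank."""
    inspirations = retrieve(background, corpus, k)
    candidates = [compose(background, insp) for insp in inspirations]
    return rank(candidates, score)


if __name__ == "__main__":
    # Toy inputs invented for illustration only.
    demo_corpus = [
        "copper catalyst for nitrogen reduction",
        "poetry of the baroque era",
        "electrolyte additives for nitrogen reduction",
    ]
    for hypothesis in discover("improve nitrogen reduction yield",
                               demo_corpus, k=2, score=len):
        print(hypothesis)
```

Keeping the three steps as separate functions mirrors the decomposition: each stand-in can be swapped for a model-backed version without touching the other two.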
Peer Reviews
Decision: ICLR 2025 Poster
Originality: Firstly, while LLMs have been utilized for scientific discovery in social science and NLP, this paper is the first to investigate their potential in chemistry. Besides, the MOOSE-CHEM framework employs a three-step approach to retrieve inspiration papers, infer valid knowledge, identify hypotheses, and rank them, which hasn’t been used in previous research. Moreover, the use of the evolutionary algorithm to foster a broader diversity in hypothesis generation is also an innovati
Firstly, using the same large language model to evaluate its own generated results may introduce bias. It is recommended to try using different LLMs to evaluate the results so as to guarantee the reliability of the results. For example, consider using models like LLaMa[1], Claude[2], Gemini[3], or other recent LLMs to compare outputs. If using the same LLM is necessary, you could collect hypotheses generated by humans and also have both experts and GPT-4 evaluate them. Then, compare their Hard/S
1. Generating research hypotheses is a complicated task, and the authors heuristically decomposed hypothesis generation into two steps: (1). inspiration retrieval, and (2). hypothesis refinement. In the hypothesis refinement step, the authors propose a novel “mutate and recombine” trick to help generate good hypotheses. 2. The experiments to verify each of the research questions are well-designed with good quality.
1. The introduction section could be written better and more clearly. (a) It would be great if the authors could provide a summary of the major contributions of this work at the end of the introduction section. What really are the contributions to the field? (b) It would be great if the authors could briefly discuss why the decomposition of the major question is necessary, what’s the difference or connection between the proposed inspiration identification (the first step of the three) and Retrieval
The paper is generally well-written, and in good English. The text is clear and the authors did a good job guiding the reader through the motivation, the derivation of the method and motivating each of the proposed steps and experiments. The topic of the paper is very relevant and the results are positive. Related work is well covered, and experiments are included that compare the proposed method with previous work. Every claim made on the performance of the method is generally backed up with e
My main concerns with the paper are regarding the reproducibility, clarity and discussion of the approach: - Reproducibility: we note that the authors introduce a scientific benchmark, along with a novel framework for hypothesis generation. However, the authors do not provide access to the newly introduced benchmark, which hampers the ability to really discriminate the difficulty of the tasks at hand. Additionally, this impedes the ability to reproduce the results or for future work to compare
Taxonomy
Topics: Machine Learning in Materials Science · Advanced Text Analysis Techniques
Methods: Sparse Evolutionary Training
