Helix: Evolutionary Reinforcement Learning for Open-Ended Scientific Problem Solving
Chang Su, Zhongkai Hao, Zhizhou Zhang, Zeyu Xia, Youjia Wu, Hang Su, Jun Zhu

TL;DR
HELIX is a hierarchical evolutionary reinforcement learning framework that improves exploration and solution quality on open-ended scientific problems by combining in-context learning with iterative policy refinement.
Contribution
HELIX combines a diverse pool of high-quality candidate solutions with reinforcement learning for iterative improvement, advancing open-ended problem solving with LLMs.
Findings
Achieved state-of-the-art circle packing results with a 14B model.
Surpassed GPT-4o on standard ML benchmarks by 5.95 F1 points.
Demonstrated effectiveness in complex scientific problem solving.
Abstract
Large language models (LLMs) with reasoning abilities have demonstrated growing promise for tackling complex scientific problems. Yet such tasks are inherently domain-specific, unbounded and open-ended, demanding exploration across vast and flexible solution spaces. Existing approaches, whether purely learning-based or reliant on carefully designed workflows, often suffer from limited exploration efficiency and poor generalization. To overcome these challenges, we present HELIX -- a Hierarchical Evolutionary reinforcement Learning framework with In-context eXperiences. HELIX introduces two key novelties: (i) a diverse yet high-quality pool of candidate solutions that broadens exploration through in-context learning, and (ii) reinforcement learning for iterative policy refinement that progressively elevates solution quality. This synergy enables the discovery of more advanced solutions.…
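The first component named in the abstract, a candidate pool that is both diverse and high-quality, can be illustrated with a Pareto (non-dominated) selection over two objectives. This is a minimal sketch and not the authors' implementation. The reviews mention NSGA-II and a KNN-based diversity metric, but the function names, the use of plain coordinate tuples as solution embeddings, and the value of `k` are assumptions made for illustration. The crowding-distance step of NSGA-II is omitted.

```python
from math import dist


def knn_diversity(points, k=2):
    """Mean distance from each point to its k nearest neighbours.

    Higher values mean the candidate sits in a sparser region.
    `points` is a hypothetical embedding of each candidate solution.
    """
    scores = []
    for i, p in enumerate(points):
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = distances[:k]
        scores.append(sum(nearest) / len(nearest) if nearest else 0.0)
    return scores


def pareto_front(objectives):
    """Indices of candidates that no other candidate dominates.

    All objectives are maximised. `a` dominates `b` when `a` is at least
    as good on every objective and strictly better on at least one.
    """
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return [
        i
        for i, a in enumerate(objectives)
        if not any(dominates(b, a) for j, b in enumerate(objectives) if j != i)
    ]


def select_pool(qualities, embeddings, k=2):
    """Keep candidates on the first Pareto front of (quality, diversity)."""
    diversity = knn_diversity(embeddings, k)
    return pareto_front(list(zip(qualities, diversity)))


# Candidate 2 has mediocre quality but is far from the others, so it survives;
# candidate 3 is both worse and no more novel than candidate 1, so it is dropped.
qualities = [0.9, 0.8, 0.5, 0.4]
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 0.1)]
kept = select_pool(qualities, embeddings)
```

Selecting on the front rather than on quality alone is what keeps an unusual but weaker candidate available as in-context material for later rounds.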
Peer Reviews
Decision: ICLR 2026 Poster
1. This work effectively combines the known strategies for reasoning with LLMs, the RL fine-tuning, evolutionary algorithms, and in-context learning. To my knowledge, this combination is new.
2. The proposed method shows strong empirical results compared to baselines across many scientific problems, showing its practicality.
3. The source code is provided.
1. (Clarity) One of my main concerns is the clarity of the paper. While this work is understandable and easy-to-read at a high level, there is some ambiguity at low levels and many details seem to be missing. I left many questions to clarify some points that I couldn't understand; see the questions below. Additionally, I believe providing a detailed formalised algorithm and LLM prompts used for each experiment would be valuable.
2. (Method) I believe GRPO is an on-policy RL algorithm, but the a
- Novel combination of reinforcement learning, evolutionary algorithms, and in-context prompting, addressing key limitations of existing approaches (e.g., entropy collapse, lack of diversity).
- Strong empirical results across a wide range of domains, including physical design and scientific optimization, demonstrating the generality of the method.
- Clear framework design and well-articulated motivation connecting HELIX’s components to the nature of open-ended scientific discovery.
- Ablatio
- While HELIX shows impressive performance across various scientific tasks, the paper provides limited discussion on the sensitivity of the framework to its key hyperparameters. In particular, the reward normalization constants (e.g., denominators used in Eq. 6–10 for physics tasks) and NSGA-II parameters (e.g., population size, crowding distance, KNN-based diversity metric) appear to play a crucial role in balancing exploration and exploitation, yet no analysis is offered regarding their influe
- The paper studies a setting of current interest to the community: open-ended problem solving.
- The approach proposed in the paper, HELIX, is novel and interesting (though some details are unclear).
- The experiments show the efficacy of the proposed method on interesting scientific reasoning tasks.
- My primary concern with the paper is with the description of the algorithm. It's unclear how NSGA-II is integrated into the pipeline. For instance, in Figure 2, the solutions selected via NSGA-II are passed to the RL algorithm as feedback. How does that work?
- While the experiments are interesting, on many tasks, the Direct Prompt baseline works pretty well already. Additionally, on some tasks, Open Evolve performs worse than Direct Prompt. The authors should use more challenging domains to
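One review above refers to GRPO as the RL algorithm. As background for that comment, the sketch below shows the group-relative advantage commonly associated with GRPO: each sampled solution's reward is standardised against the other samples drawn for the same prompt. This is a hedged illustration only. The function name, the `eps` constant, and the use of the population standard deviation are assumptions, and nothing here is taken from the HELIX source code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of samples for the same prompt.

    A sample scores a positive advantage only if it beats the group mean,
    so no separate value network is needed to provide a baseline.
    `eps` (an assumed constant) guards against a zero standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled solutions for one prompt, scored by a task-specific reward.
advantages = group_relative_advantages([1.0, 2.0, 3.0])

# When every sample earns the same reward, the group carries no learning signal.
flat = group_relative_advantages([0.5, 0.5])
```

Because the baseline is the mean of samples drawn from the current policy, the estimate is on-policy, which is the property the reviewer's comment turns on.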
Taxonomy
Topics: Machine Learning in Materials Science · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education
