MiGrATe: Mixed-Policy GRPO for Adaptation at Test-Time
Peter Phan, Dhruv Agarwal, Kavitha Srinivas, Horst Samulowitz, Pavan Kapanipathi, Andrew McCallum

TL;DR
MiGrATe is an online test-time training method that builds its training groups from a mix of on-policy samples and off-policy selections, adapting a large language model to an optimization task without external training data.
Contribution
The paper proposes MiGrATe, a novel online TTT approach that combines on-policy sampling with greedy and neighborhood off-policy techniques for better adaptation of LLMs.
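As a rough illustration of the mixed-policy idea, the sketch below assembles one group from three sources: fresh on-policy samples, the top-scoring entries of a solution database (greedy), and variations derived from those entries (neighborhood). It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `build_mixed_group`, the callables `sample_on_policy`, `sample_neighbor`, and `reward`, and the group sizes are all hypothetical.

```python
from typing import Callable, List, Tuple

Solution = str
Scored = Tuple[Solution, float]  # (candidate solution, reward)


def build_mixed_group(
    sample_on_policy: Callable[[], Solution],        # hypothetical: draws from the current model
    sample_neighbor: Callable[[Solution], Solution],  # hypothetical: varies a given solution
    reward: Callable[[Solution], float],              # hypothetical: black-box objective
    database: List[Scored],
    n_on_policy: int = 4,   # assumed group sizes, not taken from the paper
    n_greedy: int = 2,
    n_neighbors: int = 2,
) -> List[Scored]:
    """Assemble one training group from on-policy and off-policy sources."""
    # On-policy: fresh samples from the current model.
    fresh = []
    for _ in range(n_on_policy):
        s = sample_on_policy()
        fresh.append((s, reward(s)))

    # Greedy off-policy: reuse the best solutions seen so far.
    top = sorted(database, key=lambda item: item[1], reverse=True)[:n_greedy]

    # Neighborhood off-policy: variations derived from the top solutions.
    neighbors = []
    if top:
        for i in range(n_neighbors):
            child = sample_neighbor(top[i % len(top)][0])
            neighbors.append((child, reward(child)))

    # Only newly evaluated solutions are added to the database.
    database.extend(fresh + neighbors)
    return fresh + top + neighbors
```

With an empty database the group falls back to on-policy samples only, which is one way to handle the first iteration; how the paper initializes the database is not stated in this summary.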
Findings
Outperforms inference-only baselines on multiple tasks
Effective in diverse domains like word search and molecule optimization
Demonstrates potential for complex search tasks without external data
Abstract
Large language models (LLMs) are increasingly being applied to black-box optimization tasks, from program synthesis to molecule design. Prior work typically leverages in-context learning to iteratively guide the model towards better solutions. Such methods, however, often struggle to balance exploration of new solution spaces with exploitation of high-reward ones. Recently, test-time training (TTT) with synthetic data has shown promise in improving solution quality. However, the need for hand-crafted training data tailored to each task limits feasibility and scalability across domains. To address this problem, we introduce MiGrATe, a method for online TTT that uses GRPO as a search algorithm to adapt LLMs at inference without requiring external training data. MiGrATe operates via a mixed-policy group construction procedure that combines on-policy sampling with two off-policy data…
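For readers unfamiliar with GRPO, its core step is to score each candidate relative to the other members of its group rather than against a learned value function. The sketch below shows that group-relative normalization in isolation. It is a generic illustration under assumptions: the function name `group_relative_advantages` is hypothetical, the use of the population standard deviation and the `eps` constant are choices made here, and the abstract does not say how MiGrATe treats off-policy group members in the policy update.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Candidates that beat the group average receive a positive advantage and the rest a negative one, so a group with identical rewards contributes no learning signal.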
Peer Reviews
Decision·Submitted to ICLR 2026
The paper's most significant contribution is its generalizable approach that eliminates the need for handcrafted, task-specific training data, which has been a major limitation of prior test-time training methods. All training signals in MiGrATe are model-generated, making the method applicable across diverse domains without requiring domain expertise to curate training examples, which is particularly helpful in low-data domains. The design of the three-component mixed-policy strategy is well-motivate
The computational cost is not thoroughly discussed, which is a significant oversight for a test-time training method. While runtime is mentioned in the appendix (for example, 51 minutes per ARC task on an A100 GPU), there is insufficient analysis comparing the cost versus inference-only methods, examining memory requirements for LoRA fine-tuning, assessing scalability to larger models or longer horizons, or analyzing trade-offs between TTT overhead and solution quality improvements. The method
1. I think this paper presents a novel perspective on LLMs-as-optimizers work such as OPRO. The authors identify the exploration-exploitation imbalance in such in-context learning approaches and make an interesting attempt to rebalance this tradeoff using TTT. 2. I appreciate that the authors provide code for reproducibility checking.
1. While I acknowledge that the overall methodology the authors have proposed is solid and interesting (self-supervision), I have to say that I cannot see the real, practical value of this work for real-world optimization problems. I understand that the authors may not be long-standing optimization researchers; however, in a realistic scenario, the method may not be practical, since it requires training an LLM to solve a single problem. I find this quite opposite to existing
1. The introduction of greedy and neighborhood sampling from a historical database reduces reward sparsity in complex optimization scenarios and lowers the expertise required for offline data preparation. 2. The proposed method outperforms GRPO-based test-time training variants. The authors also provide a detailed analysis of the exploration-exploitation tradeoff in the mixed sampling method.
1. At the beginning of the solution search with MiGrATe, the solutions in the database might not be high-performing, as might the top-k solutions and the neighborhood solutions derived from them. Since a large proportion of solutions may be derived from the database initially (greedy sampling + neighborhood sampling), MiGrATe's performance might be significantly influenced by the initial sampling, which can be variable. This could be problematic when the reward is sparse and on-policy sampling f
Taxonomy
Topics: Machine Learning in Materials Science · Machine Learning and Data Classification · Machine Learning and Algorithms
