Evolutionary Search for Automated Design of Uncertainty Quantification Methods
Mikhail Seleznyov, Daniil Korbut, Viktor Moskvoretskii, Oleg Somov, Alexander Panchenko, Elena Tutubalina

TL;DR
This paper introduces an automated evolutionary search approach powered by large language models to design uncertainty quantification methods, outperforming manual baselines in claim verification tasks.
Contribution
It demonstrates that LLM-driven evolutionary search can automatically generate effective, interpretable UQ methods that generalize well across datasets, reducing reliance on manual design.
Findings
Evolved methods outperform strong manually designed baselines, with up to 6.7% relative ROC-AUC improvement across 9 datasets.
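Different LLMs follow distinct evolutionary strategies: Claude models consistently design high-feature-count linear estimators, while Gpt-oss-120B gravitates toward simpler, more interpretable positional weighting schemes.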
Only Sonnet 4.5 and Opus 4.5 reliably turn increased method complexity into better performance, with some regressions observed.
Abstract
Uncertainty quantification (UQ) methods for large language models are predominantly designed by hand based on domain knowledge and heuristics, limiting their scalability and generality. We apply LLM-powered evolutionary search to automatically discover unsupervised UQ methods represented as Python programs. On the task of atomic claim verification, our evolved methods outperform strong manually designed baselines, achieving up to 6.7% relative ROC-AUC improvement across 9 datasets while generalizing robustly out-of-distribution. Qualitative analysis reveals that different LLMs employ distinct evolutionary strategies: Claude models consistently design high-feature-count linear estimators, while Gpt-oss-120B gravitates toward simpler and more interpretable positional weighting schemes. Surprisingly, only Sonnet 4.5 and Opus 4.5 reliably leverage increased method complexity…
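To make the setup concrete, here is a minimal sketch of how an unsupervised UQ method written as a Python program might be scored with ROC-AUC, the fitness signal an evolutionary search could select on. Everything in it is an illustrative assumption rather than the authors' code: the function names, the geometric `decay` weighting, the toy log-probabilities, and the convention that label 1 marks an incorrect claim are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def mean_neg_logprob(token_logprobs: Sequence[float]) -> float:
    """Hypothetical baseline: average negative log-probability of a claim's tokens."""
    return -sum(token_logprobs) / len(token_logprobs)


def positional_weighted_uncertainty(
    token_logprobs: Sequence[float], decay: float = 0.8
) -> float:
    """Hypothetical positional weighting: token i gets weight decay**i,
    so earlier tokens count more when decay < 1 (an assumption, not the paper's scheme)."""
    weights = [decay ** i for i in range(len(token_logprobs))]
    weighted = sum(w * lp for w, lp in zip(weights, token_logprobs))
    return -weighted / sum(weights)


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC-AUC as the probability that a positive example outscores a negative one
    (ties count as 0.5). Quadratic in the number of examples, fine for a sketch."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, e.g. 0.067 for a 6.7% relative gain."""
    return (new - baseline) / baseline


def fitness(
    method: Callable[[Sequence[float]], float],
    data: List[Tuple[Sequence[float], int]],
) -> float:
    """Score a candidate UQ method: higher uncertainty should flag incorrect claims."""
    scores = [method(logprobs) for logprobs, _ in data]
    labels = [label for _, label in data]
    return roc_auc(scores, labels)


# Toy data (made up): per-claim token log-probabilities, label 1 = incorrect claim.
toy_claims = [
    ([-0.1, -0.2, -0.1], 0),
    ([-1.5, -0.3, -0.2], 1),
    ([-0.2, -0.1, -0.3], 0),
    ([-0.9, -1.1, -0.4], 1),
]

candidates = {
    "mean_neg_logprob": mean_neg_logprob,
    "positional_weighted": positional_weighted_uncertainty,
}

# Selection step of a search loop: keep the candidate with the best ROC-AUC.
results = {name: fitness(fn, toy_claims) for name, fn in candidates.items()}
best_name = max(results, key=results.get)
print(results, "best:", best_name)
```

In an actual LLM-driven search, the candidate programs would presumably be proposed and mutated by the model rather than written by hand; the sketch covers only the evaluate-and-select step.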
