Adaptive Simulation Experiment for LLM Policy Optimization
Mingjie Hu, Siyang Gao, Jian-qiang Hu, Enlu Zhou

TL;DR
This paper introduces an adaptive simulation experiment framework for identifying the optimal deployment policy for large language models (LLMs), using pairwise comparisons between candidate policies to improve efficiency and performance.
Contribution
It develops a new adaptive experimental procedure, LLM-PO, for identifying the optimal LLM policy from a finite set of candidates, with theoretical guarantees and improved empirical performance over benchmark methods.
Findings
LLM-PO outperforms benchmark methods in experiments.
The framework characterizes the fundamental data requirements for identifying the optimal policy with high probability.
Optimal sampling proportions are derived for different policy spaces.
Abstract
Large language models (LLMs) have significant potential to improve operational efficiency in operations management. Deploying these models requires specifying a policy that governs response quality, shapes user experience, and influences operational value. In this research, we treat LLMs as stochastic simulators and propose a pairwise comparison-based adaptive simulation experiment framework for identifying the optimal policy from a finite set of candidates. We consider two policy spaces: an unstructured space with no parametric assumption, and a structured space in which the data are generated from a preference model. For both settings, we characterize the fundamental data requirements for identifying the optimal policy with high probability. In the unstructured case, we derive a closed-form expression for the optimal sampling proportions, together with a clear operational…
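To make the idea of a pairwise comparison-based adaptive simulation experiment concrete, here is a minimal sketch under explicit assumptions. It is not LLM-PO and does not reproduce the optimal sampling proportions the paper derives. The function `simulate_comparison` is a hypothetical stand-in for querying an LLM under two policies and judging which response is preferred; it assumes a Bradley-Terry preference model with hypothetical per-policy quality parameters `theta`. The adaptive rule (sample the pair whose evidence against a tie, n * (p_hat - 1/2)^2, is currently weakest) and the final Borda-style selection are illustrative heuristics chosen for simplicity.

```python
import math
import random
from itertools import combinations


def simulate_comparison(theta, i, j, rng):
    """Hypothetical stand-in for querying an LLM under policies i and j.

    Assumes a Bradley-Terry preference model: policy i's response is
    preferred with probability exp(theta[i]) / (exp(theta[i]) + exp(theta[j])).
    Returns True if policy i wins the comparison.
    """
    p_i_wins = 1.0 / (1.0 + math.exp(theta[j] - theta[i]))
    return rng.random() < p_i_wins


def adaptive_pairwise_experiment(theta, budget, seed=0):
    """Spend `budget` simulated comparisons and return the index of the
    policy with the highest average empirical win rate."""
    k = len(theta)
    rng = random.Random(seed)
    pairs = list(combinations(range(k), 2))
    if budget < len(pairs):
        raise ValueError("budget must cover one comparison per pair")
    wins = {p: 0 for p in pairs}    # wins[(i, j)]: times i beat j, with i < j
    counts = {p: 0 for p in pairs}  # counts[(i, j)]: comparisons of i vs j

    def compare(pair):
        i, j = pair
        counts[pair] += 1
        if simulate_comparison(theta, i, j, rng):
            wins[pair] += 1

    # Warm start: compare every pair once so every estimate is defined.
    for pair in pairs:
        compare(pair)

    # Adaptive phase (heuristic, not the paper's rule): sample the pair whose
    # evidence against a tie, n * (p_hat - 1/2)^2, is currently weakest.
    for _ in range(budget - len(pairs)):
        pair = min(pairs, key=lambda p: counts[p] * (wins[p] / counts[p] - 0.5) ** 2)
        compare(pair)

    def avg_win_rate(i):
        # Borda-style score: mean empirical probability that i beats each rival.
        rates = []
        for a, b in pairs:
            if a == i:
                rates.append(wins[(a, b)] / counts[(a, b)])
            elif b == i:
                rates.append(1.0 - wins[(a, b)] / counts[(a, b)])
        return sum(rates) / len(rates)

    return max(range(k), key=avg_win_rate)


best = adaptive_pairwise_experiment([0.0, 0.5, 2.0, 1.0], budget=2000)
print("selected policy:", best)
```

Under a Bradley-Terry model the policy with the largest `theta` also has the highest average win probability against its rivals, so the Borda-style score targets the right policy. The heuristic spends more comparisons on nearly tied pairs, which is where noise is most likely to mislead the final ranking.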
