GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents

Jie JW Wu; Ayanda Patrick Herlihy; Ahmad Saleem Mirza; Ali Afoud; and Fatemeh Fard

arXiv:2511.00802·cs.SE·November 4, 2025

Open Access

TL;DR

This paper introduces GrowthHacker, an LLM-based framework that automates code optimization for off-policy evaluation, significantly improving accuracy and reliability on real-world datasets, with the potential to scale data-driven decision-making.

Contribution

It presents a novel benchmark and agent framework leveraging LLMs for automated code optimization in off-policy evaluation, demonstrating substantial performance gains.

Findings

01

Two_agent achieves 100% reliability in OPE tasks.

02

Highest average improvement of 106.7% among positive outcomes.

03

Outperforms existing methods like AutoGen in success rate.

Abstract

With the software industry shifting toward a data-driven culture, online A/B testing is a key tool for evaluating new technologies. However, deploying such experiments requires substantial resources, may negatively impact users, and involves long data collection periods. To address this, off-policy evaluation (OPE), or offline A/B testing, uses logged data to assess technologies and is fundamental in Reinforcement Learning, making it crucial in domains where online testing is costly or risky, such as healthcare, recommender systems, education, dialog systems, and robotics. Despite advances in coding LLMs and agentic AI, little is known about leveraging them to optimize OPE results. We investigate whether LLMs and LLM-based agents can improve OPE performance via code optimization. We propose GrowthHacker, a benchmark with agent and baseline methods on large-scale…
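
To make the idea of assessing a policy from logged data concrete, here is a minimal sketch of inverse propensity scoring (IPS), a standard textbook OPE estimator. It is not taken from the paper and is not claimed to be the code that GrowthHacker optimizes; the function name `ips_estimate` and the toy numbers are assumptions made purely for illustration.

```python
from typing import Sequence


def ips_estimate(
    rewards: Sequence[float],
    logging_probs: Sequence[float],
    target_probs: Sequence[float],
) -> float:
    """Inverse propensity scoring estimate of a target policy's value.

    Each index i is one logged interaction:
      rewards[i]       -- reward observed for the logged action
      logging_probs[i] -- probability the logging policy gave that action
      target_probs[i]  -- probability the target policy gives that action
    (Hypothetical helper for illustration, not the paper's implementation.)
    """
    n = len(rewards)
    if n == 0:
        raise ValueError("need at least one logged interaction")
    if not (n == len(logging_probs) == len(target_probs)):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for reward, p_log, p_tgt in zip(rewards, logging_probs, target_probs):
        if p_log <= 0.0:
            raise ValueError("logging probabilities must be positive")
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        total += (p_tgt / p_log) * reward
    return total / n


# Toy log: a uniform logging policy over two actions (assumed data).
rewards = [1.0, 0.0, 1.0, 0.0]
logging_probs = [0.5, 0.5, 0.5, 0.5]
target_probs = [0.8, 0.2, 0.8, 0.2]
estimate = ips_estimate(rewards, logging_probs, target_probs)
```

In this toy log the target policy puts more probability on the actions that earned a reward, so its estimated value is higher than the logged average reward of 0.5. IPS is unbiased when the logging probabilities are correct and positive wherever the target policy acts, but its variance can be large when the importance weights are large.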

Taxonomy

Topics: Advanced Bandit Algorithms Research · Mobile Crowdsensing and Crowdsourcing · Software Engineering Techniques and Practices