Measuring Agents in Production

Melissa Z. Pan; Negar Arabzadeh; Riccardo Cogo; Yuxuan Zhu; Alexander Xiong; Lakshya A Agrawal; Huanzhi Mao; Emma Shen; Sid Pallerla; Liana Patel; Shu Liu; Tianneng Shi; Xiaoyuan Liu; Jared Quincy Davis; Emmanuele Lacavalla; Alessandro Basile; Shuyi Yang; Paul Castro; Daniel Kang; Joseph E. Gonzalez; Koushik Sen; Dawn Song; Ion Stoica; Matei Zaharia; Marquita Ellis

arXiv:2512.04123 · cs.CY · February 4, 2026

TL;DR

This paper systematically studies how large language model-based agents are deployed in production. Drawing on practitioner data, it documents common practices and challenges and finds that teams rely on simple, controllable methods.

Contribution

The authors present it as the first systematic study of production deployment practices for LLM agents, based on 20 interview-based case studies and a survey of 306 practitioners, spanning 26 domains.

Findings

01

68% of production agents execute at most 10 steps before human intervention.

02

70% rely on prompting off-the-shelf models rather than weight tuning.

03

Reliability remains the top development challenge, addressed mainly through system design.

Abstract

LLM-based agents already operate in production across many industries, yet we lack an understanding of what technical methods make deployments successful. We present the first systematic study of Measuring Agents in Production, MAP, using first-hand data from agent developers. We conducted 20 case studies via in-depth interviews and surveyed 306 practitioners across 26 domains. We investigate why organizations build agents, how they build them, how they evaluate them, and their top development challenges. Our study finds that production agents are built using simple, controllable approaches: 68% execute at most 10 steps before human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. Reliability (consistent correct behavior over time) remains the top development challenge, which practitioners currently address…

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Mobile Crowdsensing and Crowdsourcing · Ethics and Social Impacts of AI