Measuring the Machine: Evaluating Generative AI as Pluralist Sociotechnical Systems
Rebecca L. Johnson

TL;DR
This paper proposes a sociotechnical, process-oriented framework for evaluating generative AI, emphasizing the enactment of values through interactions rather than static benchmarks.
Contribution
The paper introduces Machine-Society-Human (MaSH) Loops for recursive evaluation and a World Values Benchmark based on survey data, and demonstrates their application in real-world cases.
Findings
Value drift was observed in early GPT-3 evaluations.
Sociotechnical evaluation provides deeper insights than traditional benchmarks.
Evaluation influences governance and trust in AI systems.
Abstract
In measurement theory, instruments do not simply record reality; they help constitute what is observed. The same holds for generative AI evaluation: benchmarks do not just measure, they shape what models appear to be. Functionalist benchmarks treat models as isolated predictors, while prescriptive approaches assess what systems ought to be. Both obscure the sociotechnical processes through which meaning and values are enacted, risking the reification of narrow cultural perspectives in pluralist contexts. This thesis advances a descriptive alternative. It argues that generative AI must be evaluated as a pluralist sociotechnical system and develops Machine-Society-Human (MaSH) Loops, a framework for tracing how models, users, and institutions recursively co-construct meaning and values. Evaluation shifts from judging outputs to examining how values are enacted in interaction. Three…
