Measure what Matters: Psychometric Evaluation of AI with Situational Judgment Tests
Alexandra Yost, Shreyans Jain, Shivam Raval, Grant Corser, Allen Roush, Nina Xu, Jacqueline Hammack, Ravid Shwartz-Ziv, Amirali Abdullah

TL;DR
This paper introduces a psychometric framework using situational judgment tests and multidimensional item response theory to evaluate the stability and structure of AI behaviors conditioned by personas.
Contribution
It presents a novel approach for measuring stable behavioral tendencies in AI models, moving beyond superficial variations and classical self-report methods.
Findings
Persona-conditioned behaviors are stable across runs.
Latent trait scores predict external benchmarks.
MIRT reveals consistent latent behavioral structure.
Abstract
Persona conditioning is widely used to steer large language model (LLM) behavior, but it is unclear whether it induces stable behavioral structure or superficial variation. We propose a framework to measure consistent behavioral tendencies using situational judgment tests (SJTs), multidimensional item response theory (MIRT), and structured synthetic personas, treating responses as observations of latent behavioral variables. Across large-scale SJT and persona datasets, we find that persona-conditioned behaviors are stable across runs, latent trait scores predict external benchmarks (e.g., TruthfulQA, EmoBench), and MIRT reveals consistent latent structure. We validate these results through human annotation, benchmark evaluation, and internal consistency analyses. We interpret these traits not as human personality, but as stable behavioral tendencies expressed across contexts. Our…
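To make concrete what "treating responses as observations of latent behavioral variables" can mean, here is a minimal sketch of a compensatory multidimensional two-parameter logistic (M2PL) item response function. It is an illustration under stated assumptions, not the authors' implementation: it assumes each SJT response has been scored dichotomously (1 = the keyed option was chosen), and the names `m2pl_probability`, `log_likelihood`, `theta`, `a`, and `d` are hypothetical. The abstract does not say which MIRT variant the paper fits.

```python
import math


def m2pl_probability(theta, a, d):
    """Probability of a keyed response under a compensatory M2PL model.

    theta: latent trait scores for one persona (one value per dimension)
    a:     item discrimination (loading) on each dimension
    d:     item intercept (higher means the keyed option is chosen more often)
    All three names are illustrative assumptions, not taken from the paper.
    """
    if len(theta) != len(a):
        raise ValueError("theta and a must have the same number of dimensions")
    logit = sum(a_k * t_k for a_k, t_k in zip(a, theta)) + d
    return 1.0 / (1.0 + math.exp(-logit))


def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response vector given trait scores.

    items:     list of (a, d) pairs, one per SJT item
    responses: list of 0/1 scores aligned with items
    """
    total = 0.0
    for (a, d), y in zip(items, responses):
        p = m2pl_probability(theta, a, d)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total
```

In this formulation the dimensions are compensatory: a low score on one latent trait can be offset by a high score on another, because only the weighted sum enters the logistic function. Estimating `theta` for a persona then amounts to maximizing `log_likelihood` over its observed responses.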
