Measure what Matters: Psychometric Evaluation of AI with Situational Judgment Tests

Alexandra Yost; Shreyans Jain; Shivam Raval; Grant Corser; Allen Roush; Nina Xu; Jacqueline Hammack; Ravid Shwartz-Ziv; Amirali Abdullah

arXiv:2510.22170 · cs.AI · May 12, 2026

TL;DR

This paper introduces a psychometric framework using situational judgment tests and multidimensional item response theory to evaluate the stability and structure of AI behaviors conditioned by personas.

Contribution

The paper presents a novel approach for measuring stable behavioral tendencies in AI models, one that distinguishes them from superficial variation and goes beyond classical self-report methods.

Findings

01. Persona-conditioned behaviors are stable across runs.

02. Latent trait scores predict external benchmarks.

03. MIRT reveals consistent latent behavioral structure.

Abstract

Persona conditioning is widely used to steer large language model (LLM) behavior, but it is unclear whether it induces stable behavioral structure or superficial variation. We propose a framework to measure consistent behavioral tendencies using situational judgment tests (SJTs), multidimensional item response theory (MIRT), and structured synthetic personas, treating responses as observations of latent behavioral variables. Across large-scale SJT and persona datasets, we find that persona-conditioned behaviors are stable across runs, latent trait scores predict external benchmarks (e.g., TruthfulQA, EmoBench), and MIRT reveals consistent latent structure. We validate these results through human annotation, benchmark evaluation, and internal consistency analyses. We interpret these traits not as human personality, but as stable behavioral tendencies expressed across contexts. Our…
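To make the phrase "treating responses as observations of latent behavioral variables" concrete, here is a minimal sketch of a textbook compensatory multidimensional two-parameter logistic (M2PL) item response function. The abstract does not state which MIRT parameterization the authors fit, so the dichotomous response format, the function names `m2pl_probability` and `log_likelihood`, and all parameter values below are illustrative assumptions rather than the paper's method.

```python
import math
from typing import Sequence, Tuple

# Hypothetical helper names; the M2PL form itself is the standard textbook model,
# not necessarily the parameterization used in the paper.

def m2pl_probability(theta: Sequence[float], a: Sequence[float], d: float) -> float:
    """Probability of endorsing an item given latent traits.

    theta: latent trait vector for one respondent (e.g., one persona-conditioned model)
    a:     item discrimination (loading) on each latent dimension
    d:     item intercept (easiness)
    """
    if len(theta) != len(a):
        raise ValueError("theta and a must have the same number of dimensions")
    logit = sum(a_k * t_k for a_k, t_k in zip(a, theta)) + d
    return 1.0 / (1.0 + math.exp(-logit))


def log_likelihood(
    theta: Sequence[float],
    items: Sequence[Tuple[Sequence[float], float]],
    responses: Sequence[int],
) -> float:
    """Log-likelihood of a 0/1 response pattern under the M2PL model.

    items:     one (a, d) pair per item
    responses: one 0/1 response per item, in the same order
    """
    if len(items) != len(responses):
        raise ValueError("items and responses must have the same length")
    total = 0.0
    for (a, d), y in zip(items, responses):
        p = m2pl_probability(theta, a, d)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


# Assumed two-dimensional example with made-up item parameters.
example_items = [([1.2, 0.5], 0.0), ([0.3, 1.0], -0.5)]
example_theta = [0.8, -0.2]
example_ll = log_likelihood(example_theta, example_items, [1, 0])
```

In this form each item loads on several latent dimensions at once, which is what allows a fitted model to expose structure across traits rather than a single overall score. Trait estimates for a respondent would then come from maximizing (or integrating over) this likelihood with the item parameters held fixed.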
