Can We Trust AI-Inferred User States? A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments
Izabella Krzeminska, Michal Butkiewicz, Ewa Komkowska

TL;DR
This paper empirically evaluates the psychometric reliability of AI-derived user state metrics from large language models, highlighting stability issues at the individual score level and proposing a framework for their validation in adaptive systems.
Contribution
It introduces a replicable evaluation framework for assessing the reliability of user state metrics from LLMs, emphasizing the importance of validation for responsible AI system design.
Findings
Only 31 of 213 metrics met reliability criteria.
Unstable metrics can still be useful in post-hoc analyses.
Reliability issues challenge real-time interpretation of user states.
Abstract
The use of large language models to assess user states in conversational and adaptive systems is based on the assumption that the metrics used for such assessment are stable and interpretable at the level of individual scores. This paper empirically tests this assumption, focusing on the psychometric reliability of artificial intelligence (AI) measures of user states. This study employed replication evaluation procedures to assess the repeatability of a broad set of metrics across three different bimodal large language models (GPT-4o audio, Gemini 2.0 Flash, Gemini 2.5 Flash). Analyses include both individual score reliability and aggregated reliability, allowing us to distinguish metrics potentially useful for real-time adaptation from those that retain their value only in aggregated analyses. The results demonstrate that metric reliability cannot be considered a default property…
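The distinction between individual-score reliability and aggregated reliability can be illustrated with a small sketch. The summary above does not say which coefficient the authors used, so this example assumes a one-way intraclass correlation (ICC) computed over repeated model runs on the same inputs. The function name `icc_oneway` and the toy scores are hypothetical and are not taken from the paper.

```python
from statistics import mean


def icc_oneway(scores):
    """One-way random-effects ICC for repeated runs (an assumed coefficient).

    scores: list of rows, one row per rated item (e.g. one recording),
    each row holding the k scores returned by k repeated model runs.
    Returns (single-score ICC, average-of-k ICC).
    Assumes the items differ from each other, so the between-item
    mean square is not zero.
    """
    n = len(scores)          # number of items
    k = len(scores[0])       # number of repeated runs per item
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]

    # Mean squares from a one-way ANOVA with items as the grouping factor.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (v - m) ** 2 for row, m in zip(scores, row_means) for v in row
    ) / (n * (k - 1))

    # Reliability of one run taken alone (relevant to real-time use).
    icc_single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    # Reliability of the mean of k runs (relevant to aggregated analyses).
    icc_average = (ms_between - ms_within) / ms_between
    return icc_single, icc_average


# Hypothetical scores: 3 items, each scored in 3 repeated runs.
toy_scores = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
single, average = icc_oneway(toy_scores)
print(round(single, 3), round(average, 3))
```

On this toy data the single-score coefficient is 0.4 while the averaged coefficient is about 0.667, which shows how a metric can be too unstable to interpret from one run yet still carry signal once runs are aggregated.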
