Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments

Izabella Krzeminska; Michal Butkiewicz; Ewa Komkowska

arXiv:2605.15734 · cs.AI · May 18, 2026

TL;DR

This paper empirically evaluates the psychometric reliability of AI-derived user state metrics from large language models, highlighting stability issues at the individual score level and proposing a framework for their validation in adaptive systems.

Contribution

The paper introduces a replicable evaluation framework for assessing the reliability of LLM-derived user state metrics, and argues that such validation is a prerequisite for responsible AI system design.

Findings

01. Only 31 of 213 metrics met reliability criteria.

02. Unstable metrics can still be useful in post-hoc analyses.

03. Reliability issues challenge real-time interpretation of user states.

Abstract

The use of large language models to assess user states in conversational and adaptive systems is based on the assumption that the metrics used for such assessment are stable and interpretable at the level of individual scores. This paper empirically tests this assumption, focusing on the psychometric reliability of artificial intelligence (AI) measures of user states. This study employed replication evaluation procedures to assess the repeatability of a broad set of metrics across three different bimodal large language models (GPT-4o audio, Gemini 2.0 Flash, Gemini 2.5 Flash). Analyses include both individual score reliability and aggregated reliability, allowing us to distinguish metrics potentially useful for real-time adaptation from those that retain their value only in aggregated analyses. The results demonstrate that metric reliability cannot be considered a default property…
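The distinction the abstract draws between individual score reliability and aggregated reliability can be illustrated with a small sketch. The abstract does not say which statistics the authors used, so everything below is an assumption made for illustration: the one-way random-effects intraclass correlation, ICC(1,1) for a single score and ICC(1,k) for the mean of k replications, stands in for the paper's reliability criteria, and the function name `one_way_icc` and the sample scores are hypothetical.

```python
from statistics import mean


def one_way_icc(scores):
    """Return (ICC(1,1), ICC(1,k)) for a table of replicated scores.

    scores[i][j] is the j-th replicated score that a model assigned to
    item i (for example, the same utterance scored k times).
    NOTE: the one-way random-effects ICC is an assumed stand-in; the
    source abstract does not name the statistic the authors used.
    """
    n = len(scores)
    k = len(scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 items, each with the same k >= 2 scores")

    row_means = [mean(row) for row in scores]
    grand_mean = mean(row_means)  # equals the overall mean in a balanced design

    # Between-item and within-item mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))
    if ms_between == 0:
        raise ValueError("items do not differ; ICC is undefined")

    # Reliability of one individual score.
    icc_single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    # Reliability of the mean of k replicated scores.
    icc_average = (ms_between - ms_within) / ms_between
    return icc_single, icc_average


# Hypothetical data: 3 items, each scored 3 times by the same model.
example_scores = [[1, 2, 1], [3, 3, 4], [5, 4, 5]]
single, average = one_way_icc(example_scores)
print(f"single-score ICC: {single:.3f}, averaged ICC: {average:.3f}")
```

When the single-score coefficient is positive, the averaged coefficient is at least as large, which is one way a metric can fall short for real-time, per-score interpretation while remaining usable in aggregated or post-hoc analyses.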
