Can LLMs Infer Personality from Real World Conversations?

Jianfeng Zhu, Ruoming Jin, and Karin G. Coifman

arXiv:2507.14355 · cs.CL · July 22, 2025


TL;DR

This study evaluates how well state-of-the-art LLMs infer personality traits from real-world conversations. The models show high test-retest reliability but limited validity and accuracy as psychological assessment tools.

Contribution

Introduces a real-world benchmark of 555 semi-structured interviews paired with BFI-10 self-report scores for evaluating LLM-based personality inference, and uses it to expose current limitations and areas for improvement.

Findings

01 High test-retest reliability across models

02 Weak correlation with ground-truth personality scores

03 Limited construct validity and trait-level accuracy

Abstract

Large Language Models (LLMs) such as OpenAI's GPT-4 and Meta's LLaMA offer a promising approach for scalable personality assessment from open-ended language. However, inferring personality traits remains challenging, and earlier work often relied on synthetic data or social media text lacking psychometric validity. We introduce a real-world benchmark of 555 semi-structured interviews with BFI-10 self-report scores for evaluating LLM-based personality inference. Three state-of-the-art LLMs (GPT-4.1 Mini, Meta-LLaMA, and DeepSeek) were tested using zero-shot prompting for BFI-10 item prediction and both zero-shot and chain-of-thought prompting for Big Five trait inference. All models showed high test-retest reliability, but construct validity was limited: correlations with ground-truth scores were weak (max Pearson's $r = 0.27$), interrater agreement was low (Cohen's $\kappa < 0.10$), and…
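To make the two validity statistics named in the abstract concrete, the following is a minimal sketch of how Pearson's r and Cohen's κ could be computed between model predictions and self-report scores. It is not the authors' evaluation code: the function names `pearson_r` and `cohen_kappa` and the example ratings are hypothetical, and treating item ratings as unordered categorical labels for κ is an assumption, since the abstract does not say how scores were prepared for the agreement analysis.

```python
from collections import Counter
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two raters' categorical labels."""
    n = len(labels_a)
    if n != len(labels_b) or n == 0:
        raise ValueError("need two non-empty sequences of equal length")
    # Observed agreement: share of items where both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    # Hypothetical ratings for illustration only, not data from the paper.
    self_report = [4, 2, 5, 3, 1, 4, 2, 3]
    model_pred = [3, 3, 4, 3, 2, 3, 3, 3]
    print("Pearson r:", round(pearson_r(self_report, model_pred), 3))
    print("Cohen kappa:", round(cohen_kappa(self_report, model_pred), 3))
```

The two statistics answer different questions: r measures whether predictions rise and fall with the self-reports, while κ measures exact label agreement corrected for chance, so a model that compresses its predictions toward the middle of the scale can show a moderate r and still a κ near zero.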
