# CAPE: Context-Aware Personality Evaluation Framework for Large Language Models

**Authors:** Jivnesh Sandhan, Fei Cheng, Tushar Sandhan, Yugo Murawaki

arXiv: 2508.20385 · 2025-08-29

## TL;DR

This paper introduces CAPE, a framework for evaluating LLMs' personalities considering conversational context, revealing how history influences response consistency and personality shifts across different models.

## Contribution

It is the first to incorporate conversational history into personality evaluation of LLMs, introducing novel metrics and analyzing context effects on model responses.

## Key findings

- Conversational history improves response consistency in LLMs.
- GPT models show robustness to question order but exhibit personality shifts.
- Some models heavily depend on prior interactions for response generation.

## Abstract

Psychometric tests, traditionally used to assess humans, are now being applied to Large Language Models (LLMs) to evaluate their behavioral traits. However, existing studies follow a context-free approach, answering each question in isolation to avoid contextual influence. We term this the Disney World test, an artificial setting that ignores real-world applications, where conversational history shapes responses. To bridge this gap, we propose the first Context-Aware Personality Evaluation (CAPE) framework for LLMs, incorporating prior conversational interactions. To thoroughly analyze the influence of context, we introduce novel metrics to quantify the consistency of LLM responses, a fundamental trait in human behavior.   Our exhaustive experiments on 7 LLMs reveal that conversational history enhances response consistency via in-context learning but also induces personality shifts, with GPT-3.5-Turbo and GPT-4-Turbo exhibiting extreme deviations. While GPT models are robust to question ordering, Gemini-1.5-Flash and Llama-8B display significant sensitivity. Moreover, GPT models response stem from their intrinsic personality traits as well as prior interactions, whereas Gemini-1.5-Flash and Llama--8B heavily depend on prior interactions. Finally, applying our framework to Role Playing Agents (RPAs) shows context-dependent personality shifts improve response consistency and better align with human judgments. Our code and datasets are publicly available at: https://github.com/jivnesh/CAPE

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20385/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20385/full.md

## References

46 references — full list in the complete paper: https://tomesphere.com/paper/2508.20385/full.md

---
Source: https://tomesphere.com/paper/2508.20385