Measuring Lexical Diversity of Synthetic Data Generated through Fine-Grained Persona Prompting

Gauri Kambhatla; Chantal Shaib; Venkata Govindarajan

arXiv:2505.17390 · cs.CL · September 22, 2025

TL;DR

This paper evaluates the lexical diversity of synthetic data generated with persona prompts. Persona prompting increases diversity relative to prompting without personas, but fine-grained persona details add little beyond that.

Contribution

It applies a suite of lexical diversity and redundancy metrics to persona-driven synthetic prompts and responses, and compares the effects of persona granularity (fine-grained versus coarse descriptions) and model size on diversity.

Findings

01. Synthetic prompts are less diverse than human prompts.

02. Persona prompting increases lexical diversity, especially in larger models.

03. Fine-grained persona details add minimal diversity gains.

Abstract

Fine-grained personas have recently been used for generating 'diverse' synthetic data for pre-training and supervised fine-tuning of Large Language Models (LLMs). In this work, we measure the diversity of persona-driven synthetically generated prompts and responses with a suite of lexical diversity and redundancy metrics. First, we find that synthetic prompts/instructions are significantly less diverse than human-written ones. Next, we sample responses from LLMs of different sizes with fine-grained and coarse persona descriptions to investigate how much fine-grained detail in persona descriptions contributes to generated text diversity. Our results indicate that persona prompting produces higher lexical diversity than prompting without personas, particularly in larger models. In contrast, adding fine-grained persona details yields minimal gains in diversity compared to simply specifying…
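
The abstract does not list the individual metrics in its suite, so the following is only a minimal sketch of two measures commonly used for this kind of analysis: a distinct n-gram ratio (a lexical diversity measure) and a gzip compression ratio (a redundancy measure). The function names `ngram_diversity` and `compression_ratio`, the whitespace tokenization, and the choice of gzip are assumptions made for illustration, not the paper's implementation.

```python
import gzip


def ngram_diversity(texts, n=2):
    """Ratio of unique n-grams to total n-grams, pooled over all texts.

    Assumption: simple lowercased whitespace tokenization. Values near 1.0
    mean almost every n-gram is new; values near 0.0 mean heavy repetition.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


def compression_ratio(texts):
    """Raw byte size divided by gzip-compressed size of the joined texts.

    Assumption: gzip as the compressor. A higher ratio means the collection
    is more compressible, i.e. more redundant and less diverse.
    """
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))


if __name__ == "__main__":
    # Hypothetical stand-ins for two sets of generated responses.
    repetitive = ["the cat sat on the mat"] * 20
    varied = [
        "a violinist rehearses before dawn",
        "tax forms confuse most freelancers",
        "glaciers carve valleys over millennia",
    ]
    for name, sample in [("repetitive", repetitive), ("varied", varied)]:
        print(name, ngram_diversity(sample), compression_ratio(sample))
```

Under these assumptions, comparing responses generated with no persona, a coarse persona, and a fine-grained persona would amount to computing both scores on each set: lower compression ratio and higher n-gram diversity would indicate a more lexically diverse set.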
