Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator

Qian Cao; Xiting Wang; Yuzhuo Yuan; Yahui Liu; Fang Luo; Ruihua Song

arXiv:2505.19236·cs.CL·January 30, 2026

TL;DR

This paper introduces CreataSet, a large dataset for evaluating textual creativity, and CrEval, an LLM-based evaluator that aligns well with human judgments, improving creativity assessment across diverse domains.

Contribution

It presents a novel pairwise-comparison framework and a large-scale dataset, CreataSet, to train a more accurate LLM-based creativity evaluator, CrEval.
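As a rough illustration of what a shared-instruction pairwise setup could look like, the sketch below expands one instruction and its candidate responses into labeled comparison pairs. It is not the authors' implementation: the `PairwiseExample` type, the `build_pairs` helper, and the numeric creativity scores used to derive labels are all hypothetical, introduced here only for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PairwiseExample:
    """One comparison: two responses judged under the same instruction."""
    instruction: str
    response_a: str
    response_b: str
    label: str  # "A" if response_a is judged more creative, else "B"


def build_pairs(instruction, scored_responses):
    """Expand (response, score) tuples into labeled pairwise examples.

    `scored_responses` is a list of (text, score) tuples. The scores are a
    hypothetical stand-in for whatever creativity signal labels the data.
    Ties are skipped because they carry no preference.
    """
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(scored_responses, 2):
        if score_a == score_b:
            continue  # no preference can be derived from a tie
        label = "A" if score_a > score_b else "B"
        pairs.append(PairwiseExample(instruction, text_a, text_b, label))
    return pairs


if __name__ == "__main__":
    examples = build_pairs(
        "Write a metaphor for time.",
        [
            ("Time is a river.", 2),
            ("Time is a thief in soft shoes.", 3),
            ("Time passes.", 1),
        ],
    )
    for ex in examples:
        print(ex.label, "|", ex.response_a, "vs", ex.response_b)
```

Because every pair carries the same instruction, an evaluator trained on such examples always compares responses in the same context, which is the property the framework relies on for consistency.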

Findings

01. CrEval outperforms existing methods in aligning with human judgments.

02. Training on both human and synthetic data enhances evaluator robustness.

03. CrEval effectively boosts LLMs' creative output.

Abstract

Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, we propose a novel pairwise-comparison framework for assessing textual creativity that leverages shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 5

Strengths

The experiments are fine. CrEval consistently outperforms strong baselines, including large proprietary models, across the proposed metrics. The paper includes appropriate ablations, along with OOD tests on external datasets. The authors further show that CrEval can be used to improve model creativity.

Weaknesses

- The paper mentions constructing tuples (I, R1, ..., Rk) but does not specify the exact value of k used in experiments. How many responses are generated and used per instruction? Does this vary across data sources?
- In constructing CreataSet-Ext, they prompt two models to generate more responses for augmenting each instruction. But there is no testing of whether these k responses actually exhibit meaningful diversity. If these responses are similar to each other, it could limit the model’s ab

Reviewer 02·Rating 4·Confidence 3

Strengths

- A reasonably well-curated dataset with a good mix of topics for coverage.
- A good number of human labels, put to good use in training evaluator models, with good comparisons against other metrics and a very extensive model lineup. I appreciate the authors showing results for many proprietary models.
- Evaluation of the trained evaluator is complete and convincing.

Weaknesses

- The definition of creativity, which is subjective, should be described in more detail. This is a key bottleneck for the quality and rigor of this work.
- CrEval operates on comparison pairs of creativity. However, there might be some value in an absolute scale of creativity, especially if we want to rank model responses quickly. The motivation here is less clear.
- This work suffers from a few overstatements:
  - The paper prides itself on context awareness (i.e., showing a prompt when evaluating respon

Reviewer 03·Rating 6·Confidence 3

Strengths

1) Large-scale and multi-domain dataset. CreataSet includes 100K+ human-level and 1M+ synthetic creative instruction-response pairs across 87 domains, which is promising as a scalable foundation for studying creative generation and evaluation.
2) Improved human labeling protocol. The proposed context-aware pairwise comparison protocol improves inter-annotator consistency (evaluated by ICC).
3) Comprehensive experiments. Multiple metrics are applied to provide a thorough evaluation, such

Weaknesses

1. The rules for quantifying creativity are not differentiated across different domains. For example, creativity is manifested differently in poetry and scientific writing. Future work should further differentiate the measurement of creativity for each domain.
2. Insufficient example/failure case analysis. The paper presents overall statistics but does not systematically list typical examples of discrepancies between CrEval and human behavior.
3. Insufficient generalization analysis. The paper la
