Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator
Qian Cao, Xiting Wang, Yuzhuo Yuan, Yahui Liu, Fang Luo, Ruihua Song

TL;DR
This paper introduces CreataSet, a large dataset for evaluating textual creativity, and CrEval, an LLM-based evaluator that aligns well with human judgments, improving creativity assessment across diverse domains.
Contribution
It presents a novel pairwise-comparison framework and a large-scale dataset, CreataSet, to train a more accurate LLM-based creativity evaluator, CrEval.
Findings
CrEval outperforms existing methods in aligning with human judgments.
Training on both human and synthetic data enhances evaluator robustness.
CrEval effectively boosts LLMs' creative output.
Abstract
Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, we propose a novel pairwise-comparison framework for assessing textual creativity that leverages shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment…
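As a rough illustration of the pairwise-comparison setup described in the abstract, the sketch below pairs every two responses written for the same instruction, so that each comparison shares its context. The names `PairwiseExample` and `build_pairs` are hypothetical, and the exhaustive pairing is an assumption made for illustration; the summary here does not state how the authors actually construct or label pairs.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List


@dataclass(frozen=True)
class PairwiseExample:
    """One comparison: two responses judged under the same instruction."""
    instruction: str
    response_a: str
    response_b: str


def build_pairs(instruction: str, responses: List[str]) -> List[PairwiseExample]:
    """Pair every two responses that answer the same instruction.

    Assumption: all unordered pairs are used, so k responses give
    k * (k - 1) / 2 comparisons. The paper may sample pairs differently.
    """
    return [
        PairwiseExample(instruction, a, b)
        for a, b in combinations(responses, 2)
    ]


# Hypothetical example data, not taken from CreataSet.
pairs = build_pairs(
    "Write a metaphor for time.",
    ["Time is a river.", "Time is a thief in soft shoes.", "Time passes."],
)
```

Because both responses in a pair answer the same instruction, an evaluator only has to decide which of the two is more creative in that context, rather than score texts written for unrelated tasks.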
Peer Reviews
Decision: ICLR 2026 Poster
The experiments are fine. CrEval consistently outperforms strong baselines including large proprietary models across proposed metrics. The paper includes appropriate ablations, along with OOD tests on external datasets. The authors further show CrEval can be used to improve model creativity
- The paper mentions constructing tuples (I, R1, ..., Rk) but does not specify the exact value of k used in experiments. How many responses are generated and used per instruction? Does this vary across data sources?
- In constructing CreataSet-Ext, they prompt two models to generate more responses for augmenting each instruction. But there is no testing of whether these k responses actually exhibit meaningful diversity. If these responses are similar to each other, it could limit the model’s ab
- A reasonably well-curated dataset with a good mix of topics for coverage.
- A good amount of human labels, put to good use in training evaluator models, with good comparisons against other metrics and a very extensive model lineup. I appreciate the authors showing many proprietary results.
- Evaluation of the trained evaluator is complete and convincing.
- The definition of creativity, which is subjective, should be laid out in more detail in this work. This is a key bottleneck for the work's quality and rigor.
- CrEval judges comparison pairs for creativity. However, there might be some value in an absolute scale of creativity, especially if we want to rank model responses quickly. The motivation here is less clear.
- This work suffers from a few overstatements:
  - The paper prides itself on context awareness (i.e., showing a prompt when evaluating respon
1) Large-scale and multi-domain dataset. CreataSet includes 100K+ human-level and 1M+ synthetic creative instruction-response pairs across 87 domains, which is promising in providing a scalable foundation for studying creative generation and evaluation.
2) Improved human label protocol. The proposed context-aware pairwise comparison protocol improves inter-annotator consistency (evaluated by ICC).
3) Comprehensive experiments. Multiple metrics are applied to provide a thorough evaluation, such
1. The rules for quantifying creativity are not differentiated across domains. For example, creativity is manifested differently in poetry and scientific writing. Future work should further differentiate the measurement of creativity for each domain.
2. Insufficient example/failure case analysis. The paper presents overall statistics but does not systematically list typical examples of discrepancies between CrEval and human behavior.
3. Insufficient generalization analysis. The paper la
