Thinking Outside the (Gray) Box: A Context-Based Score for Assessing Value and Originality in Neural Text Generation

Giorgio Franceschelli; Mirco Musolesi

arXiv:2502.13207·cs.CL·September 26, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a context-based scoring method rooted in information theory to evaluate and enhance the value and originality of neural text generation, balancing diversity and quality.

Contribution

It proposes a novel score for assessing and incentivizing originality in language models, and demonstrates its effectiveness in fine-tuning models for creative tasks.

Findings

01. Improves diversity without sacrificing quality in generated text

02. Enhances originality in poetry and math problem solving

03. Effective as a reward signal in reinforcement learning

Abstract

Despite the increasing use of large language models for creative tasks, their outputs often lack diversity. Common solutions, such as sampling at higher temperatures, can compromise the quality of the results. Dealing with this trade-off is still an open challenge in designing AI systems for creativity. Drawing on information theory, we propose a context-based score to quantitatively evaluate value and originality. This score incentivizes accuracy and adherence to the request while fostering divergence from the learned distribution. We show that our score can be used as a reward in a reinforcement learning framework to fine-tune large language models for maximum performance. We validate our strategy through experiments considering a variety of creative tasks, such as poetry generation and math problem solving, demonstrating that it enhances the value and originality of the generated…
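
To make the idea in the abstract concrete, here is a minimal sketch of how such a score might be assembled from token log-probabilities. It is not the paper's exact definition: the abstract above does not state the formula, so the length normalization, the weights `lambda_v` and `lambda_o`, and the function names are all assumptions made for illustration. The only things taken from this page are that a value term is tied to log p(x|y), where x is the request and y the output, and that originality rewards divergence from the learned distribution p(y|x).

```python
from typing import Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability (the normalization is an assumption)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(token_logprobs) / len(token_logprobs)


def context_score(
    logp_x_given_y: Sequence[float],
    logp_y_given_x: Sequence[float],
    lambda_v: float = 1.0,  # hypothetical weight on the value term
    lambda_o: float = 1.0,  # hypothetical weight on the originality term
) -> float:
    """Illustrative value-plus-originality score.

    value:       high when the request x is likely given the output y
    originality: high when the output y is unlikely given the request x
    """
    value = mean_logprob(logp_x_given_y)
    originality = -mean_logprob(logp_y_given_x)
    return lambda_v * value + lambda_o * originality


# Two outputs that explain the request equally well; the second is less
# probable under the model, so it receives the higher score.
common = context_score([-1.0, -1.0], [-0.5, -0.5])
unusual = context_score([-1.0, -1.0], [-2.0, -4.0])
print(common, unusual)
```

With both terms present, an output scores well only if it stays tied to the request while being improbable under the model. Raising the sampling temperature alone would increase the second term but tends to lower the first.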

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper is well written and easy to understand.
2. Comprehensive empirical validation across varied tasks.

Weaknesses

1. The concepts of value and originality are not theoretically defined, and the use of P(x|y) and P(y|x) is unjustified. This also leads to the second issue.
2. Using a reward model as a proxy is known to induce reward hacking: the reward score can be optimized to a higher level without necessarily improving the quality it stands in for. The evaluation metrics (EAD, T-LCS, SBERT, etc.) might suffer from a similar problem, following Goodhart's law.

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- Clear motivation: addressing the diversity–quality trade-off in creative LLM generation.
- Application of CoVO within reinforcement learning is technically sound and compatible with modern LLM fine-tuning frameworks (e.g., GRPO).
- The experiments span both creative (poetry) and analytic (math reasoning) domains, suggesting some generality.

Weaknesses

- Lack of comparative evaluation. The paper does not benchmark CoVO against existing novelty/diversity metrics (e.g., Diversity Is All You Need, intrinsic rewards, novelty search).
- All experiments rely on the same foundation model family (Llama). Experiments on a different family (e.g., Qwen) would be useful.
- Qualitative analysis or examples demonstrating that outputs are actually more creative or original would be useful to see.
- Reported improvements are small and no

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The tension between quality and diversity in LLM generation is important, and balancing value with originality is a meaningful objective.
- Experiments span multiple domains. Testing on poetry, mathematics, and NoveltyBench shows effort to validate across different creative tasks.
- The proposed method is simple. Meanwhile, the paper provides detailed implementation guidance, e.g., how to compute the score with autoregressive models and integrate it with GRPO.
- The release of GutenVerse dat

Weaknesses

- My main concern is that this paper compares with no existing work on improving creativity of language models, such as DivPO, DRA-GRPO, and DARLING.
- The leap from mutual information to "creativity" lacks rigorous justification. I don't get the claim that log p(x|y) measures "value" - if y is relevant to x but y itself is ungrammatical/low-quality, won't log p(x|y) still be high?
- Computing p(x|y) for autoregressive models requires a workaround (adding prompt q to make y' = y + q) and is not
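
The workaround Reviewer 03 refers to, appending a prompt q to the output y so that an autoregressive model can score the request x, might look roughly like the sketch below. Everything beyond the concatenation y' = y + q is an assumption: the scorer signature `LogProbFn`, the mean over tokens, the wording of q, and the toy scorer that stands in for a real language model.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer signature: per-token log-probabilities of `continuation`
# given `context`. A real setup would wrap a language model here.
LogProbFn = Callable[[str, str], Sequence[float]]


def reverse_mean_logprob(x: str, y: str, q: str, logprob_fn: LogProbFn) -> float:
    """Approximate log p(x | y) by scoring x after the context y' = y + q."""
    y_prime = y + q  # the concatenation named in the review
    token_logprobs = logprob_fn(y_prime, x)
    if not token_logprobs:
        raise ValueError("the scorer returned no token log-probabilities")
    # Averaging over tokens is an assumption, not taken from the source.
    return sum(token_logprobs) / len(token_logprobs)


def toy_scorer(context: str, continuation: str) -> List[float]:
    """Stand-in scorer (assumption): words already seen in the context are likelier."""
    seen = set(context.lower().split())
    return [-0.5 if tok.lower() in seen else -3.0 for tok in continuation.split()]


q = "\nThe request was: "  # hypothetical wording of the appended prompt
relevant = reverse_mean_logprob(
    "a poem about rain", "rain falls on the quiet roof", q, toy_scorer
)
unrelated = reverse_mean_logprob("a poem about rain", "numbers add up", q, toy_scorer)
print(relevant, unrelated)
```

The sketch also shows why the reviewer's question matters: the toy scorer rewards any output that mentions the request's words, whether or not that output is well formed.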

Taxonomy

Topics: Topic Modeling