When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation
Jinlong Liu, Mohammed Bahja, Mark Lee

TL;DR
This paper constructs a large dataset for long-form literary review generation based on TTCW scores, revealing that reasoning supervision may hinder performance in fixed-format rubric-based review tasks.
Contribution
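It introduces a new dataset of 263,911 long-form stories, each annotated with scalar scores and review comments across 14 TTCW-based dimensions, and evaluates Qwen3 models (4B and 8B) fine-tuned with and without reasoning content, highlighting the limited benefits of reasoning supervision.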
Findings
Non-reasoning fine-tuning outperforms reasoning-supervised fine-tuning and is more stable.
Reasoning supervision increases parse failures and irrelevant outputs.
Fixed-format review generation remains challenging despite fine-tuning.
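As a rough illustration of what a "parse failure" could mean for a fixed-format rubric review, the sketch below checks model outputs against a made-up line format and reports the share that fail. The paper's actual output format, dimension names, and parser are not given here, so the `name: score | comment` layout, the placeholder `DIMENSIONS` list, and the helpers `parse_review` and `parse_failure_rate` are all assumptions for illustration only.

```python
import re

# Placeholder names: the source only states that there are 14 TTCW-based dimensions.
DIMENSIONS = [f"dimension_{i}" for i in range(1, 15)]

# Assumed (hypothetical) line format: "<name>: <score> | <comment>"
LINE_RE = re.compile(
    r"^(?P<name>[\w ]+):\s*(?P<score>\d+(?:\.\d+)?)\s*\|\s*(?P<comment>.+)$"
)


def parse_review(text):
    """Return {name: (score, comment)}, or None if the text breaks the assumed format."""
    parsed = {}
    for line in text.strip().splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            # Any line outside the format (e.g. free-form reasoning) is a failure.
            return None
        name = match.group("name").strip()
        parsed[name] = (float(match.group("score")), match.group("comment").strip())
    # Every expected dimension must appear, and no unexpected ones.
    if set(parsed) != set(DIMENSIONS):
        return None
    return parsed


def parse_failure_rate(outputs):
    """Fraction of outputs that cannot be parsed under the assumed format."""
    if not outputs:
        return 0.0
    failures = sum(1 for output in outputs if parse_review(output) is None)
    return failures / len(outputs)


well_formed = "\n".join(f"{name}: 3 | placeholder comment" for name in DIMENSIONS)
with_preamble = "Let me think about this story first.\n" + well_formed
rate = parse_failure_rate([well_formed, with_preamble])
```

Under this strict parser, an output that prepends free-form text to an otherwise valid review counts as a failure, which is one plausible way reasoning-style generations could raise the parse-failure rate.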
Abstract
Automatic evaluation of long-form literary writing remains challenging, as generic LLM-as-Judge approaches may not fully capture creativity-related dimensions such as originality and flexibility. Although the Torrance Test of Creative Writing (TTCW) provides a structured creativity framework, and prior work has demonstrated reference-based TTCW evaluation at the pairwise level, no large-scale dataset exists for long-form TTCW-based literary review generation. We address this gap by constructing a dataset of 263,911 long-form stories, each annotated with scalar scores and meta-synthesised review comments across 14 TTCW-based dimensions. Using this dataset, we fine-tune Qwen3 models at two scales, 4B and 8B, under two conditions: with and without reasoning content. Results show that non-reasoning fine-tuning achieves stronger and more stable performance, with the best setting reaching an…
