When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation

Jinlong Liu; Mohammed Bahja; Mark Lee

arXiv:2605.20364·cs.CL·May 21, 2026

TL;DR

This paper constructs a large dataset for long-form literary review generation based on TTCW scores, revealing that reasoning supervision may hinder performance in fixed-format rubric-based review tasks.

Contribution

It introduces a dataset of 263,911 long-form stories with TTCW-based annotations and evaluates Qwen3 models (4B and 8B) fine-tuned with and without reasoning content, finding that reasoning supervision offers limited benefit.

Findings

01 Models fine-tuned without reasoning content outperform reasoning-supervised models.

02 Reasoning supervision increases parse failures and irrelevant outputs.

03 Fixed-format review generation remains challenging even after fine-tuning.
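
To make the parse-failure finding concrete, the sketch below shows one way such a rate could be measured for fixed-format output. It is illustrative only: the line format ("name: score | comment"), the placeholder dimension names, and the helpers parse_review and parse_failure_rate are assumptions made for this example, not the paper's actual output schema or evaluation code.

```python
import re

# Hypothetical placeholder names: the real labels of the 14 TTCW-based
# dimensions are not listed on this page.
DIMENSIONS = [f"dimension_{i:02d}" for i in range(1, 15)]

# Assumed line format (not the paper's): "<dimension>: <score> | <comment>"
LINE_RE = re.compile(
    r"^(?P<name>[\w ]+):\s*(?P<score>\d+(?:\.\d+)?)\s*\|\s*(?P<comment>.+)$"
)


def parse_review(text):
    """Return {dimension: (score, comment)}, or None if the output is malformed."""
    parsed = {}
    for line in text.strip().splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            # Any line outside the fixed format (e.g. leaked reasoning text)
            # counts as a parse failure.
            return None
        parsed[match["name"].strip()] = (
            float(match["score"]),
            match["comment"].strip(),
        )
    # Every expected dimension must appear, and no unexpected ones.
    if set(parsed) != set(DIMENSIONS):
        return None
    return parsed


def parse_failure_rate(outputs):
    """Fraction of model outputs that do not parse as a complete review."""
    if not outputs:
        return 0.0
    failures = sum(1 for out in outputs if parse_review(out) is None)
    return failures / len(outputs)
```

Under these assumptions, an output that prepends free-form reasoning to an otherwise valid review is rejected as a whole, which is one plausible way reasoning content could raise the failure rate in a strict fixed-format setting.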

Abstract

Automatic evaluation of long-form literary writing remains challenging, as generic LLM-as-Judge approaches may not fully capture creativity-related dimensions such as originality and flexibility. Although the Torrance Test of Creative Writing (TTCW) provides a structured creativity framework, and prior work has demonstrated reference-based TTCW evaluation at the pairwise level, no large-scale dataset exists for long-form TTCW-based literary review generation. We address this gap by constructing a dataset of 263,911 long-form stories, each annotated with scalar scores and meta-synthesised review comments across 14 TTCW-based dimensions. Using this dataset, we fine-tune Qwen3 models at two scales, 4B and 8B, under two conditions: with and without reasoning content. Results show that non-reasoning fine-tuning achieves stronger and more stable performance, with the best setting reaching an…

Code & Models

Repositories

vince-liuss/TTCW-based-Review (GitHub)

Datasets

VibrantVista/TTCW-Based-Review