HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing

Andrew Zhuoer Feng; Cunxiang Wang; Yu Luo; Lin Fan; Yilin Zhou; Zikang Wang; Xiaotao Gu; Jie Tang; Hongning Wang; Minlie Huang

arXiv:2604.19071·cs.CL·April 22, 2026

TL;DR

This paper introduces Tree-of-Writing (ToW), a hierarchical method for evaluating the writing of large language models, and presents HowToBench, a comprehensive Chinese writing benchmark. It shows that ToW achieves superior correlation with human judgments and is robust to textual disturbances.

Contribution

The paper proposes ToW, a tree-structured evaluation framework, and introduces HowToBench, a large-scale Chinese writing benchmark for assessing LLMs across multiple genres and tasks.

Findings

01

ToW achieves 0.93 Pearson correlation with human judgments.

02

ToW is robust to textual disturbances that affect other metrics.

03

Input length negatively correlates with content scores in the Guide task.
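For readers unfamiliar with the statistic in finding 01, the following is a minimal sketch of how a Pearson correlation between metric scores and human ratings could be computed. The function name `pearson` and the sample lists are illustrative assumptions, not data or code from the paper.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for illustration only (not results from the paper).
metric_scores = [6.5, 7.0, 8.2, 5.1, 9.0]
human_scores = [6.0, 7.5, 8.0, 5.0, 9.5]
r = pearson(metric_scores, human_scores)
```

A value near 1 means the automatic scores rank and scale the texts much as the human raters do; a value near 0 means there is no linear relationship.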

Abstract

Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. LLM performance in thousand-word-level and open-ended writing is inadequately assessed by traditional reference-based metrics or modern LLM-as-a-judge methods. We propose Tree-of-Writing (ToW) to resolve the implicit inconsistency often found when an LLM-as-a-judge aggregates all sub-features in text evaluation. ToW incorporates a tree-structured workflow that explicitly models the aggregation weights of sub-features. We also present HowToBench, a large-scale Chinese writing benchmark encompassing 12 genres and 1302 instructions across three task categories: contextual completion, outline-guided writing, and open-ended generation. ToW successfully mitigates the biases, achieving a 0.93 Pearson…
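The idea of explicitly modeling aggregation weights over a tree of sub-features can be sketched as a weighted average computed bottom-up. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class, the `aggregate` function, the feature names, the weights, and the leaf scores are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One feature in a hypothetical evaluation tree."""
    name: str
    weight: float = 1.0             # weight relative to sibling nodes (assumed)
    score: Optional[float] = None   # set on leaves only, e.g. by a judge model
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node) -> float:
    """Return the node's score: leaf score, or weighted mean of its children."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf {node.name!r} has no score")
        return node.score
    total_weight = sum(child.weight for child in node.children)
    if total_weight <= 0:
        raise ValueError(f"children of {node.name!r} need positive total weight")
    weighted_sum = sum(child.weight * aggregate(child) for child in node.children)
    return weighted_sum / total_weight


# Hypothetical tree: feature names, weights, and scores are made up.
root = Node("overall", children=[
    Node("content", weight=0.6, children=[
        Node("relevance", weight=0.5, score=8.0),
        Node("depth", weight=0.5, score=6.0),
    ]),
    Node("language", weight=0.4, score=9.0),
])
overall = aggregate(root)  # 0.6 * 7.0 + 0.4 * 9.0, approximately 7.8
```

Making the weights explicit means the final score can be traced back to each sub-feature, rather than depending on how a single judge prompt implicitly combines them.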
