Counterfactual LLM-based Framework for Measuring Rhetorical Style

Jingyi Qiu; Hong Chen; Zongyi Li

arXiv:2512.19908·cs.CL·December 24, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a novel LLM-based counterfactual framework to quantify rhetorical style in scientific papers, revealing its impact on attention and the influence of LLM writing assistance.

Contribution

It presents a new method to measure rhetorical style independently of content using multiple LLM personas and pairwise evaluations, validated on thousands of ML papers.

Findings

01

Rhetorical strength predicts citations and media attention.

02

Rhetorical style increased sharply after 2023.

03

LLM-based writing aid largely drives the rise in rhetorical strength.

Abstract

The rise of AI has fueled growing concerns about "hype" in machine learning papers, yet a reliable way to quantify rhetorical style independently of substantive content has remained elusive. Because bold language can stem from either strong empirical results or mere rhetorical style, it is often difficult to distinguish between the two. To disentangle rhetorical style from substantive content, we introduce a counterfactual, LLM-based framework: multiple LLM rhetorical personas generate counterfactual writings from the same substantive content, an LLM judge compares them through pairwise evaluations, and the outcomes are aggregated using a Bradley-Terry model. Applying this method to 8,485 ICLR submissions sampled from 2017 to 2025, we generate more than 250,000 counterfactual writings and provide a large-scale quantification of rhetorical style in ML papers. We find that visionary…
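The aggregation step named in the abstract can be illustrated with a minimal sketch of Bradley-Terry fitting from pairwise judge outcomes. This is not the authors' implementation: the function name `fit_bradley_terry`, the persona labels, and the toy outcomes are all hypothetical, and the fitting routine is the standard minorization-maximization (MM) update, chosen here only because it needs no external libraries.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=200, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict mapping each item to a strength, normalized to sum to 1.
    Under the model, P(i beats j) = strength[i] / (strength[i] + strength[j]).
    """
    wins = defaultdict(float)        # total wins per item
    n_compared = defaultdict(float)  # number of comparisons per unordered pair
    for winner, loser in comparisons:
        if winner == loser:
            continue  # skip degenerate self-comparisons
        wins[winner] += 1.0
        n_compared[frozenset((winner, loser))] += 1.0

    items = sorted({i for pair in n_compared for i in pair})
    if not items:
        return {}
    strength = {i: 1.0 / len(items) for i in items}

    for _ in range(n_iter):
        updated = {}
        for i in items:
            # MM update: wins_i / sum_j n_ij / (p_i + p_j)
            denom = 0.0
            for pair, n in n_compared.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        updated = {i: v / total for i, v in updated.items()}
        delta = max(abs(updated[i] - strength[i]) for i in items)
        strength = updated
        if delta < tol:
            break
    return strength


# Toy judge outcomes over three hypothetical persona labels (made-up data).
toy_outcomes = (
    [("persona_a", "persona_b")] * 2 + [("persona_b", "persona_a")]
    + [("persona_b", "persona_c")] * 2 + [("persona_c", "persona_b")]
    + [("persona_a", "persona_c")] * 2 + [("persona_c", "persona_a")]
)
scores = fit_bradley_terry(toy_outcomes)
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.3f}")
```

In this sketch an item that never wins receives strength 0, so a practical pipeline would likely add regularization or work in log-strengths; how the paper handles such cases is not stated in the abstract.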

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Systematic triplet generation strategies: Comprehensive exploration of four different triplet curation approaches (SR-k4, SR-k10, SR-k4-UN, SR-k4+LLM) with ablation studies demonstrating their relative effectiveness
- Paragraph-level contextual data: Extension beyond sentence-level datasets provides richer narrative contexts for intersectional bias detection, potentially capturing more realistic bias manifestations
- Cross-domain transfer analysis: Demonstrates that models trained on Indic-Int

Weaknesses

- Positioning relative to existing work: The relationship to HInter (2025) [1], Ma et al. (2023) [2], and other intersectional bias detection frameworks requires clarification
- Limited comparison with BiasAlert: BiasAlert (2024) uses retrieval-based detection with Contriever encoders [3]; the specific advantages of training the retriever versus BiasAlert's RAG approach need more detailed empirical comparison
- Dataset scale: The 7,404 paragraphs represent a smaller scale compared to WinoIdentit

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. The proposed framework addresses the entanglement of substantive content with rhetorical style and overcomes the biased measurement of rhetorical style in prior work.
2. With multi-persona counterfactual generation and Bradley-Terry scoring, the proposed framework yields high-quality and less biased estimates of rhetorical scores than prior approaches.
3. The analysis of the predictive power of rhetorical scores on peer-review scores and downstream impact/attentio

Weaknesses

1. The single-dimension formulation of rhetorical strength may be an oversimplification. For instance, rhetorical strength might be multi-faceted: a paper might argue for the significant generalizability of its contributions while placing less emphasis on novelty or impact. I am thus concerned about whether a single-dimension rhetorical strength can capture such variability.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

I found the correlation between the rhetorical score and the “popularity” of a paper an interesting result, which may show how humans are biased by the presentation style.

Weaknesses

I don’t see any serious weaknesses. Wondering what the practical application and implications are. Shall we all adopt a writing style that results in a high rhetorical score? :) Also not clear at all why the method is called “counterfactual”.

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Persona Design and Applications · Topic Modeling