Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR

Sidi Chang; Peiying Zhu; Yuxiao Chen; Rongdong Chai

arXiv:2604.27374·cs.AI·May 1, 2026

TL;DR

This paper investigates how measurement risks, such as rubric wording and metric choice, affect the reliability of supervised financial NLP benchmarks, emphasizing the need for careful evaluation governance.

Contribution

It introduces a reporting discipline for financial NLP benchmarks, highlighting the impact of rubric and metric sensitivity on model evaluation.

Findings

01. Rubric wording materially changes model-assigned labels, especially near decision boundaries.

02. Not all metrics are reliable; some are too easy or too noisy under the benchmark's class distribution.

03. Ranking claims become more robust once the appropriate metrics have been identified and used.
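To make the second finding concrete, here is a minimal sketch of how a metric can be "too easy" under a skewed class distribution. The gold labels below are invented toy data, and plain accuracy and macro-averaged recall are illustrative stand-ins; they are not claimed to be among the five ordinal metrics the paper evaluates.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_recall(gold, pred):
    """Unweighted mean of per-class recall over the classes present in gold."""
    recalls = []
    for cls in sorted(set(gold)):
        idx = [i for i, g in enumerate(gold) if g == cls]
        recalls.append(sum(pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical, skewed toy labels (NOT taken from JF-ICR).
gold = [0] * 8 + [1] * 2

# A "model" that ignores its input and always predicts the majority class.
majority_label = Counter(gold).most_common(1)[0][0]
pred = [majority_label] * len(gold)

acc = accuracy(gold, pred)          # rewards the constant predictor
balanced = macro_recall(gold, pred)  # exposes that class 1 is never found

print(f"accuracy={acc:.2f} macro_recall={balanced:.2f}")
```

On this toy split the constant predictor reaches high accuracy while recovering none of the minority class, which is the kind of gap that makes a metric uninformative for ranking models.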

Abstract

As LLMs become credible readers of earnings calls, investor-relations Q&A, guidance, and disclosure language, supervised financial NLP benchmarks increasingly function as decision evidence for model selection and deployment. A hidden assumption is that gold labels make such evidence objective. This assumption breaks down when the benchmark ruler itself is sensitive to rubric wording, metric choice, or aggregation policy. We study this measurement risk on Japanese Financial Implicit-Commitment Recognition (JF-ICR; a pinned 253-item test split × 4 frontier LLMs × 5 rubrics × 3 temperatures × 5 ordinal metrics). Three findings follow. First, rubric wording materially changes model-assigned labels: R2–R3 agreement ranges from 70.0% to 83.4%, with the dominant movement near the +1 / 0 implicit-commitment boundary. This pattern is consistent with a pragmatic-boundary interpretation, but is…
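The rubric-agreement figure quoted above can be read as a simple percent-agreement statistic between the labels one model assigns under two rubrics. The sketch below assumes exactly that reading; the function names, the toy label vectors, and the integer label scale are hypothetical and not taken from the paper. It also counts the run configurations implied by the stated design, assuming the factors are fully crossed.

```python
from collections import Counter


def agreement_rate(labels_a, labels_b):
    """Share of items that receive the same label under both rubrics."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def label_transitions(labels_a, labels_b):
    """Count (label_under_a, label_under_b) pairs for items that changed."""
    return Counter((a, b) for a, b in zip(labels_a, labels_b) if a != b)


# Invented toy labels for one model under two rubrics (NOT JF-ICR data).
rubric_r2 = [1, 1, 0, 0, -1, 1, 0, 1, 0, -1]
rubric_r3 = [1, 0, 0, 0, -1, 0, 0, 1, 1, -1]

rate = agreement_rate(rubric_r2, rubric_r3)
moves = label_transitions(rubric_r2, rubric_r3)
print(f"agreement={rate:.1%}, most common move={moves.most_common(1)}")

# Size of the design stated in the abstract, assuming a full crossing.
n_items, n_models, n_rubrics, n_temperatures = 253, 4, 5, 3
n_configs = n_models * n_rubrics * n_temperatures
n_labels = n_items * n_configs
print(f"{n_configs} configurations, {n_labels} model-assigned labels")
```

Tabulating the transitions, rather than reporting only the agreement rate, is what lets one see whether disagreement concentrates at a single boundary such as +1 / 0.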
