Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR
Sidi Chang, Peiying Zhu, Yuxiao Chen, Rongdong Chai

TL;DR
This paper investigates how measurement risks, such as rubric wording and metric choice, affect the reliability of supervised financial NLP benchmarks, emphasizing the need for careful evaluation governance.
Contribution
It introduces a reporting discipline for financial NLP benchmarks, highlighting the impact of rubric and metric sensitivity on model evaluation.
Findings
Rubric wording significantly alters model labels, especially near decision boundaries.
Not all metrics are reliable; some are too easy or too noisy under the class distribution.
Ranking claims are more robust after identifying and using the appropriate metrics.
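The rubric sensitivity described in the first finding can be quantified as the share of items that receive the same label under two rubrics, together with a count of which label pairs account for the disagreements. The following is a minimal sketch, not the authors' code: the function names, the integer label encoding, and the toy label lists are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def agreement_rate(labels_a, labels_b):
    """Fraction of items given the same label under two rubrics."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if not labels_a:
        raise ValueError("label lists must be non-empty")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def transition_counts(labels_a, labels_b):
    """Count (label under rubric A, label under rubric B) pairs that disagree."""
    return Counter((a, b) for a, b in zip(labels_a, labels_b) if a != b)


# Toy labels for one model under two rubrics (illustrative only, not paper data).
rubric_a = [1, 0, 1, -1, 0, 1, 2, 0]
rubric_b = [0, 0, 1, -1, 0, 0, 2, 1]

rate = agreement_rate(rubric_a, rubric_b)
moves = transition_counts(rubric_a, rubric_b)
print(f"agreement: {rate:.1%}")
print("most common label moves:", moves.most_common(2))
```

Inspecting the disagreement pairs, rather than the agreement rate alone, is what would show whether label movement concentrates at a single boundary such as +1 / 0.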
Abstract
As LLMs become credible readers of earnings calls, investor-relations Q&A, guidance, and disclosure language, supervised financial NLP benchmarks increasingly function as decision evidence for model selection and deployment. A hidden assumption is that gold labels make such evidence objective. This assumption breaks down when the benchmark ruler itself is sensitive to rubric wording, metric choice, or aggregation policy. We study this measurement risk on Japanese Financial Implicit-Commitment Recognition (JF-ICR; a pinned 253-item test split × 4 frontier LLMs × 5 rubrics × 3 temperatures × 5 ordinal metrics). Three findings follow. First, rubric wording materially changes model-assigned labels: R2–R3 agreement ranges from 70.0% to 83.4%, with the dominant movement near the +1 / 0 implicit-commitment boundary. This pattern is consistent with a pragmatic-boundary interpretation, but is…
