FinNuE: Exposing the Risks of Using BERTScore for Numerical Semantic Evaluation in Finance
Yu-Shiang Huang, Yun-Yu Lee, Tzu-Hsin Chou, Che Lin, Chuan-Ju Wang

TL;DR
This paper reveals that BERTScore, a popular semantic similarity metric, poorly detects critical numerical differences in financial texts, highlighting the need for numerically-aware evaluation methods in financial NLP.
Contribution
We introduce FinNuE, a diagnostic dataset with controlled numerical perturbations, and demonstrate BERTScore's failure to capture important numerical variations in financial contexts.
Findings
BERTScore often assigns high similarity to financially divergent texts.
Embedding-based metrics lack sensitivity to numerical differences in finance.
FinNuE effectively exposes the limitations of current semantic evaluation metrics.
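One way to see why high scores are plausible, using the standard greedy-matching definition of BERTScore (recall shown; precision is symmetric), where $x$ is the reference, $\hat{x}$ the candidate, and embeddings are normalized so the inner product is cosine similarity:

$$
R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j
$$

As an idealized illustration (an assumption for exposition, not a result from the paper): suppose both sentences have $n$ tokens, differ only in one numeric token whose best match has similarity $s$, and every other token matches perfectly. Real contextual embeddings shift slightly, so this only approximates actual scores. Then

$$
R_{\mathrm{BERT}} = \frac{(n - 1) + s}{n}, \qquad \text{e.g. } n = 20,\ s = 0.5 \;\Rightarrow\; R_{\mathrm{BERT}} = \frac{19.5}{20} = 0.975 .
$$

A single changed number is averaged against all the unchanged tokens, so the score stays high even when the financial meaning changes.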
Abstract
BERTScore has become a widely adopted metric for evaluating semantic similarity between natural language sentences. However, we identify a critical limitation: BERTScore exhibits low sensitivity to numerical variation, a significant weakness in finance where numerical precision directly affects meaning (e.g., distinguishing a 2% gain from a 20% loss). We introduce FinNuE, a diagnostic dataset constructed with controlled numerical perturbations across earnings calls, regulatory filings, social media, and news articles. Using FinNuE, we demonstrate that BERTScore fails to distinguish semantically critical numerical differences, often assigning high similarity scores to financially divergent text pairs. Our findings reveal fundamental limitations of embedding-based metrics for finance and motivate numerically-aware evaluation frameworks for financial NLP.
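To make "controlled numerical perturbations" concrete, here is a minimal Python sketch of one way such text pairs could be generated. The abstract does not describe the perturbation scheme actually used in FinNuE; the multiplicative scaling and the names `perturb_numbers` and `make_pairs` are assumptions made for illustration only.

```python
import re

# Matches an integer or decimal number, e.g. "2" or "1.5".
# Limitation of this sketch: it also matches digits inside tokens
# such as "Q3" or years like "2024", which a real pipeline would skip.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def perturb_numbers(text, factor):
    """Return a copy of `text` with every number multiplied by `factor`.

    Hypothetical helper: scaling is one possible perturbation, not
    necessarily the one used to build FinNuE.
    """
    def scale(match):
        original = match.group(0)
        value = float(original) * factor
        # Preserve the number of decimal places of the original token.
        decimals = len(original.split(".")[1]) if "." in original else 0
        return f"{value:.{decimals}f}"

    return NUMBER.sub(scale, text)


def make_pairs(text, factors=(2, 10)):
    """Build (original, perturbed) pairs, one per scaling factor."""
    return [(text, perturb_numbers(text, f)) for f in factors]


if __name__ == "__main__":
    for original, perturbed in make_pairs("Revenue grew 2% to 1.5 billion"):
        print(original, "->", perturbed)
```

Each pair differs only in its numbers, so any similarity metric that scores the pair near its maximum is ignoring a financially meaningful change. The pairs could then be scored with any BERTScore implementation.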
Taxonomy
Topics: Topic Modeling · Stock Market Forecasting Methods · Machine Learning in Healthcare
