TL;DR
This paper investigates how regional spelling differences, such as color versus colour, affect neural retrieval models. It finds that models often generalize well despite a bias toward American spelling in their training data, while the effect of spelling normalization varies by model type.
Contribution
It provides a systematic analysis of the impact of regional spelling conventions on neural retrieval models and the effects of spelling normalization on their performance.
Findings
American spelling conventions are more prevalent in datasets.
Retrieval models often generalize well across spelling variants despite this bias in the training data.
Normalization affects models differently, with lexical models improving and dense retrievers unaffected.
Abstract
One advantage of neural ranking models is that they are meant to generalise well in situations of synonymity, i.e., where two words have similar or identical meanings. In this paper, we investigate and quantify how well various ranking models perform in a clear-cut case of synonymity: when words are simply expressed in different surface forms due to regional differences in spelling conventions (e.g., color vs colour). We first explore the prevalence of American and British English spelling conventions in datasets used for the pre-training, training and evaluation of neural retrieval methods, and find that American spelling conventions are far more prevalent. Despite these biases in the training data, we find that retrieval models often generalise well in this case of synonymity. We explore the effect of document spelling normalisation in retrieval and observe that all models are affected…
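To make the idea of document spelling normalisation concrete, here is a minimal sketch of rewriting text into a single spelling convention before indexing. The excerpt above does not describe the authors' actual normalisation procedure, so everything here is an assumption for illustration: the function names `normalise_spelling` and `invert`, and the tiny three-entry word list, are hypothetical, and a real setup would need a far larger mapping.

```python
import re

# Hypothetical, deliberately tiny British -> American mapping (illustration only;
# not the word list used in the paper).
BRITISH_TO_AMERICAN = {
    "colour": "color",
    "generalise": "generalize",
    "normalisation": "normalization",
}


def invert(mapping):
    """Flip a spelling mapping, e.g. to normalise towards British spelling."""
    return {target: source for source, target in mapping.items()}


def normalise_spelling(text, mapping=BRITISH_TO_AMERICAN):
    """Replace whole-word matches of the mapping keys in `text`.

    Matching is case-insensitive; only an initial capital is carried over
    to the replacement (an all-caps word becomes capitalised).
    """
    if not mapping:
        return text
    # \b anchors keep longer words such as "colourful" untouched.
    alternatives = "|".join(re.escape(word) for word in mapping)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)

    def swap(match):
        word = match.group(0)
        replacement = mapping[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement

    return pattern.sub(swap, text)


if __name__ == "__main__":
    doc = "Colour matters when models generalise."
    print(normalise_spelling(doc))
    print(normalise_spelling("color", invert(BRITISH_TO_AMERICAN)))
```

In this sketch, normalising documents towards the queries' convention or away from it is only a matter of which mapping is passed in, which mirrors the two directions of normalisation that the abstract distinguishes.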
