How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection
Nouran Khallaf, Serge Sharoff

TL;DR
This paper investigates the robustness of BERT-based models to noisy data in multilingual sentence difficulty detection, evaluating various denoising techniques and introducing a large multilingual corpus for this task.
Contribution
It provides a comprehensive framework for assessing denoising strategies in multilingual sentence difficulty detection and releases the largest multilingual corpus for this purpose.
Findings
GMM-based filtering significantly improves performance on small datasets.
Pre-trained models are inherently robust to noise, so denoising yields only marginal gains on large datasets.
Removing noisy sentences yields cleaner datasets, which improves difficulty prediction.
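To make the GMM-based filtering finding concrete, here is a minimal sketch of one common way such filtering is done: fit a two-component Gaussian mixture to per-sample training losses and keep the samples that most likely belong to the low-loss component. This is an assumption about the general technique, not the authors' exact procedure; the function names, the 0.5 threshold, and the loss values are hypothetical.

```python
import math


def fit_two_gaussians(losses, iters=50):
    """Fit a 1-D, two-component Gaussian mixture to losses with plain EM.

    Returns (means, responsibilities), where responsibilities[i][k] is the
    posterior probability that sample i belongs to component k.
    """
    lo, hi = min(losses), max(losses)
    spread = (hi - lo) or 1.0          # avoid zero variance if all losses match
    means = [lo, hi]                   # start one component at each extreme
    variances = [spread ** 2 / 4.0] * 2
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: posterior of each component for each loss value.
        resp = []
        for x in losses:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk
            variances[k] = max(var, 1e-6)  # floor keeps the density finite
            weights[k] = nk / len(losses)
    return means, resp


def select_clean(losses, threshold=0.5):
    """Return indices of samples assigned to the low-loss ("clean") component."""
    means, resp = fit_two_gaussians(losses)
    clean = 0 if means[0] <= means[1] else 1
    return [i for i, r in enumerate(resp) if r[clean] > threshold]


# Hypothetical per-sentence losses: most are small, three are suspiciously large.
losses = [0.10, 0.15, 0.20, 0.12, 0.18, 2.1, 2.4, 0.14, 2.2, 0.16]
kept = select_clean(losses)
print(kept)
```

In practice the losses would come from a model trained for a few epochs on the noisy labels, and the retained indices would define the filtered training set.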
Abstract
Noisy training data can significantly degrade the performance of language-model-based classifiers, particularly in non-topical classification tasks. In this study we designed a methodological framework to assess the impact of denoising. More specifically, we explored a range of denoising strategies for sentence-level difficulty detection, using training data derived from document-level difficulty annotations obtained through noisy crowdsourcing. Beyond monolingual settings, we also address cross-lingual transfer, where a multilingual language model is trained in one language and tested in another. We evaluate several noise reduction techniques, including Gaussian Mixture Models (GMM), Co-Teaching, Noise Transition Matrices, and Label Smoothing. Our results indicate that while BERT-based models exhibit inherent robustness to noise, incorporating explicit noise detection can further…
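Of the techniques the abstract lists, label smoothing is the simplest to illustrate. The sketch below shows the standard formulation, in which a fraction epsilon of the probability mass is spread uniformly over all classes before computing cross-entropy. The number of difficulty classes (3), the value of epsilon (0.1), and the function names are illustrative assumptions, not details taken from the paper.

```python
import math


def smooth_labels(label, num_classes, epsilon=0.1):
    """Turn a hard class index into a smoothed target distribution.

    Each class receives epsilon / num_classes; the labelled class also keeps
    the remaining 1 - epsilon of the probability mass.
    """
    off = epsilon / num_classes
    return [(1.0 - epsilon + off) if k == label else off for k in range(num_classes)]


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


# Hypothetical example: 3 difficulty levels, annotated label is class 1.
target = smooth_labels(1, num_classes=3, epsilon=0.1)
predicted = [0.2, 0.7, 0.1]  # made-up model output probabilities
print(target)
print(cross_entropy(target, predicted))
```

Because the smoothed target never puts all of its mass on the annotated class, a mislabeled sentence pulls the model less strongly toward the wrong answer than a hard one-hot target would.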
Taxonomy
Topics: Text Readability and Simplification · Topic Modeling · Artificial Intelligence in Healthcare and Education
