Assessing the Impact of the Quality of Textual Data on Feature Representation and Machine Learning Models
Tabinda Sarwar, Antonio Jose Jimeno Yepes, Lawrence Cavedon

TL;DR
This study investigates how textual data quality affects feature representation and machine learning performance, demonstrating that low-quality data significantly degrades model accuracy, especially at error rates above 10%.
Contribution
It introduces a token-level error metric and utilizes a large language model to evaluate and improve dataset quality, highlighting the importance of data quality assessment in real-world applications.
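As a rough illustration of what a token-level error metric could look like, the sketch below counts the fraction of tokens that do not appear in a reference vocabulary. This summary does not give the paper's exact definition, so the function name `token_error_rate`, the regex tokenizer, and the vocabulary-lookup criterion are all assumptions made for illustration.

```python
import re

def token_error_rate(text, vocabulary):
    """Return the fraction of alphabetic tokens not found in `vocabulary`.

    Assumption: a token counts as an error when it is missing from a
    reference vocabulary. The paper's own criterion may differ.
    """
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    if not tokens:
        return 0.0  # avoid division by zero on empty input
    errors = sum(1 for token in tokens if token not in vocabulary)
    return errors / len(tokens)

# Hypothetical vocabulary and clinical note, for illustration only.
vocab = {"patient", "was", "given", "aspirin", "for", "pain"}
note = "Patient was givn asprin for pain"
rate = token_error_rate(note, vocab)  # 2 of 6 tokens are out of vocabulary
```

A metric of this kind can be compared directly against the 10% threshold reported in the findings.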
Findings
Models perform well with error rates below 10%.
Performance declines sharply at error rates above 10%.
LLMs can effectively detect and correct textual errors.
Abstract
Background: Data collected in controlled settings typically results in high-quality datasets. However, in real-world applications, the quality of data collection is often compromised. It is well established that the quality of a dataset significantly impacts the performance of machine learning models. Methods: A rudimentary error rate metric was developed to evaluate textual dataset quality at the token level. The Mixtral large language model (LLM) was used to quantify and correct errors in low-quality datasets. The study analyzed two healthcare datasets: the high-quality MIMIC-III public hospital dataset and a lower-quality private dataset from Australian aged care homes (ACH). Errors were systematically introduced into MIMIC at varying rates, while the ACH dataset quality was improved using the LLM. Results: For the sampled 35,774 and 6,336 patients from the MIMIC and ACH datasets…
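The idea of introducing errors at a controlled rate can be sketched as follows. The corruption scheme here (replacing one character in a randomly chosen token) and the names `inject_errors` and `corrupt_token` are assumptions for illustration; the abstract does not specify how the authors generated errors.

```python
import random
import string

def corrupt_token(token, rng):
    """Replace one character with a different lowercase letter."""
    pos = rng.randrange(len(token))
    choices = [c for c in string.ascii_lowercase if c != token[pos]]
    return token[:pos] + rng.choice(choices) + token[pos + 1:]

def inject_errors(tokens, rate, seed=0):
    """Corrupt roughly `rate` of the tokens, chosen without replacement.

    Assumption: a single-character substitution stands in for a
    real-world typo. Tokens must be non-empty strings.
    """
    rng = random.Random(seed)  # seeded so runs are reproducible
    n_errors = round(rate * len(tokens))
    positions = rng.sample(range(len(tokens)), n_errors)
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = corrupt_token(corrupted[i], rng)
    return corrupted

# Hypothetical note, for illustration only.
clean = "the patient was admitted with chest pain and mild fever".split()
noisy = inject_errors(clean, rate=0.2)
changed = sum(a != b for a, b in zip(clean, noisy))  # number of corrupted tokens
```

Sweeping `rate` over a range of values and retraining a model at each step is one way to reproduce the kind of degradation curve the findings describe.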
Taxonomy
Topics: Data Quality and Management
