Developing an efficient corpus using Ensemble Data cleaning approach
Md Taimur Ahad

TL;DR
This paper presents an ensemble data cleaning approach to improve the quality of medical text corpora, achieving 94% accuracy, thereby enhancing NLP applications in medical information retrieval.
Contribution
It introduces a novel ensemble data cleaning method for medical datasets, significantly improving accuracy over traditional single-process techniques.
Findings
Ensemble technique achieves 94% accuracy in data cleaning.
Improved corpus quality enhances medical NLP applications.
Method outperforms single-process data cleaning approaches.
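This summary does not spell out which cleaning steps the ensemble combines, so the following is only a minimal illustrative sketch of the general contrast between a single cleaning process and several cleaning passes applied together. Every function name, the step list, and the sample sentence are hypothetical and are not taken from the paper.

```python
import re
import unicodedata
from typing import Callable, List

# A "cleaner" is any function that maps raw text to cleaner text.
Cleaner = Callable[[str], str]

# NOTE: all cleaners below are assumptions chosen for illustration;
# the paper's actual cleaning steps are not given in this summary.

def normalize_unicode(text: str) -> str:
    # Fold compatibility characters (e.g. full-width forms) to canonical ones.
    return unicodedata.normalize("NFKC", text)

def strip_html_tags(text: str) -> str:
    # Replace anything that looks like an HTML tag with a space.
    return re.sub(r"<[^>]+>", " ", text)

def collapse_whitespace(text: str) -> str:
    # Squeeze runs of whitespace into one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

def lowercase(text: str) -> str:
    return text.lower()

def clean_with_ensemble(text: str, cleaners: List[Cleaner]) -> str:
    # Apply each cleaner in order; later steps see earlier steps' output.
    for cleaner in cleaners:
        text = cleaner(text)
    return text

# Hypothetical ensemble: four passes chained together.
ENSEMBLE: List[Cleaner] = [
    normalize_unicode,
    strip_html_tags,
    collapse_whitespace,
    lowercase,
]

raw = "<p>Patient  reports\tFEVER and cough.</p>"

single_pass = collapse_whitespace(raw)          # one process only
ensemble_pass = clean_with_ensemble(raw, ENSEMBLE)

print(single_pass)
print(ensemble_pass)
```

In this toy case the single pass leaves the markup and casing untouched, while the chained passes remove them. How the paper measures the reported 94% accuracy is not described here, so the sketch makes no attempt to reproduce it.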
Abstract
Despite the observable benefit of Natural Language Processing (NLP) in processing large amounts of textual medical data within a limited time for information retrieval, only a handful of research efforts have been devoted to uncovering novel data-cleaning methods. Data cleaning is central to extracting validated information in NLP. Another observed limitation in the NLP domain is the scarcity of medical corpora that provide answers to a given medical question. Recognising these two limitations, this research aims to clean a medical dataset using ensemble techniques and to develop a corpus. The corpus is expected to answer questions based on the semantic relationship of corpus sequences. The data cleaning results in this research suggest that the ensemble technique provides the highest accuracy (94%) compared to the single process,…
Taxonomy
Topics: Neural Networks and Applications
