Data Cleaning for XML Electronic Dictionaries via Statistical Anomaly Detection
Michael Bloodgood, Benjamin Strauss

TL;DR
This paper describes six statistical anomaly detection systems that flag likely errors in electronic dictionaries stored in XML, using signals in the data such as uncommon characters and text length to make data cleaning less manual.
Contribution
The paper presents six systems for detecting errors in XML electronic dictionaries, each based on a different source of information, and evaluates them.
Findings
Systems effectively detect errors using multiple data signals.
Crowdsourcing and expert annotations validate system usefulness.
Error detection improves data cleaning efficiency.
Abstract
Many important forms of data are stored digitally in XML format. Errors can occur in the textual content of the data in the fields of the XML. Fixing these errors manually is time-consuming and expensive, especially for large amounts of data. There is increasing interest in the research, development, and use of automated techniques for assisting with data cleaning. Electronic dictionaries are an important form of data that is often stored in XML format and frequently contains errors introduced through a mixture of manual typographical entry errors and optical character recognition errors. In this paper we describe methods for flagging statistical anomalies as likely errors in electronic dictionaries stored in XML format. We describe six systems based on different sources of information. The systems detect errors using various signals in the data including uncommon characters, text length,…
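To make two of the signals named in the abstract concrete (uncommon characters and text length), here is a minimal Python sketch. It is not the authors' implementation: the summary does not say how the six systems score anomalies, so the relative-frequency cutoff, the z-score test, the function names, and the `hw` tag used in the usage note are all assumptions chosen for illustration.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def collect_fields(xml_text):
    """Group the non-empty text of every element by its tag name."""
    fields = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter():
        text = (elem.text or "").strip()
        if text:
            fields[elem.tag].append(text)
    return fields


def flag_uncommon_chars(texts, min_rel_freq=0.01):
    """Flag texts containing a character that is rare within this field.

    Assumption: "rare" means a relative frequency below min_rel_freq
    among all characters seen in the field. The paper may score this
    signal differently.
    """
    counts = Counter(ch for t in texts for ch in t)
    total = sum(counts.values())
    rare = {ch for ch, n in counts.items() if n / total < min_rel_freq}
    return [t for t in texts if any(ch in rare for ch in t)]


def flag_length_outliers(texts, z_threshold=3.0):
    """Flag texts whose length is far from the field's mean length.

    Assumption: a plain z-score on character count stands in for the
    text-length signal.
    """
    if not texts:
        return []
    lengths = [len(t) for t in texts]
    mean = sum(lengths) / len(lengths)
    std = math.sqrt(sum((n - mean) ** 2 for n in lengths) / len(lengths))
    if std == 0:
        return []  # every text has the same length, so nothing stands out
    return [t for t, n in zip(texts, lengths) if abs(n - mean) / std > z_threshold]
```

A typical use would be to call `collect_fields` on a dictionary file's contents and then run both detectors on each field separately, for example on the texts of a hypothetical `hw` (headword) tag, so that what counts as rare or unusually long is judged relative to that field alone. Flagged texts are candidates for human review, not confirmed errors.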
