Automatic String Data Validation with Pattern Discovery
Xinwei Lin, Jing Zhao, Peng Di, Chuan Xiao, Rui Mao, Yan Ji, Makoto Onizuka, Zishuo Ding, Weiyi Shang, Jianbin Qin

TL;DR
This paper introduces an automatic string data validation system that uses pattern discovery to verify data correctness in enterprise pipelines, enabling early error detection and reducing manual troubleshooting effort.
Contribution
It presents a novel self-validate data management system with incremental pattern discovery techniques tailored for semi-structural string data in large-scale enterprise environments.
Findings
Effective detection of erroneous data in industrial datasets
High accuracy in pattern discovery and validation
Significant reduction in manual data error investigation
Abstract
In enterprise data pipelines, data insertions occur periodically and may impact downstream services if data quality issues are not addressed. Typically, such problems can be investigated and fixed by on-call engineers, but locating the cause of such problems and fixing errors are often time-consuming. Therefore, automatic data validation is a better solution to defend the system and downstream services by enabling early detection of errors and providing detailed error messages for quick resolution. This paper proposes a self-validate data management system with automatic pattern discovery techniques to verify the correctness of semi-structural string data in enterprise data pipelines. Our solution extracts patterns from historical data and detects erroneous incoming data in a top-down fashion. High-level information of historical data is analyzed to discover the format skeleton of…
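The idea of extracting patterns from historical data and then checking incoming data against them can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the character-class skeleton, the function names (`skeleton`, `discover_patterns`, `validate`), and the `min_support` threshold are all assumptions chosen only to make the concept concrete.

```python
from collections import Counter


def skeleton(value: str) -> str:
    """Map a string to a coarse format skeleton (hypothetical scheme).

    Runs of digits become "D+", runs of letters become "L+",
    and every other character is kept literally.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            token = "D+"
        elif ch.isalpha():
            token = "L+"
        else:
            out.append(ch)  # punctuation and separators stay as-is
            continue
        # Collapse consecutive characters of the same class into one token.
        if not out or out[-1] != token:
            out.append(token)
    return "".join(out)


def discover_patterns(history, min_support=0.1):
    """Keep skeletons covering at least `min_support` of the historical values.

    `min_support` is an assumed knob, not a parameter from the paper.
    """
    counts = Counter(skeleton(v) for v in history)
    threshold = min_support * len(history)
    return {pattern for pattern, n in counts.items() if n >= threshold}


def validate(value, patterns):
    """Return (is_valid, message) for one incoming value."""
    sk = skeleton(value)
    if sk in patterns:
        return True, "ok"
    return False, f"skeleton {sk!r} matches none of {sorted(patterns)}"


history = ["2024-01-15", "2024-02-03", "2023-12-31"]
patterns = discover_patterns(history)
print(validate("2024-03-09", patterns))
print(validate("2024/03/09", patterns))
```

In this sketch the second value is rejected because its separator differs from every skeleton seen in the history, and the returned message names the offending skeleton, which loosely mirrors the abstract's goal of detailed error messages for quick resolution.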
Taxonomy
Topics: Natural Language Processing Techniques · Algorithms and Data Compression · Network Packet Processing and Optimization
