DataVinci: Learning Syntactic and Semantic String Repairs
Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu, Gust Verbruggen

TL;DR
DataVinci is an unsupervised system that detects and repairs string data errors by learning regular-expression patterns and leveraging large language models to handle both syntactic and semantic errors, improving data cleaning accuracy.
Contribution
It introduces a novel unsupervised approach combining pattern learning and language models for comprehensive string error detection and repair without user annotations.
Findings
Outperforms 7 baselines on multiple benchmarks.
Effectively handles both syntactic and semantic string errors.
Automatically derives data repairs without user input.
Abstract
String data is common in real-world datasets: 67.6% of values in a sample of 1.8 million real Excel spreadsheets from the web were represented as text. Systems that successfully clean such string data can have a significant impact on real users. While prior work has explored errors in string data, proposed approaches have often been limited to error detection or require that the user provide annotations, examples, or constraints to fix the errors. Furthermore, these systems have focused independently on syntactic errors or semantic errors in strings, but ignore that strings often contain both syntactic and semantic substrings. We introduce DataVinci, a fully unsupervised string data error detection and repair system. DataVinci learns regular-expression-based patterns that cover a majority of values in a column and reports values that do not satisfy such patterns as data errors.…
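The detection idea described in the abstract (learn patterns that cover most of a column, then flag values that fit none of them) can be illustrated with a minimal sketch. This is not DataVinci's actual pattern learner: the coarse `shape` abstraction, the greedy coverage rule, the `min_coverage` parameter, and the function names are all assumptions chosen for illustration. The sketch covers syntactic detection only, with no repair step and no language model.

```python
import re
from collections import Counter


def shape(value: str) -> str:
    """Abstract a string into a coarse regex (hypothetical helper).

    Runs of ASCII digits become [0-9]+, runs of ASCII letters become
    [A-Za-z]+, and every other character is kept as an escaped literal.
    """
    parts = []
    for tok in re.findall(r"[0-9]+|[A-Za-z]+|.", value, flags=re.DOTALL):
        if re.fullmatch(r"[0-9]+", tok):
            parts.append("[0-9]+")
        elif re.fullmatch(r"[A-Za-z]+", tok):
            parts.append("[A-Za-z]+")
        else:
            parts.append(re.escape(tok))
    return "".join(parts)


def detect_errors(column, min_coverage=0.5):
    """Return values that match none of the dominant patterns.

    Patterns are added from most to least frequent until together they
    cover at least `min_coverage` of the column (an assumed threshold).
    """
    if not column:
        return []
    counts = Counter(shape(v) for v in column)
    kept, covered = [], 0
    for pattern, n in counts.most_common():
        if covered / len(column) >= min_coverage:
            break
        kept.append(re.compile(pattern, flags=re.DOTALL))
        covered += n
    return [v for v in column if not any(p.fullmatch(v) for p in kept)]


dates = ["2021-03-04", "2020-11-30", "2019-01-15", "2022/07/09", "unknown"]
flagged = detect_errors(dates)
print(flagged)
```

In this toy column the hyphenated date shape covers three of five values, so the slash-separated date and the free-text entry are reported as outliers. A real system would also need to propose repairs and reason about semantic substrings, which this sketch does not attempt.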
Taxonomy
Topics: Data Quality and Management · Cloud Data Security Solutions
Methods: Repair
