Sudowoodo: Contrastive Self-supervised Learning for Multi-purpose Data Integration and Preparation
Runhui Wang, Yuliang Li, Jin Wang

TL;DR
Sudowoodo is a contrastive self-supervised learning framework that casts multiple data integration and preparation (DI&P) tasks as matching problems, reducing the need for labeled data while supporting tasks such as entity matching, data cleaning, and semantic type detection.
Contribution
The paper introduces Sudowoodo, a contrastive learning framework that captures diverse DI&P tasks under a single matching-based problem definition, yielding label-efficient data representations that can be adapted to each task.
Findings
Achieves state-of-the-art results in entity matching with minimal supervision
Outperforms specialized solutions in data cleaning and semantic type detection
Demonstrates high versatility across multiple DI&P tasks
Abstract
Machine learning (ML) is playing an increasingly important role in data management tasks, particularly in Data Integration and Preparation (DI&P). The success of ML-based approaches, however, heavily relies on the availability of large-scale, high-quality labeled datasets for different tasks. Moreover, the wide variety of DI&P tasks and pipelines oftentimes requires customizing ML solutions which can incur a significant cost for model engineering and experimentation. These factors inevitably hold back the adoption of ML-based approaches to new domains and tasks. In this paper, we propose Sudowoodo, a multi-purpose DI&P framework based on contrastive representation learning. Sudowoodo features a unified, matching-based problem definition capturing a wide range of DI&P tasks including Entity Matching (EM) in data integration, error correction in data cleaning, semantic type detection in…
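The abstract's two core ideas, contrastive representation learning and a matching-based problem definition, can be illustrated with a minimal sketch. It assumes a generic InfoNCE-style contrastive loss over cosine similarities and a simple similarity threshold for matching; the abstract does not specify the loss or the matching rule, so the function names, the temperature, and the threshold below are all hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss (an assumed, generic contrastive objective).

    anchors[i] is expected to match positives[i]; every other
    positives[j] in the batch acts as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def is_match(u, v, threshold=0.8):
    """Matching view of a task: two items match if their embeddings are close.

    The threshold is a hypothetical value chosen for illustration.
    """
    return cosine(u, v) >= threshold


# Toy 2-d embeddings standing in for encoded data items.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]    # correct pairs sit close together
shuffled = [[0.1, 0.9], [0.9, 0.1]]   # correct pairs sit far apart

loss_aligned = info_nce_loss(anchors, aligned)
loss_shuffled = info_nce_loss(anchors, shuffled)
```

On this toy data the loss is lower when each anchor is closest to its own positive, which is the property a contrastive objective rewards. Once embeddings behave this way, a task such as entity matching reduces to a similarity check like `is_match`.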
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Time Series Analysis and Forecasting
Methods: Contrastive Learning
