Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins

Sahaana Suri; Ihab F. Ilyas; Christopher Ré; Theodoros Rekatsinas

arXiv:2106.01501 · cs.DB · June 4, 2021 · 1 citation

TL;DR

Ember is a system that automates context enrichment over structured data by using Transformer-based embeddings to perform keyless joins. It enables no-code ML pipelines across five domains and improves recall by up to 39% over alternatives.

Contribution

Ember is a no-code system that automates keyless joins through learned embeddings, making context enrichment over structured data easier to add to ML pipelines.

Findings

1. Enables no-code ML pipelines for five domains.
2. Improves recall by up to 39% over alternatives.
3. Requires minimal configuration, often just one line.

Abstract

Structured data, or data that adheres to a pre-defined schema, can suffer from fragmented context: information describing a single entity can be scattered across multiple datasets or tables tailored for specific business needs, with no explicit linking keys (e.g., primary key-foreign key relationships or heuristic functions). Context enrichment, or rebuilding fragmented context, using keyless joins is an implicit or explicit step in machine learning (ML) pipelines over structured data sources. This process is tedious, domain-specific, and lacks support in now-prevalent no-code ML systems that let users create ML pipelines using just input data and high-level configuration files. In response, we propose Ember, a system that abstracts and automates keyless joins to generalize context enrichment. Our key insight is that Ember can enable a general keyless join operator by constructing an…
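The idea of a similarity-based keyless join can be sketched in a few lines. This is a minimal illustration, not Ember's implementation: it swaps the learned Transformer embeddings for a toy bag-of-words vector, and the function names (`embed`, `cosine`, `keyless_join`) and the sample records are hypothetical, made up for this example. For each record on the left it returns the `k` most similar records on the right, with no shared key required.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a learned embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def keyless_join(left_rows, right_rows, k=1):
    """Pair each left record with its k most similar right records."""
    right_vecs = [embed(r) for r in right_rows]
    joined = []
    for left in left_rows:
        left_vec = embed(left)
        # Sort by descending similarity; break ties by original position.
        scored = sorted(
            ((cosine(left_vec, rv), idx) for idx, rv in enumerate(right_vecs)),
            key=lambda pair: (-pair[0], pair[1]),
        )
        joined.append((left, [right_rows[idx] for _, idx in scored[:k]]))
    return joined


# Hypothetical data: two tables describing the same entities with no linking key.
products = ["acme wireless mouse black", "zenith mechanical keyboard"]
reviews = [
    "the zenith keyboard has great mechanical switches",
    "this wireless mouse from acme is comfortable",
    "shipping was slow",
]

result = keyless_join(products, reviews, k=1)
for product, matches in result:
    print(product, "->", matches)
```

A real system would replace `embed` with a trained encoder and the exhaustive scan with an index for nearest-neighbor search; the join logic itself stays the same shape.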

Code & Models

Repositories

sahaana/ember (PyTorch, official)

Taxonomy

Topics: Data Quality and Management · Topic Modeling · Time Series Analysis and Forecasting