Text embedding models can be great data engineers
Iman Kazemian, Paritosh Ramanan, Murat Yildirim

TL;DR
This paper introduces ADEPT, an automated data engineering pipeline using text embeddings that simplifies and improves predictive modeling across various domains by reducing reliance on traditional, labor-intensive data processing steps.
Contribution
The paper presents ADEPT, a novel framework that leverages text embeddings and a variational information bottleneck to automate and enhance data engineering for predictive analytics.
Findings
ADEPT outperforms existing benchmarks across multiple datasets.
It provides robust predictive performance despite data quality issues.
ADEPT simplifies data pipelines, reducing engineering effort.
Abstract
Data engineering pipelines are essential, albeit costly, components of predictive analytics frameworks, requiring significant engineering time and domain expertise for tasks such as data ingestion, preprocessing, feature extraction, and feature engineering. In this paper, we propose ADEPT, an automated data engineering pipeline via text embeddings. At the core of the ADEPT framework is a simple yet powerful idea: the entropy of embeddings corresponding to a textually dense, raw-format representation of time series can be intuitively viewed as equivalent (or in many cases superior) to that of numerically dense vector representations obtained by data engineering pipelines. Consequently, ADEPT uses a two-step approach that (i) leverages text embeddings to represent the diverse data sources, and (ii) constructs a variational information bottleneck criterion to mitigate entropy…
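The two-step idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `toy_embed` is a hypothetical stand-in for a real text embedding model (it only hashes character trigrams), the latent size and `beta` weight are arbitrary assumptions, and the loss is the commonly used Gaussian form of the variational information bottleneck (prediction error plus a weighted KL term), which the truncated abstract does not spell out.

```python
import math
import random
import zlib


def serialize_window(values):
    """Render a numeric window as plain text, the way a raw log line might look."""
    return " ".join(f"{v:.3f}" for v in values)


def toy_embed(text, dim=32):
    """Hypothetical stand-in for a text embedding model:
    hashed character-trigram counts, L2-normalized."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        vec[zlib.crc32(text[i:i + 3].encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def linear(x, weights):
    """Apply a weight matrix stored as one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def vib_encode(x, w_mu, w_logvar, rng):
    """Map an embedding to a Gaussian latent and draw one reparameterized sample."""
    mu = linear(x, w_mu)
    logvar = linear(x, w_logvar)
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]
    return mu, logvar, z


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def vib_loss(y_hat, y, mu, logvar, beta=1e-3):
    """Squared prediction error plus beta-weighted compression penalty (assumed form)."""
    return (y_hat - y) ** 2 + beta * kl_to_standard_normal(mu, logvar)


# Assumed sizes and random, untrained weights; for illustration only.
rng = random.Random(0)
dim, latent = 32, 4
w_mu = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(latent)]
w_logvar = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(latent)]
w_out = [rng.gauss(0.0, 0.1) for _ in range(latent)]

window = [0.12, 0.15, 0.11, 0.42, 0.40]   # made-up sensor readings
target = 0.40                              # made-up regression target

x = toy_embed(serialize_window(window), dim)        # step (i): text -> embedding
mu, logvar, z = vib_encode(x, w_mu, w_logvar, rng)  # step (ii): bottleneck latent
y_hat = sum(w * zi for w, zi in zip(w_out, z))
loss = vib_loss(y_hat, target, mu, logvar)
print(f"loss = {loss:.6f}")
```

In a real setting the weights would be trained by minimizing this loss over many windows, and `toy_embed` would be replaced by an actual embedding model; the sketch only shows how the two steps connect.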
Taxonomy
Topics: Topic Modeling
Methods: Sparse Evolutionary Training
