Text embedding models can be great data engineers

Iman Kazemian; Paritosh Ramanan; Murat Yildirim

arXiv:2505.14802·cs.LG·May 22, 2025

TL;DR

This paper introduces ADEPT, an automated data engineering pipeline using text embeddings that simplifies and improves predictive modeling across various domains by reducing reliance on traditional, labor-intensive data processing steps.

Contribution

The paper presents ADEPT, a novel framework that leverages text embeddings and a variational information bottleneck to automate and enhance data engineering for predictive analytics.

Findings

1. ADEPT outperforms existing benchmarks across multiple datasets.
2. It provides robust predictive performance despite data quality issues.
3. ADEPT simplifies data pipelines, reducing engineering effort.

Abstract

Data engineering pipelines are essential, albeit costly, components of predictive analytics frameworks, requiring significant engineering time and domain expertise for tasks such as data ingestion, preprocessing, feature extraction, and feature engineering. In this paper, we propose ADEPT, an automated data engineering pipeline via text embeddings. At the core of the ADEPT framework is a simple yet powerful idea: the entropy of embeddings corresponding to a textually dense raw-format representation of time series can be intuitively viewed as equivalent (or in many cases superior) to that of numerically dense vector representations obtained by data engineering pipelines. Consequently, ADEPT uses a two-step approach that (i) leverages text embeddings to represent the diverse data sources, and (ii) constructs a variational information bottleneck criterion to mitigate entropy…
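The two steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract is truncated before the criterion is stated, so the code below assumes the standard variational information bottleneck form (a prediction loss plus a beta-weighted KL term to a standard normal prior). The names `serialize_window`, `toy_text_embedding`, and `vib_loss` are hypothetical, and the hashed-trigram embedder is only a stand-in for a real text embedding model.

```python
import hashlib
import numpy as np

def serialize_window(values, name="sensor"):
    """Render a raw time-series window as plain text (hypothetical format)."""
    return f"{name}: " + " ".join(f"{v:.3f}" for v in values)

def toy_text_embedding(text, dim=16):
    """Stand-in for a text embedding model: hashed character trigrams, L2-normalized."""
    vec = np.zeros(dim)
    for i in range(len(text) - 2):
        digest = hashlib.md5(text[i:i + 3].encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "little") % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def vib_loss(embedding, target, W_mu, W_logvar, w_out, beta=1e-3, rng=None):
    """Squared-error prediction loss plus a beta-weighted KL compression term (assumed form)."""
    rng = np.random.default_rng(0) if rng is None else rng
    mu = embedding @ W_mu            # mean of the stochastic latent code
    log_var = embedding @ W_logvar   # log-variance of the latent code
    # Reparameterization: sample z = mu + sigma * eps with eps ~ N(0, I)
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    prediction = z @ w_out
    task = (prediction - target) ** 2
    kl = gaussian_kl_to_standard_normal(mu, log_var)
    return task + beta * kl, kl

# Usage: embed one serialized window and evaluate the loss with random weights.
rng = np.random.default_rng(0)
text = serialize_window([0.12, 0.15, 0.11, 0.48])
emb = toy_text_embedding(text)
W_mu = rng.normal(scale=0.1, size=(16, 4))
W_logvar = rng.normal(scale=0.1, size=(16, 4))
w_out = rng.normal(scale=0.1, size=4)
loss, kl = vib_loss(emb, target=1.0, W_mu=W_mu, W_logvar=W_logvar, w_out=w_out, rng=rng)
```

In this sketch `beta` trades predictive fit against compression of the latent code; in practice the weights would be trained by gradient descent on the loss rather than drawn at random.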

Taxonomy

Topics: Topic Modeling

Methods: Sparse Evolutionary Training