Semantic Data Processing with Holistic Data Understanding

Youran Sun; Sepanta Zeighami; Bhavya Chopra; Shreya Shankar; Aditya G. Parameswaran

arXiv:2604.02655·cs.DB·April 6, 2026

TL;DR

HoldUp performs semantic data processing with holistic data understanding: it analyzes records jointly rather than one at a time, improving accuracy over existing row-by-row approaches on real-world datasets.

Contribution

The paper introduces HoldUp, a clustering-based approach that uses LLMs to interpret a task in the context of the whole dataset, rather than processing each record independently as existing semantic operators do.

Findings

01

HoldUp achieves up to 33% higher accuracy in classification tasks.

02

HoldUp improves scoring and clustering accuracy by 30%.

03

Experiments on 15 datasets demonstrate consistent outperformance over existing methods.

Abstract

Semantic operators are increasingly integrated into data systems to enable data processing with Large Language Models (LLMs). Despite significant recent effort to improve these operators, their accuracy is limited by a critical flaw in their implementation: a lack of holistic data understanding. In existing systems, semantic operators often process each data record independently using an LLM, without considering data context, leveraging only the LLM's dataset-agnostic interpretation of the user-provided task. However, natural language is imprecise, so a task can be performed accurately only if it is correctly interpreted in the context of the dataset. For example, for classification and scoring tasks, which are typical semantic map tasks, the standard method of processing records row by row yields inaccurate results across a wide range of datasets. We propose HoldUp, a new…
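The row-by-row limitation described in the abstract can be illustrated with a toy sketch. This is not HoldUp's algorithm, which the truncated abstract does not describe; it only contrasts a per-record semantic map with one that first looks at the dataset. The function `toy_model` is a hypothetical, deterministic stand-in for an LLM call, and the fixed prior of 100.0 and the "twice the median" rule are assumptions chosen purely for illustration.

```python
from statistics import median
from typing import Optional


def toy_model(task: str, record: float, context: Optional[dict] = None) -> str:
    """Hypothetical stand-in for an LLM call (not a real API).

    Without context it applies a fixed, dataset-agnostic reading of
    "expensive" (assumed prior: above 100.0). With context it reads the
    task relative to the dataset (assumed rule: above twice the median).
    """
    threshold = 100.0 if context is None else 2 * context["median"]
    return "expensive" if record > threshold else "not expensive"


def map_row_by_row(task: str, records: list) -> list:
    # Each record is labeled independently; the model never sees the dataset.
    return [toy_model(task, r) for r in records]


def map_with_context(task: str, records: list) -> list:
    # A dataset-level summary is computed once and passed with every record.
    context = {"median": median(records)}
    return [toy_model(task, r, context) for r in records]


# Prices in a dataset of cheap grocery items with one clear outlier.
prices = [2.0, 3.0, 2.5, 40.0, 3.5]
task = "Is this item expensive?"

print(map_row_by_row(task, prices))
print(map_with_context(task, prices))
```

In this sketch the per-record map labels every item "not expensive" because 40.0 falls below the dataset-agnostic prior, while the context-aware map flags it because it is far above the dataset's median of 3.0. The same ambiguity applies to any task phrased in relative terms.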
