Semantic Data Processing with Holistic Data Understanding
Youran Sun, Sepanta Zeighami, Bhavya Chopra, Shreya Shankar, Aditya G. Parameswaran

TL;DR
HoldUp is a novel method for semantic data processing that analyzes records jointly to build a holistic understanding of the dataset, significantly improving accuracy over existing row-by-row approaches on real-world datasets.
Contribution
The paper introduces HoldUp, a clustering-based approach that leverages LLMs to interpret data context holistically, overcoming the limitations of traditional independent record processing.
Findings
HoldUp achieves up to 33% higher accuracy in classification tasks.
HoldUp improves scoring and clustering accuracy by 30%.
Experiments on 15 datasets demonstrate consistent outperformance over existing methods.
Abstract
Semantic operators have increasingly become integrated within data systems to enable processing data using Large Language Models (LLMs). Despite significant recent effort in improving these operators, their accuracy is limited due to a critical flaw in their implementation: lack of holistic data understanding. In existing systems, semantic operators often process each data record independently using an LLM, without considering data context, leveraging only the LLM's dataset-agnostic interpretation of the user-provided task. However, natural language is imprecise, so a task can only be accurately performed if it is correctly interpreted in the context of the dataset. For example, for classification and scoring tasks, which are typical semantic map tasks, the standard method of processing each record row by row yields inaccurate results in a wide range of datasets. We propose HoldUp, a new…
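The gap between row-by-row and context-aware interpretation can be illustrated with a toy sketch. This is not HoldUp's algorithm and makes no LLM calls: plain threshold functions stand in for the model, and the names `score_row_by_row`, `score_with_context`, `fixed_threshold`, and the `prices` data are all hypothetical. It only shows why a vague task such as "label the expensive items" may need the dataset to be interpreted correctly.

```python
from statistics import median


def score_row_by_row(records, fixed_threshold=100):
    """Dataset-agnostic stand-in: each record is judged alone against a
    fixed prior notion of "expensive" (an assumed cutoff of 100)."""
    return ["high" if r > fixed_threshold else "low" for r in records]


def score_with_context(records):
    """Context-aware stand-in: the cutoff is derived from the records
    themselves (here simply their median), so labels are relative."""
    cutoff = median(records)
    return ["high" if r > cutoff else "low" for r in records]


# Hypothetical dataset in which every price is below the fixed prior.
prices = [12, 15, 18, 40, 55]

print(score_row_by_row(prices))    # all records receive the same label
print(score_with_context(prices))  # labels separate within this dataset
```

In this toy setting the row-by-row scorer assigns every record the same label because its fixed cutoff does not match the data, while the context-aware scorer separates the two highest prices from the rest.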
