Mind the Data Gap: Bridging LLMs to Enterprise Data Integration
Moe Kayali, Fabian Wenz, Nesime Tatbul, Çağatay Demiralp

TL;DR
This paper highlights the performance gap of large language models on real-world enterprise data, introduces the GOBY Benchmark to evaluate enterprise data integration, and proposes techniques to improve LLM performance in enterprise settings.
Contribution
The paper introduces the GOBY Benchmark for enterprise data integration and proposes three techniques (hierarchical annotation, runtime class-learning, and ontology synthesis) to improve LLM performance on private enterprise datasets.
Findings
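LLM-based methods perform markedly worse on real-world enterprise data than on public data
Benchmarks built on public data overestimate LLM performance; the GOBY Benchmark exposes this gap
With the proposed techniques deployed, performance on enterprise data becomes on par with that on public data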
Abstract
Leading large language models (LLMs) are trained on public data. However, most of the world's data is dark data that is not publicly accessible, mainly in the form of private organizational or enterprise data. We show that the performance of methods based on LLMs seriously degrades when tested on real-world enterprise datasets. Current benchmarks, based on public data, overestimate the performance of LLMs. We release a new benchmark dataset, the GOBY Benchmark, to advance discovery in enterprise data integration. Based on our experience with this enterprise benchmark, we propose techniques to uplift the performance of LLMs on enterprise data, including (1) hierarchical annotation, (2) runtime class-learning, and (3) ontology synthesis. We show that, once these techniques are deployed, the performance on enterprise data becomes on par with that of public data. The Goby benchmark can be…
Taxonomy
Topics: Digital Rights Management and Security · ERP Systems Implementation and Impact · Stonefly species taxonomy and ecology
Methods: Ontology
