Towards Scalable Schema Mapping using Large Language Models

Christopher Buss; Mahdis Safari; Arash Termehchy; Stefan Lee; David Maier

arXiv:2505.24716 · cs.DB · June 2, 2025

TL;DR

This paper explores using large language models to make schema mapping in data integration more scalable and automated, and addresses three obstacles: inconsistent LLM outputs, the limited expressiveness of current mappings, and the computational cost of repeated LLM calls.

Contribution

It proposes sampling and aggregation methods to stabilize LLM outputs, strategies for producing more expressive mappings within limited context windows, and data type prefiltering to reduce the cost of repeated LLM calls.

Findings

01 Sampling and aggregation improve output consistency.

02 Data type prefiltering reduces LLM call costs.

03 Proposed strategies enable more expressive schema mappings.
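
To make the second finding concrete, here is a minimal sketch of data type prefiltering. It is an illustration under stated assumptions, not the authors' implementation: the type families in TYPE_FAMILY, the function names, and the example schemas are all hypothetical. The idea is to discard attribute pairs with incompatible types before any LLM call is made, so that only the surviving pairs need to be sent to the model.

```python
from itertools import product

# Hypothetical coarse type families; a real system would likely need a richer table.
TYPE_FAMILY = {
    "int": "numeric", "bigint": "numeric", "float": "numeric", "decimal": "numeric",
    "varchar": "text", "text": "text", "char": "text",
    "date": "temporal", "timestamp": "temporal",
    "boolean": "boolean",
}


def family(sql_type):
    """Map a column type to its coarse family, or 'unknown' if unlisted."""
    return TYPE_FAMILY.get(sql_type.lower(), "unknown")


def prefilter_pairs(source_attrs, target_attrs):
    """Return (source, target) attribute pairs whose type families agree.

    Pairs involving an unknown type are kept, so the filter errs on the
    side of not discarding a possible match.
    """
    kept = []
    for (s_name, s_type), (t_name, t_type) in product(
        source_attrs.items(), target_attrs.items()
    ):
        fs, ft = family(s_type), family(t_type)
        if fs == ft or "unknown" in (fs, ft):
            kept.append((s_name, t_name))
    return kept


# Example schemas (assumed for illustration only).
source = {"emp_id": "int", "full_name": "varchar", "hired": "date"}
target = {"id": "bigint", "name": "text", "start_date": "timestamp", "salary": "decimal"}

candidates = prefilter_pairs(source, target)
print(f"{len(candidates)} of {len(source) * len(target)} pairs remain for the LLM")
```

Keeping pairs with unknown types is a deliberate choice in this sketch: it trades some savings for a lower risk of filtering out a correct match.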

Abstract

The growing need to integrate information from a large number of diverse sources poses significant scalability challenges for data integration systems. These systems often rely on manually written schema mappings, which are complex, source-specific, and costly to maintain as sources evolve. While recent advances suggest that large language models (LLMs) can assist in automating schema matching by leveraging both structural and natural language cues, key challenges remain. In this paper, we identify three core issues with using LLMs for schema mapping: (1) inconsistent outputs due to sensitivity to input phrasing and structure, which we propose methods to address through sampling and aggregation techniques; (2) the need for more expressive mappings (e.g., GLaV), which strain the limited context windows of LLMs; and (3) the computational cost of repeated LLM calls, which we propose to…
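
The abstract does not spell out the sampling and aggregation techniques in the portion shown here, so the following is only a generic sketch of the idea: query the model several times with the schema attributes in a different order each time, then keep the attribute matches that a sufficient fraction of samples agree on. The function aggregate_matches, its parameters, and the stand-in model make_fake_llm are hypothetical names introduced for illustration; the stub replaces a real LLM call.

```python
import random
from collections import Counter


def aggregate_matches(ask_llm, source_attrs, target_attrs,
                      n_samples=5, min_support=0.6, seed=0):
    """Sample the matcher several times on shuffled inputs and keep
    the pairs proposed in at least `min_support` of the samples."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        s, t = source_attrs[:], target_attrs[:]
        rng.shuffle(s)  # vary attribute order to probe order sensitivity
        rng.shuffle(t)
        votes.update(set(ask_llm(s, t)))  # one vote per pair per sample
    return sorted(p for p, c in votes.items() if c / n_samples >= min_support)


def make_fake_llm():
    """Stand-in for an LLM call (assumption): stable on two matches,
    with a spurious extra match on every third call."""
    calls = {"n": 0}

    def fake_llm(source, target):
        calls["n"] += 1
        answer = [("emp_id", "id"), ("full_name", "name")]
        if calls["n"] % 3 == 0:
            answer.append(("hired", "salary"))  # inconsistent output
        return answer

    return fake_llm


stable = aggregate_matches(
    make_fake_llm(),
    ["emp_id", "full_name", "hired"],
    ["id", "name", "salary"],
)
print(stable)
```

In this toy run the spurious pair appears in only one of five samples and falls below the support threshold, while the two stable pairs appear in every sample and are kept.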
