Schema Matching with Large Language Models: an Experimental Study
Marcel Parciak, Brecht Vandevoort, Frank Neven, Liesbet M. Peeters, Stijn Vansummeren

TL;DR
This study evaluates the effectiveness of large language models in schema matching tasks, demonstrating their potential to assist data engineers by identifying semantic correspondences using only schema names and descriptions.
Contribution
The paper introduces a benchmark and several prompting strategies for LLM-based schema matching, and analyzes their performance and limitations on a health-domain dataset.
Findings
LLMs' matching quality is sensitive to context scope
Newer LLM versions improve decisiveness
Certain prompting strategies balance verification effort and matching success
Abstract
Large Language Models (LLMs) have shown useful applications in a variety of tasks, including data wrangling. In this paper, we investigate the use of an off-the-shelf LLM for schema matching. Our objective is to identify semantic correspondences between elements of two relational schemas using only names and descriptions. Using a newly created benchmark from the health domain, we propose different so-called task scopes. These are methods for prompting the LLM to do schema matching, which vary in the amount of context information contained in the prompt. Using these task scopes we compare LLM-based schema matching against a string similarity baseline, investigating matching quality, verification effort, decisiveness, and complementarity of the approaches. We find that matching quality suffers from a lack of context information, but also from providing too much context information. In…
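To make the string similarity baseline mentioned in the abstract concrete, here is a minimal sketch of name-based matching between two attribute lists. It is an illustration under stated assumptions, not the paper's implementation: the similarity measure (Python's `difflib.SequenceMatcher`), the threshold of 0.6, the helper names `name_similarity` and `match_attributes`, and the example attribute names are all hypothetical choices.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] between two attribute names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_attributes(source, target, threshold=0.6):
    """Return (source, target, score) triples whose score meets the threshold.

    The threshold is an assumed value for illustration; it is not taken
    from the paper.
    """
    matches = []
    for s in source:
        for t in target:
            score = name_similarity(s, t)
            if score >= threshold:
                matches.append((s, t, round(score, 3)))
    # Highest-scoring candidates first, so a reviewer sees the best ones early.
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Hypothetical attribute names, loosely in the style of health-domain schemas.
source_attrs = ["patient_id", "birth_date", "gender"]
target_attrs = ["person_id", "date_of_birth", "sex"]

candidates = match_attributes(source_attrs, target_attrs)
for s, t, score in candidates:
    print(f"{s} -> {t} ({score})")
```

A purely lexical measure like this cannot relate names such as `gender` and `sex`, which share almost no characters. That kind of semantic correspondence is what the LLM-based approach is meant to recover from names and descriptions.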
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
