SCHEMORA: Schema Matching via Multi-Stage Recommendation and Metadata Enrichment Using Off-the-Shelf LLMs

Osman Erman Gungor; Derak Paulsen; William Kang

arXiv:2507.14376·cs.DB·July 22, 2025

TL;DR

SCHEMORA is a schema matching framework that combines large language models with hybrid retrieval to improve accuracy and scalability without requiring labeled training data. It sets new state-of-the-art results on the MIMIC-OMOP benchmark.

Contribution

The paper introduces an open-source, prompt-based schema matching approach that combines LLMs with hybrid retrieval, enriches schema metadata, and offers practical insights for model selection.

Findings

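01 Improves HitRate@5 by 7.49% over the previous best result on the MIMIC-OMOP benchmark

02 Improves HitRate@3 by 3.75% over the previous best result on the same benchmark

03 To the authors' knowledge, the first LLM-based schema matching method with an open-source implementation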
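
The findings are reported as HitRate@k. As a rough illustration of how such a metric is commonly computed (a sketch under the usual definition, not the authors' evaluation code), the function below counts a source attribute as a hit when at least one correct target appears among its top-k ranked candidates. The function name, the data layout, and the toy column pairs are assumptions made for this example.

```python
from typing import Dict, List, Set


def hit_rate_at_k(
    ranked: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of source attributes with a correct target in their top-k candidates.

    ranked: source attribute -> candidate targets, best first (hypothetical layout).
    gold:   source attribute -> set of correct targets.
    """
    if not gold:
        return 0.0
    hits = 0
    for source, correct_targets in gold.items():
        top_k = set(ranked.get(source, [])[:k])
        if top_k & correct_targets:
            hits += 1
    return hits / len(gold)


# Toy data for illustration only; these pairs are not taken from the paper.
toy_ranked = {
    "admissions.admittime": ["visit_occurrence.visit_start_date", "person.birth_datetime"],
    "patients.dob": ["visit_occurrence.visit_start_date", "person.birth_datetime"],
}
toy_gold = {
    "admissions.admittime": {"visit_occurrence.visit_start_date"},
    "patients.dob": {"person.birth_datetime"},
}

hit_at_1 = hit_rate_at_k(toy_ranked, toy_gold, k=1)
hit_at_2 = hit_rate_at_k(toy_ranked, toy_gold, k=2)
```

In the toy data, only the first source attribute has its correct target ranked first, so widening k from 1 to 2 picks up the second match as well.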

Abstract

Schema matching is essential for integrating heterogeneous data sources and enhancing dataset discovery, yet it remains a complex and resource-intensive problem. We introduce SCHEMORA, a schema matching framework that combines large language models with hybrid retrieval techniques in a prompt-based approach, enabling efficient identification of candidate matches without relying on labeled training data or exhaustive pairwise comparisons. By enriching schema metadata and leveraging both vector-based and lexical retrieval, SCHEMORA improves matching accuracy and scalability. Evaluated on the MIMIC-OMOP benchmark, it establishes new state-of-the-art performance, with gains of 7.49% in HitRate@5 and 3.75% in HitRate@3 over previous best results. To our knowledge, this is the first LLM-based schema matching method with an open-source implementation, accompanied by analysis that underscores…
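
The abstract mentions combining vector-based and lexical retrieval to shortlist candidates without exhaustive pairwise comparison. The sketch below illustrates one common way to fuse two such rankings, reciprocal rank fusion. It is an assumption-laden toy, not SCHEMORA's pipeline: the fusion rule, the token-overlap lexical score, the character-trigram vectors standing in for learned embeddings, and the column descriptions are all chosen for illustration and do not come from the paper.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_score(query: str, doc: str) -> float:
    """Jaccard overlap of tokens; a simple stand-in for a lexical retriever."""
    q, d = tokens(query), tokens(doc)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def trigram_vector(text: str) -> Counter:
    """Character-trigram counts; a stand-in for a learned embedding (assumption)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_by(scores: Dict[str, float]) -> List[str]:
    """Names sorted by descending score, ties broken by name for determinism."""
    return sorted(scores, key=lambda name: (-scores[name], name))


def hybrid_retrieve(query: str, candidates: Dict[str, str], k: int = 3, c: int = 60) -> List[str]:
    """Fuse lexical and vector rankings with reciprocal rank fusion; return the top k."""
    query_vec = trigram_vector(query)
    lexical = rank_by({n: lexical_score(query, d) for n, d in candidates.items()})
    vector = rank_by({n: cosine(query_vec, trigram_vector(d)) for n, d in candidates.items()})
    fused = {
        n: 1.0 / (c + lexical.index(n) + 1) + 1.0 / (c + vector.index(n) + 1)
        for n in candidates
    }
    return rank_by(fused)[:k]


# Hypothetical target columns with made-up descriptions.
toy_candidates = {
    "person.birth_datetime": "date and time of birth of the person",
    "visit_occurrence.visit_start_date": "date the visit started",
    "drug_exposure.quantity": "quantity of drug dispensed",
}
top_matches = hybrid_retrieve("patient date of birth", toy_candidates, k=2)
```

A shortlist produced this way could then be handed to an LLM prompt for final selection, which keeps the number of LLM calls proportional to the number of source attributes instead of the number of source-target pairs.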
