Schemora: Schema Matching via Multi-Stage Recommendation and Metadata Enrichment Using Off-the-Shelf LLMs
Osman Erman Gungor, Derak Paulsen, William Kang

TL;DR
SCHEMORA is a schema matching framework that combines large language models with hybrid retrieval to improve accuracy and scalability without labeled training data, setting new state-of-the-art results on the MIMIC-OMOP benchmark.
Contribution
It introduces an open-source, prompt-based schema matching approach that combines LLMs with hybrid retrieval and metadata enrichment, and it provides practical insights for model selection.
Findings
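Achieves 7.49% higher HitRate@5 than the previous best result on the MIMIC-OMOP benchmark
Achieves 3.75% higher HitRate@3 than the previous best result on the same benchmark
To the authors' knowledge, the first LLM-based schema matching method with an open-source implementation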
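For readers unfamiliar with the metric, the following is a minimal sketch of HitRate@k under its usual definition: the fraction of source columns for which at least one correct target appears among the top k recommended candidates. The function name, the data layout, and the toy column names are illustrative assumptions, not taken from the SCHEMORA implementation or from the benchmark's ground truth.

```python
def hit_rate_at_k(ranked, gold, k):
    """Fraction of source columns whose top-k candidates contain a correct match.

    ranked: dict mapping a source column to its ordered candidate list (assumed layout)
    gold:   dict mapping a source column to the set of correct targets (assumed layout)
    """
    if not ranked:
        return 0.0
    hits = sum(
        1
        for src, candidates in ranked.items()
        if set(candidates[:k]) & gold.get(src, set())
    )
    return hits / len(ranked)


# Hypothetical toy rankings, for illustration only.
ranked = {
    "patients.dob": [
        "person.birth_datetime",
        "person.year_of_birth",
        "death.death_date",
    ],
    "admissions.admittime": [
        "visit_occurrence.visit_end_datetime",
        "visit_occurrence.visit_start_datetime",
    ],
    "patients.gender": [
        "person.race_concept_id",
        "person.ethnicity_concept_id",
        "person.gender_concept_id",
    ],
}
gold = {
    "patients.dob": {"person.birth_datetime"},
    "admissions.admittime": {"visit_occurrence.visit_start_datetime"},
    "patients.gender": {"person.gender_concept_id"},
}

hr1 = hit_rate_at_k(ranked, gold, 1)
hr2 = hit_rate_at_k(ranked, gold, 2)
hr3 = hit_rate_at_k(ranked, gold, 3)
```

In this toy data only one of the three source columns has its correct target ranked first, so the metric rises as k grows. This is why results are typically reported at several cutoffs, such as HitRate@3 and HitRate@5.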
Abstract
Schema matching is essential for integrating heterogeneous data sources and enhancing dataset discovery, yet it remains a complex and resource-intensive problem. We introduce SCHEMORA, a schema matching framework that combines large language models with hybrid retrieval techniques in a prompt-based approach, enabling efficient identification of candidate matches without relying on labeled training data or exhaustive pairwise comparisons. By enriching schema metadata and leveraging both vector-based and lexical retrieval, SCHEMORA improves matching accuracy and scalability. Evaluated on the MIMIC-OMOP benchmark, it establishes new state-of-the-art performance, with gains of 7.49% in HitRate@5 and 3.75% in HitRate@3 over previous best results. To our knowledge, this is the first LLM-based schema matching method with an open-source implementation, accompanied by analysis that underscores…
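The abstract does not say how the vector-based and lexical retrieval results are combined, so the sketch below is only one plausible reading. It ranks candidate target columns with two scorers and merges the two rankings with reciprocal rank fusion, a common fusion technique that is assumed here rather than confirmed by the source. Both scorers are deliberately simple stand-ins: token Jaccard overlap in place of a lexical retriever such as BM25, and cosine similarity over character trigrams in place of embedding similarity. The enriched descriptions, column names, and all function names are hypothetical, and the LLM prompting stage is not modeled.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercased alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_score(query, doc):
    """Jaccard token overlap (stand-in for a lexical retriever such as BM25)."""
    q, d = tokens(query), tokens(doc)
    return len(q & d) / len(q | d) if q | d else 0.0


def trigram_vector(text):
    """Character-trigram counts (stand-in for a learned embedding)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def vector_score(query, doc):
    """Cosine similarity between trigram count vectors."""
    a, b = trigram_vector(query), trigram_vector(doc)
    dot = sum(a[g] * b[g] for g in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_by(score_fn, query, candidates):
    """Candidate names ordered from best to worst under one scorer."""
    return sorted(
        candidates, key=lambda name: score_fn(query, candidates[name]), reverse=True
    )


def hybrid_top_k(query, candidates, k=5, rrf_k=60):
    """Fuse the lexical and vector rankings with reciprocal rank fusion (assumed)."""
    fused = Counter()
    for score_fn in (lexical_score, vector_score):
        for rank, name in enumerate(rank_by(score_fn, query, candidates), start=1):
            fused[name] += 1.0 / (rrf_k + rank)
    return [name for name, _ in fused.most_common(k)]


# Hypothetical enriched metadata: each target column carries a short description.
candidates = {
    "person.birth_datetime": "date and time of birth of the person",
    "person.gender_concept_id": "gender of the person",
    "visit_occurrence.visit_start_datetime": "date and time the visit started",
}
query = "patients.dob: date of birth of the patient"
top = hybrid_top_k(query, candidates, k=2)
```

Fusing ranks rather than raw scores avoids having to calibrate the two scorers against each other, which is one reason rank fusion is a common choice for hybrid retrieval. The shortlist returned by `hybrid_top_k` is what a later stage would then examine, so no exhaustive pairwise comparison over all columns is needed.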
