OpenForge: Probabilistic Metadata Integration

Tianji Cong; Fatemeh Nargesian; Junjie Xing; H. V. Jagadish

arXiv:2412.09788 · cs.DB · December 16, 2024

TL;DR

OpenForge is a probabilistic framework that combines large language models and Markov Random Fields to improve the accuracy and consistency of metadata relationship integration across diverse data sources.

Contribution

OpenForge introduces a two-stage approach to metadata integration: LLMs supply prior beliefs about concept relationships, and a probabilistic graphical model refines them. The paper formalizes metadata integration as an optimization problem.

Findings

01. Outperforms GPT-4 by 25 F1-score points in metadata vocabulary matching.

02. Demonstrates effectiveness and efficiency on real-world datasets.

03. Captures relationship properties like transitivity probabilistically.
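
The prior-then-refine idea and the probabilistic handling of transitivity can be illustrated with a toy sketch. This is not the authors' implementation: the three concepts `a`, `b`, `c`, the prior values, the `VIOLATION_WEIGHT` penalty, and the function names are all hypothetical, and exact brute-force enumeration stands in for whatever inference procedure the paper actually uses. Each binary variable says whether a pair of concepts is equivalent; a soft factor down-weights assignments where two equivalences hold but the third, implied one does not.

```python
from itertools import product

# Hypothetical stage-one priors: P(pair is equivalent), e.g. from a classifier.
priors = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.3}

# Assumed soft penalty applied to assignments that break transitivity.
VIOLATION_WEIGHT = 0.05


def violates_transitivity(x_ab, x_bc, x_ac):
    """True when exactly two of the three equivalences hold.

    In that case the third equivalence is implied but missing.
    """
    return x_ab + x_bc + x_ac == 2


def posterior_marginals(priors, violation_weight=VIOLATION_WEIGHT):
    """Stage two (toy version): exact marginals of a three-variable MRF.

    Unary factors come from the priors; one ternary factor softly
    enforces transitivity. All 2^3 assignments are enumerated.
    """
    pairs = list(priors)
    totals = {pair: 0.0 for pair in pairs}
    partition = 0.0
    for assignment in product([0, 1], repeat=len(pairs)):
        score = 1.0
        for pair, value in zip(pairs, assignment):
            score *= priors[pair] if value else 1.0 - priors[pair]
        if violates_transitivity(*assignment):
            score *= violation_weight
        partition += score
        for pair, value in zip(pairs, assignment):
            if value:
                totals[pair] += score
    return {pair: totals[pair] / partition for pair in pairs}


posterior = posterior_marginals(priors)
for pair, prob in posterior.items():
    print(pair, round(prob, 3))
```

In this toy setup the weak prior on the `a`-`c` pair is pulled upward because the other two pairs are believed equivalent, while the beliefs on those two pairs are tempered slightly. Setting `violation_weight=1.0` removes the constraint and returns the priors unchanged. Enumeration is only feasible for a handful of variables; larger graphs need approximate inference.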

Abstract

Modern data stores increasingly rely on metadata for enabling diverse activities such as data cataloging and search. However, metadata curation remains a labor-intensive task, and the broader challenge of metadata maintenance -- ensuring its consistency, usefulness, and freshness -- has been largely overlooked. In this work, we tackle the problem of resolving relationships among metadata concepts from disparate sources. These relationships are critical for creating clean, consistent, and up-to-date metadata repositories, and a central challenge for metadata integration. We propose OpenForge, a two-stage prior-posterior framework for metadata integration. In the first stage, OpenForge exploits multiple methods including fine-tuned large language models to obtain prior beliefs about concept relationships. In the second stage, OpenForge refines these predictions by leveraging Markov…

Code & Models

Repositories

superctj/openforge
jax · Official