TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents

Yizhu Jiao; Sha Li; Sizhe Zhou; Heng Ji; Jiawei Han

arXiv:2510.24014·cs.CL·October 31, 2025

TL;DR

TEXT2DB introduces an integration-aware information extraction framework that adapts to diverse database schemas using large language model agents, enabling dynamic database updates based on user instructions and document sets.

Contribution

The paper presents TEXT2DB, a novel formulation of IE that focuses on integrating extraction output with a target database, along with OPAL, an LLM agent framework for schema-adaptive extraction and database updating.

Findings

01 OPAL successfully adapts to various database schemas.

02 The benchmark evaluates data infilling, row population, and column addition tasks.

03 Challenges include handling large databases and extraction hallucination.

Abstract

The task of information extraction (IE) is to extract structured knowledge from text. However, it is often not straightforward to utilize IE output due to the mismatch between the IE ontology and the downstream application needs. We propose a new formulation of IE, TEXT2DB, that emphasizes the integration of IE output and the target database (or knowledge base). Given a user instruction, a document set, and a database, our task requires the model to update the database with values from the document set to satisfy the user instruction. This task requires understanding user instructions for what to extract and adapting to the given DB/KB schema for how to extract on the fly. To evaluate this new task, we introduce a new benchmark featuring common demands such as data infilling, row population, and column addition. In addition, we propose an LLM agent framework, OPAL (Observe-Plan-Analyze LLM)…
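To make the three benchmark demands concrete, here is a minimal sketch of what data infilling, row population, and column addition could look like as updates against a small SQLite table. It is not the paper's OPAL implementation: the `apply_update` helper, the update-record format, the `company` table, and all values are hypothetical, and the extraction step that would produce these records from documents is omitted.

```python
import sqlite3


def apply_update(conn, update):
    """Apply one illustrative update record (hypothetical format) to the database."""
    table = update["table"]
    kind = update["kind"]
    # Identifiers are interpolated directly because SQL parameters cannot bind
    # table or column names; this sketch assumes they come from a trusted schema.
    if kind == "infill":
        column, key_column = update["column"], update["key_column"]
        # Data infilling: write a value only where the existing cell is empty.
        conn.execute(
            f'UPDATE "{table}" SET "{column}" = ? '
            f'WHERE "{key_column}" = ? AND "{column}" IS NULL',
            (update["value"], update["key"]),
        )
    elif kind == "populate_row":
        # Row population: insert a new record using the existing columns.
        columns = list(update["row"])
        column_sql = ", ".join(f'"{c}"' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        conn.execute(
            f'INSERT INTO "{table}" ({column_sql}) VALUES ({placeholders})',
            [update["row"][c] for c in columns],
        )
    elif kind == "add_column":
        # Column addition: extend the schema; existing rows get NULL.
        column, sql_type = update["column"], update["type"]
        conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{column}" {sql_type}')
    else:
        raise ValueError(f"unknown update kind: {kind}")


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE company (name TEXT PRIMARY KEY, ceo TEXT)")
    conn.execute("INSERT INTO company (name, ceo) VALUES ('Acme', NULL)")
    # Hypothetical values, standing in for what an extractor might return.
    updates = [
        {"kind": "infill", "table": "company", "column": "ceo",
         "key_column": "name", "key": "Acme", "value": "Jane Doe"},
        {"kind": "populate_row", "table": "company",
         "row": {"name": "Globex", "ceo": "John Roe"}},
        {"kind": "add_column", "table": "company",
         "column": "founded", "type": "INTEGER"},
    ]
    for update in updates:
        apply_update(conn, update)
    rows = conn.execute(
        "SELECT name, ceo, founded FROM company ORDER BY name"
    ).fetchall()
    conn.close()
    return rows


rows = demo()
```

In this sketch the infill only touches cells that are currently NULL, so existing database values are never overwritten; whether the actual task permits overwriting is not stated in the abstract.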
