Text Data Integration

Md Ataur Rahman; Dimitris Sacharidis; Oscar Romero; Sergi Nadal

arXiv:2603.27055·cs.CL·March 31, 2026

TL;DR

This paper discusses the importance and challenges of integrating unstructured textual data with structured data to enhance data processing and reasoning capabilities.

Contribution

The chapter makes the case for bringing textual data into data integration systems and reviews the current challenges, the state of the art, and open problems.

Findings

01 Most existing systems focus on structured data integration.

02 Unstructured text data contains valuable knowledge for integration.

03 Integrating text data poses unique challenges and opportunities.

Abstract

Data comes in many forms. From a shallow perspective, it can be viewed as being in either structured (e.g., as a relation, as key-value pairs) or unstructured (e.g., text, image) formats. So far, machines have been fairly good at processing and reasoning over structured data that follows a precise schema. However, the heterogeneity of data poses a significant challenge to how well diverse categories of data can be meaningfully stored and processed. Data Integration, a crucial part of the data engineering pipeline, addresses this by combining disparate data sources and providing unified data access to end-users. Until now, most data integration systems have focused only on combining structured data sources. Nevertheless, unstructured data (a.k.a. free text) also contains a plethora of knowledge waiting to be utilized. Thus, in this chapter, we first make the case for the integration…

Code & Models

Repositories

dtim-upc/THOR (GitHub)
