Text2Struct: A Machine Learning Pipeline for Mining Structured Data from Text

Chaochao Zhou; Bo Yang

arXiv:2212.09044·cs.IR·June 24, 2025

Open Access

TL;DR

Text2Struct is an end-to-end machine learning pipeline that extracts structured data, specifically the metrics and units associated with numerals, from unstructured text such as medical abstracts. It does so without relying on templates and reaches a Dice coefficient of 0.82 on test data.

Contribution

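The paper introduces a text annotation scheme and a complete pipeline (annotation, training data processing, and machine learning implementation) for mining structured data from text. The pipeline was trained and evaluated on annotated abstracts of medical publications on thrombectomy.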

Findings

01

Achieved a Dice coefficient of 0.82 on the test dataset.

02

In a random sample, most predicted relations closely matched the ground-truth annotations.

03

Demonstrated viability of extracting structured data without templates.

Abstract

Many analysis and prediction tasks require the extraction of structured data from unstructured text. However, no annotation scheme or training dataset has been available for training machine learning models to mine structured data from text without special templates and patterns. To address this gap, this paper presents an end-to-end machine learning pipeline, Text2Struct, comprising a text annotation scheme, training data processing, and a machine learning implementation. We formulated the mining problem as the extraction of metrics and units associated with numerals in the text. Text2Struct was trained and evaluated using an annotated text dataset collected from abstracts of medical publications regarding thrombectomy. In terms of prediction performance, a Dice coefficient of 0.82 was achieved on the test dataset. By random sampling, most predicted relations between numerals and…
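For readers unfamiliar with the reported metric, the Dice coefficient between a predicted set A and a ground-truth set B is 2|A ∩ B| / (|A| + |B|). The sketch below is only an illustration of that formula: the abstract does not say at what granularity the authors compute it, so the choice of (numeral, related token, role) triples as set items, the function name `dice_coefficient`, and the sample sentence content are all assumptions made here, not details from the paper.

```python
def dice_coefficient(predicted: set, truth: set) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|); treated as 1.0 when both sets are empty."""
    if not predicted and not truth:
        return 1.0
    overlap = len(predicted & truth)
    return 2.0 * overlap / (len(predicted) + len(truth))


# Hypothetical annotations: each item is (numeral, related token, role).
# These triples are invented for illustration and do not come from the paper's dataset.
truth = {
    ("90", "patients", "metric"),
    ("45", "minutes", "unit"),
    ("45", "time", "metric"),
}
predicted = {
    ("90", "patients", "metric"),
    ("45", "minutes", "unit"),
    ("45", "procedure", "metric"),  # wrong metric token predicted
}

score = dice_coefficient(predicted, truth)
print(round(score, 2))  # two of three items agree on each side: 2*2/(3+3) = 0.67
```

Under this set-based reading, a score of 0.82 would mean that the overlap between predicted and annotated items is large relative to their combined size, though the exact interpretation depends on how the authors define the items being compared.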

Taxonomy

Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Natural Language Processing Techniques

Methods: Test