Text2Struct: A Machine Learning Pipeline for Mining Structured Data from Text
Chaochao Zhou, Bo Yang

TL;DR
Text2Struct is an end-to-end machine learning pipeline that extracts structured data, specifically the metrics and units associated with numerals, from unstructured text such as medical abstracts. It reaches a Dice coefficient of 0.82 on test data without relying on templates.
Contribution
The paper introduces a novel annotation scheme and a complete pipeline for mining structured data from text, validated on medical abstracts with promising results.
Findings
Achieved a Dice coefficient of 0.82 on the test dataset.
Most predicted relations closely matched ground-truth annotations.
Demonstrated viability of extracting structured data without templates.
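For readers unfamiliar with the metric, the Dice coefficient between a predicted set A and a ground-truth set B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it over sets of (numeral, metric, unit) tuples. This page does not say at what granularity the paper computes Dice (tokens, spans, or relations), so the set-of-tuples formulation, the function name `dice_coefficient`, and the sample tuples are all assumptions made for illustration.

```python
def dice_coefficient(predicted, truth):
    """Set-based Dice: 2 * |A & B| / (|A| + |B|).

    Assumption: items are hashable, e.g. (numeral, metric, unit) tuples.
    The paper may compute Dice at a different granularity.
    """
    predicted, truth = set(predicted), set(truth)
    if not predicted and not truth:
        # Convention chosen here: two empty sets agree perfectly.
        return 1.0
    return 2 * len(predicted & truth) / (len(predicted) + len(truth))


# Hypothetical tuples, not taken from the paper's dataset.
predicted = {("85", "recanalization rate", "%"), ("60", "procedure time", "minutes")}
truth = {("85", "recanalization rate", "%"), ("60", "time to treatment", "minutes")}

score = dice_coefficient(predicted, truth)  # one shared tuple out of two per side
```

With one shared tuple and two tuples on each side, the score is 2 * 1 / (2 + 2) = 0.5.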
Abstract
Many analysis and prediction tasks require the extraction of structured data from unstructured texts. However, an annotation scheme and a training dataset have not been available for training machine learning models to mine structured data from text without special templates and patterns. To address this gap, this paper presents an end-to-end machine learning pipeline, Text2Struct, including a text annotation scheme, training data processing, and machine learning implementation. We formulated the mining problem as the extraction of metrics and units associated with numerals in the text. Text2Struct was trained and evaluated using an annotated text dataset collected from abstracts of medical publications regarding thrombectomy. In terms of prediction performance, a Dice coefficient of 0.82 was achieved on the test dataset. By random sampling, most predicted relations between numerals and…
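To make the problem formulation concrete, here is a minimal sketch of what the structured output might look like when each numeral is linked to a metric and a unit. The class name `NumeralRelation`, its fields, the helper `to_rows`, and the example sentence are hypothetical; they are not the paper's annotation scheme or data, only one plausible way to hold such relations.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass(frozen=True)
class NumeralRelation:
    """One numeral with its associated metric and unit (hypothetical schema)."""
    numeral: str
    metric: Optional[str]  # None when no metric is stated in the text
    unit: Optional[str]    # None for unitless numerals


def to_rows(relations: List[NumeralRelation]) -> List[dict]:
    """Flatten relations into table-like rows for downstream analysis."""
    return [asdict(r) for r in relations]


# Made-up sentence for illustration only:
# "Successful recanalization was achieved in 85 % of patients within 60 minutes."
relations = [
    NumeralRelation(numeral="85", metric="successful recanalization", unit="%"),
    NumeralRelation(numeral="60", metric=None, unit="minutes"),
]
rows = to_rows(relations)
```

Each row is a plain dictionary with the keys `numeral`, `metric`, and `unit`, so the result can be loaded directly into a table or data frame.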
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Natural Language Processing Techniques
Methods: Test
