TocBERT: Medical Document Structure Extraction Using Bidirectional Transformers
Majd Saleh, Sarra Baghdadi, Stéphane Paquelet

TL;DR
TocBERT is a bidirectional transformer-based method for medical document segmentation. It formulates the detection of titles and subtitles in clinical texts as a named entity recognition task and reaches F1-scores of 84.6% on linear segmentation and 72.8% on hierarchical segmentation.
Contribution
The paper introduces a supervised transformer-based approach to text segmentation in medical documents: Bio-ClinicalBERT is fine-tuned to detect titles and subtitles from their semantic representations.
Findings
Achieved 84.6% F1-score on linear segmentation
Achieved 72.8% F1-score on hierarchical segmentation
Outperformed rule-based methods in title and subtitle detection
Abstract
Text segmentation holds paramount importance in the field of Natural Language Processing (NLP). It plays an important role in several NLP downstream tasks like information retrieval and document summarization. In this work, we propose a new solution, namely TocBERT, for segmenting texts using bidirectional transformers. TocBERT represents a supervised solution trained on the detection of titles and sub-titles from their semantic representations. This task was formulated as a named entity recognition (NER) problem. The solution has been applied on a medical text segmentation use-case where the Bio-ClinicalBERT model is fine-tuned to segment discharge summaries of the MIMIC-III dataset. The performance of TocBERT has been evaluated on a human-labeled ground truth corpus of 250 notes. It achieved an F1-score of 84.6% when evaluated on a linear text segmentation problem and 72.8% on a…
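The NER formulation described in the abstract can be illustrated with a minimal sketch. It assumes a token-level BIO tagging scheme with the hypothetical labels `B-TITLE`, `I-TITLE`, `B-SUBTITLE`, `I-SUBTITLE`, and `O`; the abstract does not specify the tag set, and the function names and the toy note below are invented for illustration. The sketch only covers the post-processing step (turning predicted tags into segments), not the fine-tuned model itself.

```python
from typing import Dict, List, Tuple

# (kind, start token index, end token index, exclusive)
Heading = Tuple[str, int, int]


def extract_headings(labels: List[str]) -> List[Heading]:
    """Collapse BIO tags into heading spans.

    Assumed (hypothetical) tag set: B-/I-TITLE, B-/I-SUBTITLE, O.
    An I- tag with no open span of the same kind starts a new span.
    """
    headings: List[Heading] = []
    kind, start = None, None
    for i, label in enumerate(labels):
        prefix, _, name = label.partition("-")
        continues = prefix == "I" and name == kind
        if kind is not None and not continues:
            # The open span ends just before this token.
            headings.append((kind, start, i))
            kind, start = None, None
        if prefix == "B" or (prefix == "I" and kind is None):
            kind, start = name, i
    if kind is not None:
        headings.append((kind, start, len(labels)))
    return headings


def linear_segments(tokens: List[str], headings: List[Heading]) -> List[Dict[str, str]]:
    """Cut the token stream at every heading; each segment runs to the next heading.

    Text that precedes the first heading is not returned by this sketch.
    """
    segments = []
    for idx, (kind, start, end) in enumerate(headings):
        next_start = headings[idx + 1][1] if idx + 1 < len(headings) else len(tokens)
        segments.append({
            "kind": kind,
            "heading": " ".join(tokens[start:end]),
            "body": " ".join(tokens[end:next_start]),
        })
    return segments


# Toy input, invented for illustration (not taken from MIMIC-III).
tokens = ["Chief", "Complaint", ":", "chest", "pain",
          "History", "of", "Present", "Illness", ":",
          "Onset", ":", "two", "days", "ago"]
labels = ["B-TITLE", "I-TITLE", "O", "O", "O",
          "B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE", "O",
          "B-SUBTITLE", "O", "O", "O", "O"]

for segment in linear_segments(tokens, extract_headings(labels)):
    print(segment)
```

In this sketch every heading, title or subtitle, opens a new flat segment, which corresponds to a linear segmentation. A hierarchical view could be derived by nesting each `SUBTITLE` segment under the most recent `TITLE`, though how the paper builds its hierarchy is not stated in the abstract.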
Taxonomy
Topics: Handwritten Text Recognition Techniques · Biomedical Text Mining and Ontologies
