Leveraging sequence-to-sequence models for semantic annotation of Dutch pathology reports
M. Siepel, G.T.N. Burger, Q.J.M. Voorham, R. Cornet, I. Calixto, I. Vagliano

TL;DR
This paper explores using sequence-to-sequence transformer models to automatically annotate Dutch pathology reports, reporting good results on shorter reports and difficulties on more complex ones.
Contribution
The study introduces PaTh5.NL, a T5-based model pre-trained on Dutch pathology data from Palga, and evaluates whether constrained decoding improves annotation accuracy.
Findings
Fine-tuned PaTh5.NL models outperformed mT5 on shorter reports but struggled with complex texts.
Constrained decoding did not consistently improve patient retrieval despite higher BLEU scores.
Annotation quality declines with report complexity, especially in histology and autopsy reports.
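The second finding can seem counterintuitive, so here is a minimal sketch of how a token-overlap score and a retrieval outcome can diverge. It is not the paper's evaluation: the clipped unigram precision below is only one ingredient of BLEU, the `retrieved` criterion is an assumed stand-in for patient retrieval, and all terms are hypothetical English placeholders rather than Palga thesaurus codes.

```python
from collections import Counter
from typing import List


def ngram_precision(pred: List[str], ref: List[str], n: int = 1) -> float:
    """Clipped n-gram precision, the building block BLEU is based on."""
    pred_ngrams = [tuple(pred[i:i + n]) for i in range(len(pred) - n + 1)]
    ref_ngrams = [tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)]
    if not pred_ngrams:
        return 0.0
    remaining = Counter(ref_ngrams)  # each reference n-gram can be matched once
    hits = 0
    for gram in pred_ngrams:
        if remaining[gram] > 0:
            hits += 1
            remaining[gram] -= 1
    return hits / len(pred_ngrams)


def retrieved(pred: List[str], query_term: str) -> bool:
    """Assumed retrieval rule: a report is found if its annotation contains the term."""
    return query_term in pred


# Hypothetical annotations (placeholders, not real thesaurus terms).
reference = ["skin", "biopsy", "melanoma"]
pred_a = ["skin", "biopsy", "nevus"]      # high overlap, wrong diagnosis
pred_b = ["melanoma", "excision", "arm"]  # low overlap, right diagnosis

for name, pred in [("A", pred_a), ("B", pred_b)]:
    score = ngram_precision(pred, reference, n=1)
    print(name, round(score, 2), retrieved(pred, "melanoma"))
```

In this toy case prediction A scores higher on overlap yet would miss a query for the diagnosis term, while prediction B scores lower but would be retrieved.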
Abstract
Palga Foundation is responsible for indexing Dutch pathology data across the Netherlands, which relies on annotations of pathology reports. These annotations, derived from the conclusion text, consist of codes from the Palga thesaurus, serving patient care and scientific research. However, manual annotation by pathologists is both labor-intensive and prone to errors. Therefore, in this study, we seek to leverage sequence-to-sequence transformer models, particularly Text-To-Text Transfer Transformer (T5)-based models, to generate these annotations. Additionally, we investigate a constrained decoding (CD) approach that encodes domain knowledge. We compare a standard multilingual T5 model (mT5) with our own T5 model (PaTh5.NL) pre-trained using Palga data with the goal of better aligning the model's learned representations with the specific structure, terminology, and annotation…
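The abstract does not spell out how the constrained decoding approach works, so the following is only a generic sketch of the idea under stated assumptions: the decoder may emit a token only if it extends a sequence that is valid according to some domain knowledge. The allowed sequences, the `toy_scores` function, and all helper names are hypothetical; they stand in for the real thesaurus rules and for a model's next-token scores.

```python
from typing import Callable, Dict, List, Sequence

END = "<eos>"

# Hypothetical valid annotations (placeholders, not the Palga thesaurus).
ALLOWED_SEQUENCES = [
    ["skin", "biopsy", "melanoma"],
    ["skin", "biopsy", "nevus"],
    ["colon", "biopsy", "adenoma"],
]


def build_trie(sequences: Sequence[Sequence[str]]) -> Dict:
    """Store every allowed sequence (plus an end marker) in a nested-dict trie."""
    trie: Dict = {}
    for seq in sequences:
        node = trie
        for tok in list(seq) + [END]:
            node = node.setdefault(tok, {})
    return trie


def allowed_next(trie: Dict, prefix: Sequence[str]) -> List[str]:
    """Return the tokens that may follow `prefix`; empty if the prefix is invalid."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []
        node = node[tok]
    return sorted(node.keys())


def constrained_greedy_decode(
    trie: Dict,
    score_fn: Callable[[List[str], str], float],
    max_len: int = 10,
) -> List[str]:
    """Greedy decoding that only ever considers tokens the trie permits."""
    prefix: List[str] = []
    for _ in range(max_len):
        candidates = allowed_next(trie, prefix)
        if not candidates:
            break
        best = max(candidates, key=lambda tok: score_fn(prefix, tok))
        if best == END:
            break
        prefix.append(best)
    return prefix


def toy_scores(prefix: List[str], token: str) -> float:
    """Stand-in for model scores; a real system would query the decoder here."""
    prefs = {"skin": 0.9, "colon": 0.1, "biopsy": 0.8,
             "adenoma": 0.7, "melanoma": 0.6, "nevus": 0.3, END: 0.5}
    return prefs.get(token, 0.0)


trie = build_trie(ALLOWED_SEQUENCES)
print(constrained_greedy_decode(trie, toy_scores))
```

Note that `adenoma` has the highest toy score among the diagnosis terms, yet it is never considered after `skin`, `biopsy` because the trie does not permit it there. With Hugging Face `transformers`, a similar restriction can be passed to `generate` through its `prefix_allowed_tokens_fn` argument, operating on token IDs instead of strings; whether the authors used that mechanism is not stated here.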
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Machine Learning in Healthcare
