FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection

Jamil Zaghir; Mina Bjelogrlic; Jean-Philippe Goldman; Soukaïna Aananou; Christophe Gaudet-Blavignac; Christian Lovis

arXiv:2309.10770 · cs.CL · July 22, 2025 · 2 cites

Open Access

TL;DR

This paper presents a crosslingual BERT-based annotation projection method to efficiently generate high-quality annotated datasets for low-resource languages, demonstrated by creating the large French clinical NLP corpus FRASIMED.

Contribution

It introduces a novel crosslingual annotation projection approach using BERT, enabling the creation of large annotated datasets with minimal human effort.

Findings

01. High accuracy in dataset annotation demonstrated

02. Effective increase of low-resource corpora with minimal effort

03. FRASIMED is the largest open annotated French medical corpus

Abstract

Natural language processing (NLP) applications such as named entity recognition (NER) for low-resource corpora do not benefit from recent advances in the development of large language models (LLMs), where there is still a need for larger annotated datasets. This research article introduces a methodology for generating translated versions of annotated datasets through crosslingual annotation projection. Leveraging a language-agnostic BERT-based approach, it is an efficient solution for enlarging low-resource corpora with little human effort, using only open data resources that are already available. Quantitative and qualitative evaluations are often lacking when it comes to assessing the quality and effectiveness of semi-automatic data generation strategies. The evaluation of our crosslingual annotation projection approach showed both effectiveness and high accuracy in the resulting dataset. As…
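To make the idea of annotation projection concrete, here is a minimal sketch, not the authors' implementation. It assumes each token already has an embedding from some multilingual encoder; the hand-written vectors, the English-to-French toy sentence, the `SYMPTOM` label, and the helper names `align_tokens` and `project_span` are all hypothetical and chosen only for illustration. Each source token is aligned to its most similar target token by cosine similarity, and an annotated source span is mapped to the smallest target span covering the aligned tokens.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def align_tokens(src_vecs, tgt_vecs):
    """For each source token, return the index of the most similar target token."""
    return [
        max(range(len(tgt_vecs)), key=lambda j: cosine(src, tgt_vecs[j]))
        for src in src_vecs
    ]

def project_span(span, alignment):
    """Map a (start, end_exclusive, label) source span onto target token indices."""
    start, end, label = span
    targets = [alignment[i] for i in range(start, end)]
    # Smallest contiguous target span covering every aligned token.
    return (min(targets), max(targets) + 1, label)

# Toy data (hypothetical): stand-ins for contextual embeddings.
SRC_TOKENS = ["severe", "chest", "pain", "today"]
TGT_TOKENS = ["douleur", "thoracique", "sévère", "aujourd'hui"]
SRC_VECS = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
TGT_VECS = [[0.1, 0, 0.9, 0], [0, 0.9, 0.1, 0], [0.8, 0.1, 0, 0], [0, 0, 0.1, 0.9]]

alignment = align_tokens(SRC_VECS, TGT_VECS)
projected = project_span((0, 3, "SYMPTOM"), alignment)
start, end, label = projected
print(label, TGT_TOKENS[start:end])
```

A real pipeline would also need to handle subword tokens, unaligned words, and discontinuous target spans, all of which this sketch ignores.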

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification