Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario
Casimiro Pio Carrino, Jordi Armengol-Estapé, Asier Gutiérrez-Fandiño, Joan Llop-Palao, Marc Pàmies, Aitor Gonzalez-Agirre, Marta Villegas

TL;DR
This paper introduces the first biomedical and clinical transformer-based language models for Spanish, demonstrating that domain-specific pretraining significantly improves performance on biomedical NER tasks in a mid-resource setting.
Contribution
The paper presents Spanish biomedical and clinical language models built with domain-specific pretraining, including mixed-domain pretraining and cross-domain transfer, and shows that they outperform mBERT and BETO on NER tasks.
Findings
Models outperform mBERT and BETO in NER tasks
Domain-specific pretraining enhances downstream performance
Vocabulary choices significantly impact NER results
Abstract
This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices, such as masking at word and subword level, varying the vocabulary size and testing with domain data, looking for better language representations. Interestingly, in the absence of enough clinical data to train a model from scratch, we applied mixed-domain pretraining and cross-domain transfer approaches to generate a performant bio-clinical model suitable for real-world clinical data. We evaluated our models on Named Entity Recognition (NER) tasks for biomedical documents and challenging hospital discharge reports. When compared against the competitive mBERT and BETO models, we outperform them in all NER tasks by a significant margin. Finally, we studied the impact of the model's vocabulary on the NER performances by offering an interesting vocabulary-centric…
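The abstract contrasts masking at the word level with masking at the subword level. The sketch below is only an illustration of that distinction, not the authors' implementation: it assumes a WordPiece-style convention in which a leading "##" marks a continuation piece, and the example tokenization, the 15% masking ratio, and the function names are all hypothetical.

```python
import random

MASK = "[MASK]"


def group_into_words(tokens):
    """Group token indices into words; a '##' prefix marks a continuation piece.

    The '##' convention is an assumption made for this illustration.
    """
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)  # continuation: attach to the current word
        else:
            groups.append([i])  # start of a new word
    return groups


def subword_mask(tokens, ratio=0.15, seed=0):
    """Mask individual subword tokens, ignoring word boundaries."""
    rng = random.Random(seed)
    n = max(1, round(len(tokens) * ratio))
    chosen = set(rng.sample(range(len(tokens)), n))
    return [MASK if i in chosen else t for i, t in enumerate(tokens)]


def whole_word_mask(tokens, ratio=0.15, seed=0):
    """Mask whole words: every piece of a selected word is masked together."""
    rng = random.Random(seed)
    groups = group_into_words(tokens)
    n = max(1, round(len(groups) * ratio))
    chosen = {i for group in rng.sample(groups, n) for i in group}
    return [MASK if i in chosen else t for i, t in enumerate(tokens)]


# Hypothetical tokenization of a short clinical phrase.
tokens = ["el", "paciente", "presenta", "hiper", "##tens", "##ión", "arterial"]
print(subword_mask(tokens))
print(whole_word_mask(tokens))
```

With subword masking, a single piece such as "##tens" can be hidden while its neighbours stay visible, so the model may recover it from the rest of the word. With whole-word masking, all pieces of a selected word are hidden at once, which forces the prediction to rely on the surrounding context.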
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: mBERT
