MedPT: A Massive Medical Question Answering Dataset for Brazilian-Portuguese Speakers
Fernanda Bufon Färber, Iago Alves Brito, Julia Soares Dollis, Pedro Schindler Freire Brasil Ribeiro, Rafael Teixeira Sousa, Arlindo Rodrigues Galvão Filho

TL;DR
MedPT is a large, curated dataset of Brazilian Portuguese medical questions and answers, intended to support healthcare language models for an underrepresented language while preserving its clinical and cultural nuances.
Contribution
This paper introduces MedPT, the first extensive Brazilian Portuguese medical QA dataset, with multi-stage curation and semantic annotation, advancing language model development for healthcare in low-resource languages.
Findings
Achieved 94% F1-score in medical specialty classification
Demonstrated dataset's semantic richness through error analysis
Enabled development of culturally-aware medical NLP models
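As a reminder of what an F1-score on a multi-class task such as specialty classification measures, here is a minimal sketch. The summary above does not say which averaging scheme the reported 94% uses, so macro averaging is an assumption here. The function name `macro_f1` and the example labels are hypothetical and not taken from MedPT.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores.

    Assumption: macro averaging is used for illustration only; the paper
    summary does not state which averaging the reported score uses.
    """
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical specialty labels, for illustration only.
y_true = ["cardiology", "dermatology", "cardiology", "neurology"]
y_pred = ["cardiology", "dermatology", "neurology", "neurology"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}")
```

Macro averaging weights every specialty equally, so rare specialties affect the score as much as common ones. A micro or weighted average would instead favor the most frequent classes.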
Abstract
While large language models (LLMs) show transformative potential in healthcare, their development remains focused on high-resource languages. This creates a critical barrier for other languages, as simple translation fails to capture unique clinical and cultural nuances, such as endemic diseases. To address this, we introduce MedPT, the first large-scale, real-world corpus of patient-doctor interactions for the Brazilian Portuguese medical domain. The dataset comprises 384,095 authentic question-answer pairs and covers over 3,200 distinct health-related conditions. It was refined through a rigorous multi-stage curation protocol that employed a hybrid quantitative-qualitative analysis to filter noise and contextually enrich thousands of ambiguous queries, resulting in a corpus of approximately 57 million tokens. We further utilize LLM-driven annotation to classify queries into seven…
Taxonomy
Topics: Machine Learning in Healthcare · Topic Modeling · Artificial Intelligence in Healthcare and Education
