MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning
Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Benoit Favre, Richard Dufour

TL;DR
This paper introduces MedInjection-FR, a large French biomedical instruction dataset, and systematically evaluates how native, synthetic, and translated data sources impact instruction tuning of LLMs, highlighting the importance of data authenticity and diversity.
Contribution
The study analyzes how data provenance affects biomedical instruction tuning by fine-tuning Qwen-4B-Instruct across seven configurations of native, synthetic, and translated data, and shows that combining sources is beneficial.
Findings
Native data yields the best performance.
Mixed data sources, especially native and translated, are complementary.
Synthetic data alone is less effective but beneficial when combined with native data.
Abstract
Instruction tuning has become essential for adapting large language models (LLMs) to follow domain-specific prompts. Yet, in specialized fields such as medicine, the scarcity of high-quality French instruction data limits effective supervision. To address this gap, we introduce MedInjection-FR, a large-scale French biomedical instruction dataset comprising 571K instruction-response pairs drawn from three complementary sources: native, synthetic, and translated data. We design a controlled experimental framework to systematically assess how data provenance affects instruction tuning, using Qwen-4B-Instruct fine-tuned across seven configurations combining these sources. Results show that native data yield the strongest performance, while mixed setups, particularly native and translated, provide complementary benefits. Synthetic data alone remains less effective but contributes positively…
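The abstract mentions seven configurations built from three data sources. One natural reading, which is an assumption here and not something the abstract states, is that these are the seven non-empty subsets of {native, synthetic, translated}. A minimal Python sketch under that assumption (the function name and source labels are hypothetical, chosen for illustration):

```python
from itertools import combinations

# Labels for the three data sources named in the abstract.
SOURCES = ("native", "synthetic", "translated")


def source_configurations(sources=SOURCES):
    """Return every non-empty subset of the sources, smallest subsets first.

    Assumption: the paper's seven configurations are exactly these subsets.
    """
    return [
        combo
        for size in range(1, len(sources) + 1)
        for combo in combinations(sources, size)
    ]


configs = source_configurations()
for combo in configs:
    print("+".join(combo))
```

With three sources this enumeration gives 2^3 - 1 = 7 configurations: three single-source setups, three pairs, and one mix of all three.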
Taxonomy
Topics: Text Readability and Simplification · Topic Modeling · Artificial Intelligence in Healthcare and Education
