MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning

Ikram Belmadani; Oumaima El Khettari; Pacôme Constant dit Beaufils; Benoit Favre; Richard Dufour

arXiv:2603.06905 · cs.CL · March 10, 2026

Open Access · 6 Models · 1 Dataset

TL;DR

This paper introduces MedInjection-FR, a large French biomedical instruction dataset, and systematically evaluates how native, synthetic, and translated data sources impact instruction tuning of LLMs, highlighting the importance of data authenticity and diversity.

Contribution

The study provides a comprehensive analysis of data provenance effects on biomedical instruction tuning, emphasizing the benefits of combining native, synthetic, and translated data sources.

Findings

01 Native data yields the best performance.

02 Mixed data sources, especially native and translated, are complementary.

03 Synthetic data alone is less effective but beneficial when combined with native data.

Abstract

Instruction tuning has become essential for adapting large language models (LLMs) to follow domain-specific prompts. Yet, in specialized fields such as medicine, the scarcity of high-quality French instruction data limits effective supervision. To address this gap, we introduce MedInjection-FR, a large-scale French biomedical instruction dataset comprising 571K instruction-response pairs drawn from three complementary sources: native, synthetic, and translated data. We design a controlled experimental framework to systematically assess how data provenance affects instruction tuning, using Qwen-4B-Instruct fine-tuned across seven configurations combining these sources. Results show that native data yield the strongest performance, while mixed setups, particularly native and translated, provide complementary benefits. Synthetic data alone remains less effective but contributes positively…
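
The abstract mentions seven fine-tuning configurations built from three data sources. One reading that matches this count is every non-empty combination of the three sources (2^3 - 1 = 7); the abstract does not spell out the exact configurations, so this is an assumption. A minimal sketch of that enumeration, with the hypothetical helper name `source_configurations`:

```python
from itertools import combinations

# The three data sources named in the abstract.
SOURCES = ("native", "synthetic", "translated")


def source_configurations(sources=SOURCES):
    """Return every non-empty combination of the given data sources.

    Assumption: the paper's seven configurations are exactly these
    subsets; the abstract only states the count, not the list.
    """
    configs = []
    for size in range(1, len(sources) + 1):
        configs.extend(combinations(sources, size))
    return configs


if __name__ == "__main__":
    # Print one configuration per line, e.g. "native + translated".
    for config in source_configurations():
        print(" + ".join(config))
```

Under this reading there are three single-source setups, three two-source mixes, and one setup combining all three sources.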

Code & Models

Models

Datasets

MedInjection/ALL · dataset · 22 downloads

Taxonomy

Topics: Text Readability and Simplification · Topic Modeling · Artificial Intelligence in Healthcare and Education