Multi-RADS Synthetic Radiology Report Dataset and Head-to-Head Benchmarking of 41 Open-Weight and Proprietary Language Models

Kartik Bose; Abhinandan Kumar; Raghuraman Soundararajan; Priya Mudgil; Samonee Ralmilay; Niharika Dutta; Manphool Singhal; Arun Kumar; Saugata Sen; Anurima Patra; Priya Ghosh; Abanti Das; Amit Gupta; Ashish Verma; Dipin Sudhakaran; Ekta Dhamija; Himangi Unde; Ishan Kumar; Krithika Rangarajan; Prerna Garg; Rachel Sequeira; Sudhin Shylendran; Taruna Yadav; Tej Pal; Pankaj Gupta

arXiv:2601.03232·cs.CL·January 7, 2026

Open Access · 1 dataset

TL;DR

This study introduces RXL-RADSet, a radiologist-verified synthetic radiology report dataset, and benchmarks 41 language models, showing that large open-weight models (20-32B) approach proprietary-model performance in RADS classification and that guided prompting improves accuracy.

Contribution

The paper creates a validated synthetic multi-RADS dataset and systematically compares open-weight language models with a proprietary model for radiology report classification.

Findings

01. GPT-5.2 achieved 99.8% validity and 81.1% accuracy.

02. Large open-weight models (20-32B) approach proprietary model performance.

03. Model performance improves with size, and guided prompting enhances accuracy.
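
The findings above report two separate metrics, validity and accuracy. The summary does not define them, so the following is only a minimal sketch under stated assumptions: validity is taken to be the share of model outputs that parse to an allowed category, and accuracy the share of all reports whose parsed category matches the reference label (invalid outputs counted as wrong). The label set `ALLOWED_LABELS` and the helpers `parse_label` and `score` are hypothetical names for illustration, not the paper's evaluation code.

```python
from typing import Optional

# Hypothetical allowed labels for a single RADS framework (illustration only).
ALLOWED_LABELS = {"1", "2", "3", "4", "5"}


def parse_label(raw_output: str) -> Optional[str]:
    """Return the label if the output is exactly one allowed label, else None."""
    label = raw_output.strip()
    return label if label in ALLOWED_LABELS else None


def score(outputs: list[str], references: list[str]) -> tuple[float, float]:
    """Return (validity, accuracy) as fractions of all reports.

    Assumption: invalid outputs count as incorrect when computing accuracy.
    """
    if len(outputs) != len(references) or not outputs:
        raise ValueError("outputs and references must be non-empty and equal length")
    parsed = [parse_label(o) for o in outputs]
    valid = sum(p is not None for p in parsed)
    correct = sum(p == r for p, r in zip(parsed, references))
    n = len(outputs)
    return valid / n, correct / n


validity, accuracy = score(["4", " 2 ", "probably benign", "5"], ["4", "3", "3", "5"])
print(f"validity={validity:.2f} accuracy={accuracy:.2f}")
```

Under these assumptions a model can score high on validity while still being wrong about the category, which is why the two numbers are reported separately.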

Abstract

Background: Reporting and Data Systems (RADS) standardize radiology risk communication, but automated RADS assignment from narrative reports is challenging because of guideline complexity, output-format constraints, and limited benchmarking across RADS frameworks and model sizes.

Purpose: To create RXL-RADSet, a radiologist-verified synthetic multi-RADS benchmark, and to compare the validity and accuracy of open-weight small language models (SLMs) with those of a proprietary model for RADS assignment.

Materials and Methods: RXL-RADSet contains 1,600 synthetic radiology reports across 10 RADS (BI-RADS, CAD-RADS, GB-RADS, LI-RADS, Lung-RADS, NI-RADS, O-RADS, PI-RADS, TI-RADS, VI-RADS) and multiple modalities. Reports were generated by LLMs using scenario plans and simulated radiologist styles and underwent two-stage radiologist verification. We evaluated 41 quantized SLMs (12 families, 0.135-32B…

Code & Models

Datasets

RadioX-Labs/RADSet
dataset · 5 dl

Taxonomy

Topics: Radiology practices and education · Artificial Intelligence in Healthcare and Education · Topic Modeling