Decoding AI Competence: Benchmarking Large Language Models (LLMs) in Ovarian Cancer Diagnosis and Treatment—A Systematic Evaluation of Generative AI Accuracy and Completeness

Haojie Cai; Chao Wang; Yue Zhang; Hui Ding; Wei Hong; Yaqian Zhao; Shanshan Cheng; Yu Wang

PMC · DOI: 10.3390/diagnostics16040616 · February 20, 2026

Open Access

TL;DR

This study compares DeepSeek-R1 and Doubao-1.5-pro on 20 ovarian cancer diagnosis and treatment questions scored by five gynecologic oncology chief physicians. DeepSeek-R1 earned far more 'Excellent' ratings (98 vs. 41, out of 100 scores per model).

Contribution

The study introduces a systematic evaluation framework for assessing large language models' performance in ovarian cancer management.

Findings

01

Of the 100 individual scores given to each model, 98 of DeepSeek-R1's were rated 'Excellent', compared with 41 of Doubao-1.5-pro's.

02

All 20 DeepSeek-R1 responses had a mean score rated 'Excellent', compared with only 9 of the 20 Doubao-1.5-pro responses.

03

DeepSeek-R1's scores varied less and were higher than Doubao-1.5-pro's in all four question categories.

Abstract

Objective: To evaluate the practical value of DeepSeek-R1 and Doubao-1.5-pro in ovarian cancer management by examining their diagnostic and treatment-related competencies. Methods: Twenty key ovarian cancer diagnosis and treatment questions were identified and divided into 4 domains of 5 questions each. Both large language models (LLMs) answered every question, and 5 gynecologic oncology chief physicians scored each answer on a 1–10 scale for completeness and accuracy. Any individual score, and any per-question mean score, above 7 was rated “Excellent.” The Kruskal–Wallis test compared scores across the 4 domains within each LLM, and the Mann–Whitney–Wilcoxon test compared scores between the two LLMs within each domain. Results: 200 scores were collected (100 per model). DeepSeek-R1 received 98 “Excellent” ratings, while Doubao-1.5-pro received 41. All 20 DeepSeek-R1 responses…
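The "Excellent" rule described in the Methods can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `tally_excellent`, the data layout, and the toy scores are all hypothetical, and the only detail taken from the abstract is that a score (or a per-question mean) above 7 counts as "Excellent."

```python
from statistics import mean

# From the abstract: a score strictly above 7 is rated "Excellent".
EXCELLENT_THRESHOLD = 7


def tally_excellent(scores_by_question):
    """Count "Excellent" ratings for one model.

    scores_by_question maps a question id to the list of reviewer
    scores (1-10) for that question. This layout is an assumption
    made for illustration only.
    Returns (individual_excellent, questions_excellent_by_mean).
    """
    individual = sum(
        1
        for scores in scores_by_question.values()
        for s in scores
        if s > EXCELLENT_THRESHOLD
    )
    by_mean = sum(
        1
        for scores in scores_by_question.values()
        if mean(scores) > EXCELLENT_THRESHOLD
    )
    return individual, by_mean


# Made-up toy scores from five reviewers; NOT data from the study.
toy = {
    "Q1": [8, 9, 8, 7, 9],  # four scores above 7; mean 8.2 is above 7
    "Q2": [7, 6, 8, 7, 7],  # one score above 7; mean 7.0 is not above 7
}
print(tally_excellent(toy))  # (5, 1)
```

With the study's design (20 questions, 5 reviewers), the first number would range from 0 to 100 and the second from 0 to 20 per model. The significance tests named in the abstract are available in SciPy as `scipy.stats.kruskal` and `scipy.stats.mannwhitneyu`, although the abstract does not say which software the authors used.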

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI) · AI in cancer detection