Decoding AI Competence: Benchmarking Large Language Models (LLMs) in Ovarian Cancer Diagnosis and Treatment—A Systematic Evaluation of Generative AI Accuracy and Completeness
Haojie Cai, Chao Wang, Yue Zhang, Hui Ding, Wei Hong, Yaqian Zhao, Shanshan Cheng, Yu Wang

TL;DR
This study compares two large language models, DeepSeek-R1 and Doubao-1.5-pro, on ovarian cancer diagnosis and treatment questions and finds that DeepSeek-R1 outperforms Doubao-1.5-pro.
Contribution
The study introduces a systematic evaluation framework for assessing large language models' performance in ovarian cancer management.
Findings
Of 100 individual scores per model, DeepSeek-R1 received 98 'Excellent' ratings compared to 41 for Doubao-1.5-pro.
By mean score per question, all 20 DeepSeek-R1 responses were rated 'Excellent', compared to only 9 of the Doubao-1.5-pro responses.
DeepSeek-R1 showed less variability and outperformed Doubao-1.5-pro in all categories.
Abstract
Objective: To evaluate the practical value of DeepSeek-R1 and Doubao-1.5-pro in ovarian cancer management by examining their diagnostic and treatment-related competencies. Methods: Twenty key questions on ovarian cancer diagnosis and treatment were identified and divided into 4 domains of 5 questions each. The two large language models (LLMs) answered these questions, and 5 gynecologic oncology chief physicians scored each answer on a 1–10 scale for completeness and accuracy. An individual score, or a question's mean score, above 7 was rated “Excellent.” The Kruskal–Wallis test compared scores across the 4 categories within each LLM, and the Mann–Whitney–Wilcoxon test compared scores between the two LLMs in each category. Results: A total of 200 scores were collected (100 per model). DeepSeek-R1 received 98 “Excellent” ratings, while Doubao-1.5-pro received 41. All 20 DeepSeek-R1 responses…
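The rating rule in the Methods can be illustrated with a minimal Python sketch. It assumes that "surpassed 7" means strictly greater than 7, and it uses made-up scores for two questions rather than data from the study; the names `is_excellent` and `summarize` are hypothetical and do not come from the paper.

```python
from statistics import mean

# Per the abstract, a score that surpasses 7 is rated "Excellent".
# Assumption: "surpasses" is read as strictly greater than 7.
EXCELLENT_THRESHOLD = 7


def is_excellent(score: float) -> bool:
    """Return True if a single score or a mean score counts as 'Excellent'."""
    return score > EXCELLENT_THRESHOLD


def summarize(scores_by_question: dict[str, list[float]]) -> tuple[int, int]:
    """Count 'Excellent' ratings two ways.

    Returns (number of individual scores above the threshold,
             number of questions whose mean score is above the threshold).
    """
    individual = sum(
        is_excellent(score)
        for scores in scores_by_question.values()
        for score in scores
    )
    by_mean = sum(
        is_excellent(mean(scores)) for scores in scores_by_question.values()
    )
    return individual, by_mean


# Hypothetical ratings: two questions, five raters each (not study data).
example = {
    "Q1": [8, 9, 7, 8, 10],  # mean 8.4
    "Q2": [6, 7, 8, 5, 7],   # mean 6.6
}

individual_excellent, questions_excellent = summarize(example)
print(individual_excellent, questions_excellent)  # prints: 5 1
```

In the study's design, the same counting would run over 20 questions with 5 raters each, giving 100 individual scores and 20 mean scores per model.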
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI) · AI in cancer detection
