# Decoding AI Competence: Benchmarking Large Language Models (LLMs) in Ovarian Cancer Diagnosis and Treatment—A Systematic Evaluation of Generative AI Accuracy and Completeness

**Authors:** Haojie Cai, Chao Wang, Yue Zhang, Hui Ding, Wei Hong, Yaqian Zhao, Shanshan Cheng, Yu Wang

PMC · DOI: 10.3390/diagnostics16040616 · 2026-02-20

## TL;DR

This study compares two large language models, DeepSeek-R1 and Doubao-1.5-pro, on 20 ovarian cancer diagnosis and treatment questions rated by five gynecologic oncology chief physicians, and finds that DeepSeek-R1 scored significantly higher.

## Contribution

The study introduces a systematic evaluation framework for assessing large language models' performance in ovarian cancer management.

## Key findings

- Of 100 physician scores per model, DeepSeek-R1 received 98 'Excellent' ratings (score above 7 on a 1–10 scale), compared with 41 for Doubao-1.5-pro.
- All 20 DeepSeek-R1 responses had a mean score rated 'Excellent', compared with only 9 of 20 for Doubao-1.5-pro.
- DeepSeek-R1 showed less score variability and outperformed Doubao-1.5-pro in all categories.
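
As a quick check on the arithmetic behind these findings, the reported counts can be turned into proportions. This is a minimal sketch, not the authors' analysis: the counts come from the summary above and the abstract (100 scores per model, i.e. 20 questions times 5 raters), while the variable and function names are illustrative assumptions.

```python
# Counts as reported in this summary; names below are illustrative only.
QUESTIONS = 20
RATERS = 5
SCORES_PER_MODEL = QUESTIONS * RATERS  # 100 scores per model

reported = {
    "DeepSeek-R1": {"excellent_scores": 98, "excellent_questions": 20},
    "Doubao-1.5-pro": {"excellent_scores": 41, "excellent_questions": 9},
}


def excellent_rates(counts):
    """Return (share of individual scores, share of question means) rated Excellent."""
    return (
        counts["excellent_scores"] / SCORES_PER_MODEL,
        counts["excellent_questions"] / QUESTIONS,
    )


rates = {model: excellent_rates(counts) for model, counts in reported.items()}

for model, (score_rate, question_rate) in rates.items():
    print(f"{model}: {score_rate:.0%} of scores, {question_rate:.0%} of questions Excellent")
```

Expressed this way, the gap is 98% versus 41% at the level of individual scores and 100% versus 45% at the level of per-question means.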

## Abstract

Objective: To evaluate the practical value of DeepSeek-R1 and Doubao-1.5-pro in ovarian cancer management by examining their diagnostic and treatment-related competencies.

Methods: Twenty key questions on ovarian cancer diagnosis and treatment were identified and divided into 4 domains of 5 questions each. Two large language models (LLMs) answered these questions, and 5 gynecologic oncology chief physicians rated each answer on a 1–10 scale for completeness and accuracy. An individual score, or the mean score for a question, was rated “Excellent” if it exceeded 7. The Kruskal–Wallis test compared scores across the 4 domains within each LLM, and the Mann–Whitney–Wilcoxon test compared scores between the two LLMs within each domain.

Results: A total of 200 scores were collected (100 per model). DeepSeek-R1 received 98 “Excellent” ratings, compared with 41 for Doubao-1.5-pro. All 20 DeepSeek-R1 responses had “Excellent” mean scores, compared with 9 for Doubao-1.5-pro. DeepSeek-R1 scores also showed less variability. Statistical tests showed significant differences between the models, with DeepSeek-R1 outperforming Doubao-1.5-pro, and charts showed that Doubao-1.5-pro scored lower in all aspects, especially “Medical”.

Conclusions: DeepSeek-R1 shows potential in ovarian cancer diagnosis and treatment but has limitations such as inaccuracies and overly technical responses, attributed to outdated data and a lack of humanistic elements. LLMs like DeepSeek-R1 are useful for medical education and assistive diagnosis, but they require ongoing updates and refinement for broader clinical use. Selecting the appropriate LLM for medical tasks and improving the clarity and accuracy of LLM responses are crucial for their future effectiveness.
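
The scoring rule and the between-model comparison described in the abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the ratings below are hypothetical, the function names are invented for this example, and the abstract does not say which software was used or how ties were handled. The sketch computes only the Mann–Whitney U statistic with average ranks for ties; in practice a library routine such as `scipy.stats.mannwhitneyu` (and `scipy.stats.kruskal` for the within-model comparison) would also supply p-values.

```python
from statistics import mean

EXCELLENT_THRESHOLD = 7  # a score or mean must strictly exceed 7 to count as "Excellent"


def is_excellent(value):
    return value > EXCELLENT_THRESHOLD


def summarize(ratings):
    """ratings maps question id -> list of rater scores on the 1-10 scale.

    Returns (number of Excellent individual scores, number of questions
    whose mean score is Excellent).
    """
    excellent_scores = sum(
        is_excellent(score) for scores in ratings.values() for score in scores
    )
    excellent_questions = sum(is_excellent(mean(scores)) for scores in ratings.values())
    return excellent_scores, excellent_questions


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """U statistic for sample a versus sample b (no p-value is computed)."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[: len(a)])
    return rank_sum_a - len(a) * (len(a) + 1) / 2


def flatten(ratings):
    return [score for scores in ratings.values() for score in scores]


# Hypothetical ratings: two questions, five raters each (not data from the paper).
model_a = {"Q1": [9, 8, 9, 10, 8], "Q2": [8, 8, 7, 9, 8]}
model_b = {"Q1": [7, 6, 8, 7, 6], "Q2": [8, 7, 7, 6, 7]}

print(summarize(model_a))  # (9, 2)
print(summarize(model_b))  # (2, 0)
print(mann_whitney_u(flatten(model_a), flatten(model_b)))
```

Note that in the hypothetical `model_b`, question Q2 has a mean of exactly 7 and is therefore not counted as “Excellent”, because the rule as stated requires the score to surpass 7.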

## Linked entities

- **Diseases:** ovarian cancer (MONDO:0005140)

## Full-text entities

- **Genes:** BRCA1 (BRCA1 DNA repair associated) [NCBI Gene 672] {aka BRCAI, BRCC1, BROVCA1, FANCS, IRIS, PNCA4}, TBCE (tubulin folding cofactor E) [NCBI Gene 6905] {aka HRD, KCS, KCS1, PEAMO, pac2}, MUC16 (mucin 16, cell surface associated) [NCBI Gene 94025] {aka CA125}
- **Diseases:** bloating (MESH:C535647), neuropathy (MESH:D009422), negative (MESH:D064726), depression (MESH:D003866), breast cancer (MESH:D001943), Ovarian Cancer (MESH:D010051), Li-Fraumeni syndrome (MESH:D016864), vaginal dryness (MESH:D014627), cardiovascular disease (MESH:D002318), gynecologic malignancies (MESH:D005833), dyspareunia (MESH:D004414), male breast cancer (MESH:D018567), weight loss (MESH:D015431), Med-PaLM 2 (MESH:C535620), osteoporosis (MESH:D010024), ascites (MESH:D001201), hallucinations (MESH:D006212), death (MESH:D003643), III (MESH:C537189), hip osteoarthritis (MESH:D015207), triple (MESH:C536008), germ cell tumors (MESH:D009373), LBP (MESH:D017116), pelvic mass (MESH:C536030), Lynch syndrome (MESH:D003123), pleural effusion (MESH:D010996), LLMs (MESH:D007806), obesity (MESH:D009765), fatigue (MESH:D005221), dyspnea (MESH:D004417), Tumor (MESH:D009369), anxiety (MESH:D001007), bowel obstruction (MESH:D012778), injury to (MESH:D014947), hereditary cancer syndromes (MESH:D009386), pain (MESH:D010146)
- **Chemicals:** alcohol (MESH:D000438), Bisphosphonates (MESH:D004164), glucose (MESH:D005947), calcium (MESH:D002118), lipids (MESH:D008055), DeepSeek (-), denosumab (MESH:D000069448), zoledronate (MESH:D000077211), platinum (MESH:D010984), vitamin D (MESH:D014807), pro (MESH:D011392)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12939671/full.md

---
Source: https://tomesphere.com/paper/PMC12939671