# Evaluation of the accuracy and repeatability of Deepseek V3, Doubao, and Kimi1.5 in answering knowledge-related queries about chronic non-bacterial osteitis

**Authors:** Zhenxing Zhu, Jun Xie, Longxin Zhou, Chaoran Yang, Feng Li

PMC · DOI: 10.3389/frai.2025.1629149 · Frontiers in Artificial Intelligence · 2025-09-29

## TL;DR

This study compares how accurately and consistently three Chinese AI models answer questions about chronic non-bacterial osteitis, a bone condition.

## Contribution

The study evaluates the performance of three Chinese AI models in answering medical questions about chronic non-bacterial osteitis using expert assessments.

## Key findings

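- Doubao had the shortest response time and the longest answers by word count, and was the only model to receive "Completely incorrect" ratings (6.25%, in the third round as scored by Reviewer 2).
- Kimi1.5 had the highest mean score in the first and third rounds from both experts; Doubao had the highest in the second round.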
- All three models showed good accuracy and reproducibility with no significant differences overall.

## Abstract

There are significant differences in the diagnosis and treatment of chronic non-bacterial osteitis (CNO), and there is an urgent need for health education efforts to enhance awareness of this condition. Deepseek V3, Doubao, and Kimi1.5 are highly popular language models in China that can provide knowledge related to diseases. This article aims to investigate the accuracy and reproducibility of the responses provided by these three artificial intelligence (AI) language models in answering questions about CNO.

According to the latest expert consensus, 16 questions related to CNO were collected. The three AI language models were separately asked these questions at three different times. The answers were independently evaluated by two orthopedic experts.

Across three rounds of testing on the 16 CNO-related questions, Doubao was the only model to receive “Completely incorrect” ratings (6.25%), which occurred in the third round of scoring by Reviewer 2. Doubao also had the shortest response time and provided the most words in its answers. In scoring by the first expert, Kimi1.5 scored highest in the first and third rounds (3.938 ± 0.342 and 3.875 ± 0.873, respectively), while Doubao scored highest in the second round (3.875 ± 0.5). In scoring by the second expert, Kimi1.5 likewise scored highest in the first and third rounds (3.812 ± 0.602 and 3.812 ± 0.704, respectively), and Doubao scored highest in the second round (3.812 ± 0.403).
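To make the "mean ± SD" figures and the 6.25% share concrete, the following is a minimal sketch of how such per-round summaries could be computed. It assumes a 4-point rating scale on which 1 stands for "Completely incorrect", and it assumes the reported spread is a sample standard deviation; neither detail is stated in this summary. The rating lists are hypothetical illustrations, not the study's data, and `summarize` and `share_of` are names invented for this example.

```python
from statistics import mean, stdev


def summarize(ratings):
    """Return (mean, sample standard deviation), each rounded to 3 decimals."""
    return round(mean(ratings), 3), round(stdev(ratings), 3)


def share_of(ratings, value):
    """Return the fraction of ratings equal to `value`."""
    return sum(1 for r in ratings if r == value) / len(ratings)


# Hypothetical ratings for 16 questions in one round (assumed 1-4 scale).
round_a = [4] * 15 + [2]
# Hypothetical round in which one answer is rated 1 ("Completely incorrect").
round_b = [4] * 15 + [1]

print(summarize(round_a))    # (3.875, 0.5)
print(share_of(round_b, 1))  # 0.0625, i.e. 6.25% of 16 answers
```

Under these assumptions, a single lowest rating among 16 answers is exactly 6.25% of a round.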

Deepseek V3, Doubao, and Kimi1.5 are capable of answering most questions related to CNO with good accuracy and reproducibility, showing no significant differences.
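One simple way to quantify reproducibility across repeated rounds is the fraction of questions that receive the same rating every time. The sketch below illustrates that idea only; this summary does not say which reproducibility measure or significance test the authors used, so the metric, the function name `reproducibility`, and the ratings are all assumptions made for illustration.

```python
def reproducibility(rounds):
    """Fraction of questions whose rating is identical in every round.

    `rounds` is a list of rating lists, one list per round, aligned by question.
    """
    per_question = list(zip(*rounds))  # regroup ratings question by question
    stable = sum(1 for scores in per_question if len(set(scores)) == 1)
    return stable / len(per_question)


# Hypothetical ratings for four questions over three rounds (not study data).
rounds = [
    [4, 4, 3, 4],  # round 1
    [4, 4, 3, 3],  # round 2
    [4, 4, 3, 4],  # round 3
]

print(reproducibility(rounds))  # 0.75: three of four questions rated identically
```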

## Full-text entities

- **Diseases:** CNO (MESH:D010000)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12515971/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12515971/full.md

## References

26 references — full list in the complete paper: https://tomesphere.com/paper/PMC12515971/full.md

---
Source: https://tomesphere.com/paper/PMC12515971