# Developing a Quality Evaluation Index System for Health Conversational Artificial Intelligence: Mixed Methods Study

**Authors:** Weizhen Liao, Meng Li, Chengyu Ma, Youli Han, Dan Wang, Haopeng Liu, Yi Wang, Zijie Feng, Huichao Wang, Yiru Guan

PMC · DOI: 10.2196/83188 · Journal of Medical Internet Research · 2026-01-19

## TL;DR

This study develops a quality evaluation index system for health conversational AI (HCAI) to support assessment of its safety, reliability, and clinical applicability.

## Contribution

A novel, user-centered index system for evaluating health conversational AI quality, validated through expert consultation and statistical analysis.

## Key findings

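- The final system includes 3 primary indicators, 7 secondary indicators, and 28 tertiary indicators.
- Ethics and compliance carried the highest AHP weight among the primary indicators (0.4781), followed by health consultation capability (0.4112) and user experience (0.1107).
- The authors conclude that the system is scientifically valid and practically relevant for assessing and improving HCAI.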

## Abstract

Effective communication is fundamental to health care; however, demographic transitions and a widening global health workforce gap are intensifying the imbalance between service demand and resource supply. Health conversational artificial intelligence (HCAI) based on large language models offers a potential pathway to improve the accessibility and personalization of care. Nevertheless, the lack of a rigorous, user-centered evaluation framework limits the systematic assessment of HCAI quality, raising concerns regarding safety, reliability, and clinical applicability.

This study aims to establish a scientific and systematic quality evaluation index system for HCAI, providing both a theoretical foundation and a practical tool for the assessment and optimization of HCAI.

Based on a literature review, industry standards, and expert group discussions, a preliminary framework for the index system was established. Two rounds of Delphi expert consultations were then conducted to collect expert opinions. The analytic hierarchy process (AHP) was applied to assign weights to indicators at each level, and the final content and structure of the index system were determined.
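As an illustration of the AHP weighting step, the following minimal sketch derives priority weights from a pairwise comparison matrix and checks its consistency. It is not the authors' implementation: the row geometric-mean method is one common approximation of the principal eigenvector, the comparison matrix is hypothetical (the experts' actual judgments are not given in this summary), and the function names `ahp_weights` and `consistency_ratio` are made up for this example. The random index values are Saaty's standard table.

```python
import math

# Saaty's random consistency index (RI), keyed by matrix order n.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(matrix):
    """Approximate AHP priority weights with the row geometric-mean method."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_ratio(matrix, weights):
    """Return CR = CI / RI, where CI = (lambda_max - n) / (n - 1)."""
    n = len(matrix)
    if n < 3:
        return 0.0  # 1x1 and 2x2 reciprocal matrices are always consistent
    # Estimate lambda_max as the mean of (A w)_i / w_i.
    aw = [sum(a * w for a, w in zip(row, weights)) for row in matrix]
    lambda_max = sum(x / w for x, w in zip(aw, weights)) / n
    ci = (lambda_max - n) / (n - 1)
    return ci / RANDOM_INDEX[n]


# HYPOTHETICAL pairwise judgments for three primary indicators
# (illustrative only; these are not the study's expert ratings).
example_matrix = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
weights = ahp_weights(example_matrix)
cr = consistency_ratio(example_matrix, weights)
print([round(w, 4) for w in weights], round(cr, 4))
```

A consistency ratio below 0.10 is the conventional threshold for accepting a comparison matrix. In a multi-level index system, the same calculation would be repeated for each group of sibling indicators.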

Both rounds of expert consultation achieved a 100% response rate. The authority coefficient of the experts was 0.84 in both rounds. The Kendall W coefficient ranged from 0.14 to 0.20 in the first round and from 0.13 to 0.17 in the second round, with all values showing statistical significance (round one: importance P<.001, feasibility P<.001, sensitivity P<.001; round two: importance P=.001, feasibility P<.001, sensitivity P=.001). The final HCAI quality evaluation index system consisted of 3 primary indicators, 7 secondary indicators, and 28 tertiary indicators. According to AHP weight calculations, the primary indicators were ranked in descending order as follows: ethics and compliance (0.4781), health consultation capability (0.4112), and user experience (0.1107).
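For readers unfamiliar with the Kendall W coefficient reported above, here is a minimal sketch of how the statistic can be computed. It assumes each expert supplies a complete ranking of the items without ties. Delphi ratings on a Likert scale usually contain ties and call for a tie-corrected formula, so this is a simplified illustration rather than the authors' procedure. The rankings and the function names `kendalls_w` and `chi_square_statistic` are hypothetical.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance W for untied, complete rankings.

    rankings: list of m lists, each a permutation of 1..n (one per rater).
    """
    m = len(rankings)
    n = len(rankings[0])
    # Sum of the ranks each item received across all raters.
    rank_sums = [sum(r[j] for r in rankings) for j in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    # S: squared deviations of the rank sums from their mean.
    s = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def chi_square_statistic(w, m, n):
    """Approximate test statistic for W, with n - 1 degrees of freedom."""
    return m * (n - 1) * w


# HYPOTHETICAL rankings of 4 indicators by 3 experts (not study data).
example_rankings = [
    [1, 2, 3, 4],
    [2, 1, 3, 4],
    [1, 3, 2, 4],
]
example_w = kendalls_w(example_rankings)
example_chi2 = chi_square_statistic(example_w, m=3, n=4)
print(round(example_w, 4), round(example_chi2, 4))
```

W ranges from 0 (no agreement) to 1 (complete agreement). The significance of W is typically assessed by comparing the chi-square statistic against a chi-square distribution with n − 1 degrees of freedom.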

The evaluation index system constructed in this study demonstrates scientific validity and practical relevance. It provides a valuable reference for the quality assessment, model optimization, and regulatory oversight of HCAI systems.

## Full-text entities

- **Diseases:** CRAFT-MD (MESH:D013736), AI (MESH:C538142), hallucination (MESH:D006212), health (OMIM:603663), CPMI (MESH:C562377), LLMs (MESH:D007806), AHP (MESH:D010335), disease (MESH:D004194), HCAI (MESH:D003291)
- **Chemicals:** Cr (MESH:D002857), Ca (MESH:D002118), Cs (MESH:D002586)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12865354/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12865354/full.md

## References

71 references — full list in the complete paper: https://tomesphere.com/paper/PMC12865354/full.md

---
Source: https://tomesphere.com/paper/PMC12865354