# Evaluating the external validity of an artificial intelligence-based mobile support app for caregiving relatives by an online expert survey

**Authors:** Dominik Wolff, Michael Marschollek

PMC · DOI: 10.1186/s12911-026-03407-2 · BMC Medical Informatics and Decision Making · 2026-02-27

## TL;DR

This study evaluates how well an AI-based app for caregivers aligns with expert opinions using an online survey, finding high validity and precision.

## Contribution

The paper introduces a method for evaluating AI-based expert systems using external expert surveys and highlights its challenges and benefits.

## Key findings

- Experts rated the app's topic recommendations highly, with an average score of 4.4 and median of 5.
- The system showed high precision (0.965) and recall (0.986) in personalization.
- Experts largely agreed with one another; in the rare cases where they disagreed with the system's topic ordering, no single alternative ordering emerged.

## Abstract

Mobile Care Backup (MoCaB) is a support app for family caregivers that provides textual information on topics personalized to their specific care situation. Personalization is performed by an artificial intelligence-based expert system. Here, we present an evaluation of the expert system’s validity by nursing experts external to the project. Furthermore, we discuss the general limitations of an online survey as an evaluation methodology for expert systems.

This study was conducted as an online survey in German and English. A total of nine experts, all of whom were female and had extensive (outpatient) care experience, were included. The participants were presented with descriptions of multiple fictitious family caregivers and the system’s personalized list of topics. They were then asked to rate the appropriateness of the topics on a five-point Likert scale and to suggest additional topics. The collected data were analyzed descriptively to investigate whether MoCaB’s topic recommendation strategy aligns with the judgment of experts external to the project. For deviating topic sequences, the consensus of the experts was verified by pairwise rank correlation using Spearman’s rho. Additionally suggested topics were checked to determine whether they were part of the system but had not been provided (false negatives).
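The pairwise rank correlation step can be illustrated with a minimal sketch. It assumes each expert supplies a tie-free ranking of the same topics; the expert names and rankings below are hypothetical and are not taken from the study, and the paper's own analysis code is not shown in this summary.

```python
from itertools import combinations


def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings of the same items."""
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must have the same length")
    n = len(rank_a)
    if n < 2:
        raise ValueError("at least two ranked items are required")
    # Sum of squared rank differences, then the closed form for tie-free ranks.
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def pairwise_rho(rankings):
    """Compute rho for every unordered pair of experts."""
    return {
        (x, y): spearman_rho(rankings[x], rankings[y])
        for x, y in combinations(sorted(rankings), 2)
    }


# Hypothetical data: position i holds the rank an expert gave to topic i.
rankings = {
    "expert_a": [1, 2, 3, 4, 5],
    "expert_b": [1, 2, 3, 5, 4],
    "expert_c": [5, 4, 3, 2, 1],
}
pairs = pairwise_rho(rankings)
```

Values near 1 across all pairs would indicate that the experts converge on one alternative ordering; mixed or negative values would indicate that no single alternative sequence exists.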

In the 495 submitted ratings, participants rated the suggested topics’ appropriateness relatively high, with a mean rating of 4.4 and a median of 5. This indicates that participants considered most of the recommended topics important for the fictitious family caregiver. The system’s personalization performance was high (precision of 0.965 and recall of 0.986). Overall, the experts were in agreement. In the rare cases where experts disagreed with the system’s ordering of topics, no single alternative sequence emerged.
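For readers unfamiliar with the two personalization metrics, the following sketch shows how precision and recall are computed from counts. The counts are hypothetical and do not reproduce the study's figures. Treating a recommended topic that experts judged inappropriate as a false positive is an assumption of this sketch; the abstract only defines false negatives (topics present in the system but not provided).

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from raw counts.

    true_pos:  recommended topics judged appropriate
    false_pos: recommended topics judged inappropriate (assumed definition)
    false_neg: topics available in the system but not recommended
    """
    recommended = true_pos + false_pos
    relevant = true_pos + false_neg
    precision = true_pos / recommended if recommended else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall


# Hypothetical counts for illustration only.
precision, recall = precision_recall(true_pos=18, false_pos=2, false_neg=1)
```

With these illustrative counts, precision is 18 / 20 and recall is 18 / 19.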

The MoCaB system’s external validity is high, and isolated inconsistencies will be resolved in the project group. Using an online survey to evaluate the system’s validity with external experts is complex and time-consuming. Participants need a very high degree of competence, as they must infer a topic’s content from its title. Nevertheless, it is an essential step in the evaluation process of expert systems and, if carried out correctly, can identify weak spots and further improve the expert system.

The online version contains supplementary material available at 10.1186/s12911-026-03407-2.

## Full-text entities

- **Diseases:** IDs 3 and 9 (MESH:C535742), ID (MESH:C537985), aggressions (MESH:D010554), dementia (MESH:D003704), stroke (MESH:D020521)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12952002/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12952002/full.md

## References

2 references — full list in the complete paper: https://tomesphere.com/paper/PMC12952002/full.md

---
Source: https://tomesphere.com/paper/PMC12952002