# Agreement testing of AMSTAR-PF, a tool for quality appraisal of systematic reviews of prognostic factor studies

**Authors:** Michael L Henry, Neil E O’Connell, Richard D Riley, Karel G M Moons, Beverley J Shea, Lotty Hooft, Sarah B Wallwork, Johanna A A G Damen, Nicole Skoetz, Ruth P Appiah, Carolyn Berryman, Sophie M Crouch, Grace A Ferencz, Ashley R Grant, Katherine M Henry, Aleksandra M Herman, Emma L Karran, Indika Koralegedera, Hayley B Leake, Erin MacIntyre, Brendan Mouatt, Karma Phuentsho, Daniel A Van Der Laan, Ellana Welsby, Louise K Wiles, Erica M Wilkinson, Marelle K Wilson, Monique V Wilson, G Lorimer Moseley

PMC · DOI: 10.1136/bmjopen-2025-109388 · 2026-01-27

## TL;DR

This study tested the agreement and usability of AMSTAR-PF, a new tool for appraising the quality of systematic reviews of prognostic factor studies, and found it useful despite variable agreement across domains and appraisers.

## Contribution

The study introduces AMSTAR-PF, a novel quality appraisal tool for systematic reviews of prognostic factor studies, and evaluates its rater agreement and usability.

## Key findings

- Interrater agreement averaged 0.59 across domains (range 0.21–0.90), indicating moderate agreement on average.
- Intrapair agreement was higher, averaging 0.75 across domains (range 0.45–0.95); for the overall appraisal, 94.6% of final intrapair ratings were identical or one category apart.
- Appraisal time improved with use, averaging around 34 minutes after the first two appraisals.

## Abstract

To test the agreement and usability of a novel quality appraisal tool: A MeaSurement Tool to Assess systematic Reviews of Prognostic Factor studies (AMSTAR-PF).

Observational study.

14 appraisers of varied experience levels and backgrounds, including undergraduate, master’s and PhD students, postgraduate researchers, research fellows and clinicians.

Eight systematic reviews were rated by all appraisers using AMSTAR-PF.

Planned measures included intrapair and inter-pair agreement using Cohen’s and Fleiss’ kappa, time of use and time to reach consensus. Interrater agreement was an added measure, and Gwet’s agreement coefficient was calculated and presented due to its greater stability across agreement levels. The percentage of intrapair agreements identical or one category apart was also presented.

Across the domains, interrater agreement averaged 0.59 (range 0.21–0.90), inter-pair agreement 0.61 (range 0.24–0.91) and intrapair agreement 0.75 (range 0.45–0.95). For the overall rating, agreement was 0.46 (95% CI 0.30 to 0.62) for interrater agreement, 0.46 (95% CI 0.17 to 0.74) for inter-pair agreement and 0.68 (range of averages 0.22–1.00) for intrapair agreement. The majority (60.7%) of intrapair ratings were identical, with 94.6% of final ratings either identical or only one category different for the overall appraisal. The time taken to appraise a study with AMSTAR-PF improved with use and averaged around 34 min after the first two appraisals.
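The "identical or one category apart" figure is a simple closeness summary on an ordinal scale, sketched below. The category labels in `SCALE` are hypothetical (the abstract does not name AMSTAR-PF's overall rating categories), and the pair counts are synthetic, chosen only so that the output reproduces the reported percentages. The reported 60.7% and 94.6% happen to match 34/56 and 53/56, which would fit 7 pairs each rating 8 reviews; that pairing is an inference here, not something the abstract states.

```python
def closeness_summary(pairs, scale):
    """Return (share identical, share identical-or-adjacent) for rating pairs.

    `pairs` is a list of (rating_1, rating_2) tuples; `scale` lists the
    ordinal categories from lowest to highest.
    """
    rank = {label: i for i, label in enumerate(scale)}
    diffs = [abs(rank[a] - rank[b]) for a, b in pairs]
    n = len(diffs)
    identical = sum(d == 0 for d in diffs) / n
    within_one = sum(d <= 1 for d in diffs) / n
    return identical, within_one

# Hypothetical ordinal scale; not taken from the source.
SCALE = ["critically low", "low", "moderate", "high"]

# Synthetic pair-level ratings (not the study's data): 56 pairs in total.
pairs = (
    [("moderate", "moderate")] * 34   # identical
    + [("low", "moderate")] * 19      # one category apart
    + [("low", "high")] * 3           # two categories apart
)

identical, within_one = closeness_summary(pairs, SCALE)
print(f"identical:        {identical:.1%}")   # 60.7%
print(f"within one level: {within_one:.1%}")  # 94.6%
```

Unlike a kappa-type coefficient, this summary makes no correction for chance agreement, so it is best read alongside the agreement coefficients rather than in place of them.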

Despite some variance in agreement for different domains and between different appraisers, the testing results suggest that AMSTAR-PF has clear utility for appraising the quality of systematic reviews of prognostic factor studies.

## Full-text entities

- **Diseases:** brain injury (MESH:D001930), pain (MESH:D010146), Back pain (MESH:D001416), COVID-19 (MESH:D000086382), Cancer (MESH:D009369), PN (MESH:C536741), low back pain (MESH:D017116), concussion (MESH:D001924)
- **Species:** Homo sapiens (human) [taxon 9606]

## Figures

16 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12853518/full.md

---
Source: https://tomesphere.com/paper/PMC12853518