Assessing the quality and integrating the evidence and strength of recommendations in the guidelines for gastric precancerous lesions

Jia-yin Ou; Yang Liu; Lang Zhang; Tian-qi Luo; Jia-yu Li; Li-ming Lu; Lin Wang; Qiu-rong He; Xin Liu; Hua-feng Pan

PMC · DOI:10.1186/s12885-025-13687-y·February 15, 2025

Assessing the quality and integrating the evidence and strength of recommendations in the guidelines for gastric precancerous lesions

Jia-yin Ou, Yang Liu, Lang Zhang, Tian-qi Luo, Jia-yu Li, Li-ming Lu, Lin Wang, Qiu-rong He, Xin Liu, Hua-feng Pan

PDF

Open Access

TL;DR

This study evaluates the quality of clinical guidelines for gastric precancerous lesions and finds them to be generally poor with inconsistent evidence support.

Contribution

The study systematically assesses and identifies gaps in the quality and consistency of guidelines for gastric precancerous lesions.

Findings

01

Only one out of nine guidelines was considered high quality, with an average methodological quality score of 46.22%.

02

64.4% of recommendations were strong, but only 17.5% were supported by high-quality evidence.

03

The lowest scoring domain was applicability, with a mean score of 24.56%.

Abstract

Clinical practice guidelines (CPGs) are intended to offer appropriate recommendations for clinical practice based on the available evidence while acknowledging existing gaps and uncertainties. The quality of CPGs for gastric precancerous lesions (GPL), distribution of evidence quality, and strength of recommendations are unknown. Systematically evaluate the quality of CPGs for GPL and identify areas for improvement in the development process. PubMed, Embase, Cochrane Library, Cumulative Index to Nursing and Allied Health Literature, and six online CPG repositories were systematically searched for CPGs related to GPL. Three researchers independently assessed the methodological quality of the included CPGs by using the AGREE II tool. The reporting and recommendation quality of the CPGs were evaluated using the RIGHT and AGREE-REX tools through consensus. Evidence-based CPGs were…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Diseases1

GPL

Figures2

Click any figure to enlarge with its caption.

Flow diagram for search and selection of CPGs. CPGs, clinical practice guidelines

Correlations among the scores of AGREE II, RIGHT, and AGREE-REX domains. AGREE, Appraisal of Guidelines for Research and Evaluation; RIGHT, Reporting Items for Practice Guidelines in Healthcare; AGREE-REX, Appraisal of Guidelines for Research and Evaluation-Recommendation Excellence; p* < 0.05, *p* < 0.001, p* < 0.001; The depth of the color of the squares represents the magnitude of the Spearman correlation coefficient

Funding8

—http://dx.doi.org/10.13039/501100001809National Natural Science Foundation of China
—Innovation Team and Talents Cultivation Program of National Administration of Traditional Chinese Medicine
—Key Research and Development Program of Guangdong Province
—Key Laboratory (Pi Wei diseases and Pi-deficiency syndrome) of State Administration of Traditional Chinese Medicine
—Guangdong provincial key laboratory of syndrome and formula
—Special Funds for the Cultivation of Guangdong College Students’ Scientific and Technological Innovation (“Climbing Program” Special Funds)
—Guangzhou University of Chinese Medicine discipline quality improvement project
—College Students’ Innovative Entrepreneurial Training Plan Program

Keywords

Clinical practice guidelinesAGREE IIRIGHTAGREE-REXGastric precancerous lesionsQuality appraisal

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsClinical practice guidelines implementation · Gastric Cancer Management and Outcomes · Colorectal Cancer Screening and Detection

Full text

Introduction

Gastric cancer remains a significant global issue and is the fifth most prevalent cancer and the third leading cause of cancer-related deaths, according to the latest published international cancer statistics [1]. Despite their potential for early recognition and treatment, most gastric cancer cases are diagnosed at advanced stages, resulting in high mortality among patients. However, in the early development of gastric cancer, there are often some reversible early precancerous lesions, including atrophic gastritis (AG), intestinal metaplasia (IM), and dysplasia, which can be used as markers for the early development of gastric cancer [2, 3]. The probabilities of atrophic gastritis, intestinal metaplasia, and moderate-to-mild dysplasia developing into gastric cancer within 5 years were 0.1%, 0.25%, and 0.6%, respectively, and that of severe dysplasia was as high as 6% [4].

Clinical practice guidelines (CPGs) for gastric precancerous lesions (GPL) can assist doctors in systematically assessing the risk and progression of GPL and in performing interventions based on risk stratification to reduce the occurrence of gastric cancer in the future [5]. Therefore, several countries and international organizations have constructed and revised CPGs for GPL to enhance its diagnostic and interventional efficacy. CPGs are established through systematic evaluations of evidence and provide recommendations that consider the advantages and disadvantages of various interventions, with the ultimate aim of providing patients with optimal healthcare and aiding medical practitioners in making well-informed clinical decisions [6]. The efficacy of CPGs is contingent on several pivotal factors, including their quality, rigor of the methodology employed, and transparency of the development process [7]. Notably, superior-quality CPGs correlate with more favorable treatment outcomes by by providing recommendations such as appropriate clinical treatment and management [8]. In addition, a clear definition of the level of evidence on which a CPG’s recommendations are based is also crucial [9, 10]. Previous studies have demonstrated considerable heterogeneity in CPG quality across various domains [11]. Common issues encountered in CPGs include an overwhelming volume of complex documentation, limited clarity regarding supporting evidence, disregard for important stakeholders, absence of editorial independence, and inadequate applicability [11–13]. Nevertheless, there remains a lack of quality assessments of existing CPGs for GPL, as well as a lack of comprehensive elucidation of the level of evidence underpinning these CPGs.

The Appraisal of Guidelines for Research and Evaluation II (AGREE II) is a valuable and dependable instrument for assessing the methodological quality of CPGs [14–16]. AGREEII can be used to evaluate CPGs for various diseases [14, 17–20]. However, a trustworthy CPG should meet rigorous methodological criteria and have transparent, clear, and complete reporting [21]. The Reporting Items for Practice Guidelines in Healthcare (RIGHT) has been widely implemented as a CPG reporting standard and is a useful tool for CPG makers in clinical medicine and public health and for journal editors, peer reviewers, and CPG users [22, 23]. To enhance the quality of CPG recommendations and guarantee their credibility, reliability, and implementation in clinical practice, an international research group devised a supplementary system called the Appraisal of Guidelines Research and Evaluation Recommendations Excellence (AGREE-REX). This system serves as an extension of the AGREE II [18, 24].

Therefore, this study aimed to use three evaluation tools (AGREE II, RIGHT, and AGREE-REX) to evaluate the methodology, reporting, and recommendation quality of CPGs for GPL and analyze the distribution of the quality of evidence and strength of recommendations in these CPGs. The purpose was also to identify potential factors that contribute to the substandard quality of CPGs and to emphasize potential avenues for improvement, thereby offering a reliable reference for the future development of CPGs in the field of GPL.

Methods

Eligibility criteria

CPGs were included if they: (1) focused on the diagnosis and management of GPL, which includes AG, IM, and dysplasia [4, 25]; (2) were written in English; and (3) were published from January 2011 to April 2023. Consistent with the methods used in previous studies [17, 20], both evidence- and consensus-based CPGs were included. For the updated CPG, only the most recent version was considered.

CPGs were excluded if: (1) full texts were not available; (2) they were reviews, letters, editorials, and correspondence studies; and (3) they were interpreted, translated, or adapted, as well as duplicate publications.

Search strategy

A researcher conducted a literature search, and four databases (PubMed, Embase, Cochrane Library, Cumulative Index to Nursing and Allied Health Literature) were systematically searched along with six CPG repositories: the National Institute for Health and Clinical Excellence, Scottish Intercollegiate Guidelines Network, Guidelines International Networks, Agency for Healthcare Research and Quality, National Health and Medical Research Council, and World Health Organization. These databases were searched by combining keywords and medical subject headings for CPGs and GPL; detailed search strategies are provided in Supplementary File 1. The search period was from January 1, 2011, to April 1, 2023.

Study selection

The retrieved records were exported to EndNote X7.7.1 (Thomson Reuters Corporation, CA, USA), and duplicates were eliminated using the software command “Find duplicates” and manual inspection. Two researchers screened the remaining records based on the titles and abstracts of the relevant articles. Subsequently, the researchers obtained the full texts of potentially relevant CPGs for further analysis, and the full texts of the included studies were assessed using the same inclusion and exclusion criteria. When a discrepancy arose regarding the assessment of CPGs by the two researchers, deliberations were held with the involvement of another researcher to achieve a consensus.

Data extraction

Two researchers performed data extraction and retrieved all supplementary materials related to the methodology for developing CPGs for a more comprehensive assessment of the included CPGs, and another researcher checked consistency. The characteristics of each CPG were extracted to understand their basic information for further subgroup analyses. Using the author and publication year as CPG IDs, we extracted the following data: country, type of development organization (medical association or society / expert panel), use of the CPG quality tool (yes, no, or not stated), version (updated or first time), development method (evidence-based (EB) or consensus-based (CB)), and source of funding (yes, no, or not stated). When a CPG was identified as not being developed by a specific association, it was classified by an “society / expert panel.”

Quality evaluation

The quality of the included CPGs was systematically evaluated using AGREE II, RIGHT, and AGREE-REX tools. To prepare for the evaluation, each assessor underwent training on the recommended training modules on the AGREE and RIGHT websites and learned how to evaluate each item of the CPGs in the AGREE II, RIGHT, and AGREE-REX user manuals. Subsequently, three CPGs were chosen for pilot evaluation, and discussions were held to resolve any notable disparities in scores between assessors and to clarify any other aspects that could enhance assessors’ evaluative skills.

Evaluation of methodological quality

Three researchers independently evaluated the methodological quality of the CPGs using the AGREE II tool, which consists of 23 items grouped into six domains: scope and purpose, stakeholder involvement, rigor of development, clarity of presentation, applicability, and editorial independence [26]. The researchers assessed each AGREE II item of the CPG according to the AGREE II guidelines and the criteria suggested for each item and scored it using a 7-point Likert scale [14], which ranged from 1 (strongly disagree) to 7 (strongly agree), where scores between 2 and 6 indicated inadequate adherence to the AGREE II item criteria. Three researchers independently evaluated and recorded the scores, after which the AGREE II score of each researcher was collated and recorded in a Microsoft Excel spreadsheet by one researcher. When there was a difference of more than two points in the score of any item included in the CPG, it was reassessed or discussed in a meeting until the score difference was reduced or consensus was reached. The individual domain scores for each CPG were then compiled and calculated as a percentage of the maximum attainable score, employing the formula (score obtained—lowest possible score)/(highest score—lowest possible score) × 100% [27].

In the comprehensive assessment of CPG, the overall rating item was evaluated using a 7-point Likert scale and calculated as percentages, employing the same methodology utilized in prior studies [18, 28]. Regarding the second global evaluation item, a CPG was deemed to be of high quality, consistent with previous research [18, 29, 30], if all three domains in the AGREE II tool were considered the most important: stakeholder engagement (domain 2), rigor of development (domain 3), and editorial independence (domain 6), attaining a minimum of 50% of the maximum possible score.

Evaluation of report quality

The RIGHT statement assesses the reporting quality of CPGs. It includes seven domains: basic information, background, evidence, recommendations, review and quality assurance, funding, declaration of interests and management, and other information. The seven domains involved 22 items accompanied by elaboration documents [31]. Each item’s reporting quality was assessed using a three-level scale by three researchers through discussion, with ratings of “completely reported,” “partially reported,” and “not reported” corresponding to numerical values of 1, 0.5, and 0 [32, 33], respectively. The reporting rate for each domain in each CPG was then calculated based on (scores from items “reported” in each domain)/(the total number of items in each domain) × 100%, and the overall reporting rate of each CPG was calculated based on the score for the “reported” item for all domains/35 [27].

Evaluation of recommendation quality

The AGREE-REX tool was designed to provide guidance for reporting recommendations for CPGs [24]. It encompasses three essential quality domains comprising nine individual items: clinical applicability, values and preferences, and implementability. Three researchers evaluated the recommendation quality of the included CPGs using the AGREE-REX instrument through in-person discussions, assigning item scores ranging from 1 to 7. The domain scores were calculated following the same methodology described in AGREE II.

Quality of evidence and strength of recommendations

We first identified the grading system used for each evidence-based CPG and the evidence of different qualities and recommendations for different strengths. The Grading of Recommendation Assessment, Development, and Evaluation (GRADE) system [34, 35], which is widely recognized by many professional societies, has been adopted as the most ideal and commonly used method for evidence grading and recommendation designation. To achieve a high degree of standardization of the statistical results, graded evidence and recommendations from the CPGs were incorporated into the GRADE system whenever possible.

During the evaluation process, recommendations lacking clarity regarding the quality of evidence and strength were excluded. In cases where multiple qualities of evidence were present to support a recommendation, the evidence with the highest quality was prioritized. By re-evaluating each CPG, we determined the distribution of evidence quality and recommendation strength in evidence-based CPGs.

Statistical analysis

The AGREE II, RIGHT, and AGREE-REX scores of the researchers were entered into Microsoft Excel (Microsoft, WA, United States) to compute standardized scores for each domain and the overall scores for each CPG. Continuous variables were presented as means ± standard deviation (SD), while categorical variables were presented as frequencies and percentages. Spearman’s correlation was used to investigate the correlations between the scores of the AGREE II, RIGHT, and AGREE-REX domains. An independent sample t-test was employed to assess the disparities between the two groups, while a one-way analysis of variance was used to compare the differences among multiple groups. We conducted subgroup analyses on the AGREE II, RIGHT, and AGREE-REX domains as well as the overall score, stratified by version (first edition or update), whether funding source is declared, whether conflict of interest is declared, publication country or region, publication time, development method, whether CPG quality assessment tools are used, journal, whether scoring system is used, scope, and disease. The inter-rater reliability of the AGREE II tool was tested using a two-way random-effects model to calculate intra-class correlation coefficients (ICC) and 95% confidence intervals (CI). Consistent with previous studies [17, 20] and according to Landis and Koch [36], the level of inter-rater reliability was categorized as follows: a value between 0.01 and 0.20 indicated a minor degree of agreement, a value between 0.21 and 0.40 denoted a fair degree, a value between 0.41 and 0.60 represented a moderate degree, a value between 0.61 and 0.80 indicated a substantial degree and a value between 0.81 and 1.00 indicated a very good level of agreement. A significance level of p < 0.05 was deemed statistically significant. Statistical analyses were performed using SPSS version 27 (IBM SPSS Statistics, IBM Corporation), and the bioinformatics online service platform (http://www.bioinformatics.com.cn/) was used to generate Spearman’s correlation coefficient maps for the scores of each CPG evaluation tool.

Results

Literature screening

A comprehensive search of literature websites yielded 2,765 records supplemented by three additional records from 2 CPG databases, NICE and AHRQ. After the elimination of duplicate records and initial screening, 101 results were obtained. These results were thoroughly analyzed through a full-text literature review and meticulous screening based on the eligibility criteria. Ultimately, nine CPGs were identified (Fig. 1).Fig. 1. Flow diagram for search and selection of CPGs. CPGs, clinical practice guidelines

Characteristics of CPGs

Among the nine CPGs identified, the majority originated from the United States (four CPGs), followed by China (two CPGs), Italy, the United Kingdom, and Europe each had one CPG. Notably, one of the CPGs from China was developed by an expert panel, whereas the remaining eight CPGs were developed by medical societies. Of the CPGs, eight were applicable for diagnosis or treatment purposes, two of which were suitable for prevention. Additionally, six CPGs disclosed their funding sources, five declared conflicts of interest, one included methodologists, and one CPG utilized a CPG-quality tool (Table 1).Table 1. Characteristics of included CPGsCPG IDDevelopment organizationType of development organizationCountry/region of originDevelopment methodUsed CPG quality toolIncluded CPG methodologistJournalFunding sourcesStated conflict of interestTang et al., 2012 [37]CGGDGsociety / expert panelChinaEBNo statedNo statedChin J Integr MedYesNoSharaf et al., 2013 [38]SPCASGEMedical AssociationUnited StatesEBNo statedNo statedGastrointest EndoscYesNoEvans et al., 2015 [39]SPCASGEMedical AssociationUnited StatesEBNo statedNo statedGastrointest EndoscYesNoBanks et al., 2019 [40]BSGMedical AssociationUKEBYesNo statedGutNo statedYesLahner et al., 2019 [41]SIED, SIGE, SIMIMedical AssociationItalianEBNo statedNo statedDig Liver DisNo statedNoPimentel-Nunes et al., 2019 [4]ESGE, EHMSG, ESP, SPEDMedical AssociationEuropeEBNo statedNo statedEndoscopyNo statedYesGupta et al., 2020 [42]AGAMedical AssociationUnited StatesEBNo statedYesGastroenterologyYesYesShah et al., 2021 [43]AGAMedical AssociationUnited StatesCBNo statedNo statedGastroenterologyYesYesWang et al., 2022 [44]SSDBCATCM, GBCMAMedical AssociationChinaEBNo statedNo statedChinese MedicineYesYesCPG ID clinical practice guideline identifier, EB evidence-based, CB consensus-based, CGGDG the Chronic Gastritis Guideline Development Group, SPCASGE The Standards of Practice Committee of the American Society for Gastrointestinal Endoscopy (ASGE), BSG The British Society of Gastroenterology, SIED the Italian Society of Digestive Endoscopy, SIGE the Italian Society of Gastroenterology, SIMI and the Italian Society of Internal Medicine, ESGE European Society of Gastrointestinal Endoscopy, EHMSG European Helicobacter, and Microbiota Study Group, ESP European Society of Pathology, SPED Sociedade Portuguesa de Endoscopia Digestiva, AGA American Gastroenterological Association, SSDBCATCM The Spleen and Stomach Diseases Branch of the China Association of Traditional Chinese Medicine, GBCMA The Gastroenterology Branch of Chinese Medical Association, UK United Kingdom of Great Britain and Northern Ireland

Evaluation of methodological quality

The mean scores of six domains of the AGREE II scale were as follows: domain 1 (scope and purpose) was 64.22% (SD = 16.09%); domain 2 (stakeholder involvement) was 31.78% (SD = 13.77%); domain 3 (rigour of development) was 49.56% (SD = 21.18%); domain 4 (clarity of presentation) was 71.67% (SD = 12.39%), which was the highest score among the six domains; domain 5 (applicability) was 24.56% (SD = 12.10%), which was the lowest among the six domains; domain 6 (editorial independence) was 36.33% (SD = 24.78%) (Table 2). The mean overall score of all CPGs was 46.22% (SD = 9.24%), and only one CPG [40] was considered of high quality. Items 5 “The views and preferences of the target population (patients, public, etc.) have been sought” and item 20 “The potential resource implications of applying the recommendations have been considered” had the lowest scores, both with a mean score below 2, distinguishing them from the other items. Items 12 “There is an explicit link between the recommendations and the supporting evidence” and item 16 “The different options for management of the condition or health issue are clearly presented” had the highest mean scores (Supplementary Table 1).Table 2. Domain and overall scores of CPGs according to AGREE IICPG IDScope and purpose (%)Stakeholder involvement (%)Rigour of development (%)Clarity of presentation (%)Applicability (%)Editorial independence (%)Overall score (%)Tang et al., 2012 [37]521188768039Sharaf et al., 2013 [38]52373669212540Evans et al., 2015 [39]70335578252547Banks et al., 2019 [40]78596074475362Lahner et al., 2019 [41]50245169262240Pimentel-Nunes et al., 2019 [4]87266489215056Gupta et al., 2020 [42]81434448118352Shah et al., 2021 [43]67311383385047Wang et al., 2022 [44]41223559241933Mean ± SD64 ± 1632 ± 1450 ± 2172 ± 1225 ± 1236 ± 2546 ± 9CPG ID clinical practice guideline identifier, AGREE Appraisal of Guidelines for Research and Evaluation

The ICC values for each domain were all > 0.80, and the interrater reliabilities of the three researchers were good (Supplementary Table 2). There was a high level of consistency in evaluating the CPGs among the three researchers.

Evaluation of report quality

The mean overall reporting quality score of CPGs was 54.89% (SD = 14.11%). Among the seven domains of the RIGHT scale, domain 1 (basic information) had the highest mean reporting rate at 70.00% (SD = 21.07%). Conversely, domain 6 (funding and declaration and management of interests) had the lowest mean reporting rate, with a rate of 37.67% (SD = 28.70%). The remaining five domains ranked in descending order of mean scores as follows: domain 3 (evidence) was 67.78% (SD = 16.42%), domain 4 (recommendations) was 60.33% (SD = 12.55%), domain 7 (other information) was 57.44% (SD = 22.15%), domain 5 (review and quality assurance) was 55.56% (SD = 27.32%), and domain 2 (background) was 52.22% (SD = 17.04%) (Table 3). Among all the items, item 1b “Describe the year of publication of the guideline” achieved the highest mean reporting results, with a rate of 100% (SD = 0%). Conversely, item 8b “Describe the setting(s) for which the guideline is intended, such as primary care,low- and middle-income countries, or inpatient facilities” achieved the poorest mean reporting rate of 0% (SD = 0%) (Supplementary Table 3).Table 3. Domain and overall scores of CPGs according to RIGHTCPG IDBasic information (%)Background (%)Evidence (%)Recommendations (%)Review and quality assurance (%)Funding and declaration and management of Interests (%)Other information (%)Overall score (%)Tang et al., 2012 [37]833860507501747Sharaf et al., 2013 [38]5044505725135044Evans et al., 2015 [39]3344705725255046Banks et al., 2019 [40]8381807950758379Lahner et al., 2019 [41]8325605025256749Pimentel-Nunes et al., 2019 [4]926910079100508380Gupta et al., 2020 [42]10063606450256764Shah et al., 2021 [43]6750504375386754Wang et al., 2022 [44]7556806475883367Mean ± SD74 ± 2152 ± 1768 ± 1660 ± 1356 ± 2738 ± 2957 ± 2259 ± 14CPG ID clinical practice guideline identifier, RIGHT Reporting Items for Practice Guidelines in Healthcare

Evaluation of recommendation quality

The mean overall score for the recommendation quality of the CPGs was 19.11% (SD = 7.74%), with the three domains ranked in descending order of mean scores: clinical applicability, 32.00% (SD = 13.54%); implementability, 24.00% (SD = 7.68%); and values and preferences, 7.00% (SD = 6.80%) (Table 4).Table 4. Domain and overall scores of CPGs according to AGREE-REXCPG IDClinical applicability (%)Values and differences (%)Implementability (%)Overall score (%)Tang et al., 2012 [37]61172533Sharaf et al., 2013 [38]110177Evans et al., 2015 [39]2802515Banks et al., 2019 [40]3342519Lahner et al., 2019 [41]3342519Pimentel-Nunes et al., 2019 [4]3943322Gupta et al., 2020 [42]33173326Shah et al., 2021 [43]224811Wang et al., 2022 [44]28132520Mean ± SD32 ± 147 ± 724 ± 819 ± 8CPG ID clinical practice guideline identifier, AGREE-REX Appraisal of Guidelines for Research and Evaluation-Recommendation Excellence

The three items that obtained the highest mean scores were item 1 “Evidence”, item 2 “Applicability to Target Users”, and item 8 “Purpose”; exclusively, these three items out of the total nine achieved a mean score exceeding three points. The three items with the lowest mean scores pertain to the “values and preferences” domain: item 4 “Values and Preferences of Target Users”, item 6 “Values and Preference of Policy/ Decision Makers”, and item 7 “Values and Preferences of Guideline Developers” (Supplementary Table 4). Remarkably, the mean scores of these three items were exceptionally low, nearly approaching the value of one.

Subgroup analysis of the scores of AGREE II, RIGHT, and AGREE-REX domains

In the analysis of subgroup scores for each domain and the overall score of the CPG assessment tool, only the domain “editorial independence” (p = 0.035) within the “stated conflict of interest” subgroup of AGREE II exhibited a statistically significant difference. There were statistical differences observed in the domain “background” (p = 0.009) and “funding and declaration and management of interests” (p = 0.027) within the “stated conflict of interest” subgroup of RIGHT. Additionally, statistical differences were found in the “overall score” (p = 0.035) within the “year” subgroup of RIGHT. Conversely, the AGREE-REX scores showed no statistical differences in any subgroup (Table 5).Table 5AGREE II, RIGHT, and AGREE-REX domain and overall scores for different subgroups of CPGs (Mean ± SD, %)AGREE IISubgroupsStatisticsScope and purposeStakeholder involvementRigour of developmentClarity of presentationApplicabilityEditorial independence****AGREE II overall scoreVersionP = 0.759P = 0.499P = 0.356P = 0.251P = 0.621P = 0.974P = 0.901 First5(55.56%)62.6 ± 15.4934.8 ± 18.3155.8 ± 20.0667.2 ± 11.1722.6 ± 15.4736.6 ± 32.0546.60 ± 10.14 Updated4(44.44%)66.25 ± 1928 ± 4.9741.75 ± 22.6877.25 ± 12.9727 ± 7.5336 ± 16.3545.75 ± 9.50Founding sourcesP = 0.36P = 0.52P = 0.416P = 0.366P = 0.26P = 0.678P = 0.149 Yes6(66.67%)60.5 ± 14.6829.5 ± 11.4245.17 ± 25.1268.83 ± 13.1721.17 ± 10.833.67 ± 28.9843.00 ± 6.70 No stated3(33.33%)71.67 ± 19.336.33 ± 19.6658.33 ± 6.6677.33 ± 10.4131.33 ± 13.841.67 ± 17.152.67 ± 11.37Stated conflicts of interestP = 0.186P = 0.312P = 0.347P = 0.794P = 0.345P = 0.035P = 0.186 Yes5(55.56%)70.8 ± 18.1736.2 ± 14.9943.2 ± 20.5870.6 ± 16.9528.2 ± 14.2751 ± 22.6650.00 ± 10.98 No4(44.44%)56 ± 9.3826.25 ± 11.5357.5 ± 21.9273 ± 4.6920 ± 8.2918 ± 12.0841.50 ± 3.70Country/region of originP = 0.618P = 0.448P = 0.115P = 0.670P = 0.872P = 0.341P = 0.942 America4(44.44%)67.50 ± 11.9636.00 ± 5.2937.00 ± 17.8069.50 ± 15.4623.75 ± 11.1845.75 ± 27.4946.00 ± 12.35 Other countries5(55.56%)61.60 ± 19.7828.40 ± 18.0659.60 ± 19.4073.40 ± 10.9225.20 ± 14.0628.80 ± 22.4046.50 ± 4.93YearP = 0.449P = 0.499P = 0.344P = 0.678P = 0.278P = 0.091P = 0.367 > 20176(66.67%)67.33 ± 18.3434.17 ± 14.3044.50 ± 18.6870.33 ± 15.1827.83 ± 12.8046.17 ± 23.4748.33 ± 10.63 < 20173(33.33%)58.00 ± 10.3927.00 ± 14.0059.67 ± 26.3174.33 ± 4.7318.00 ± 8.8916.67 ± 14.4342.00 ± 4.36RIGHTSubgroupsStatisticsBasic informationBackgroundEvidenceRecommendationsReview and quality assuranceFunding and declaration and management of InterestsOther informationRIGHT Overall scoreVersionP = 0.391P = 0.718P = 0.264P = 0.936P = 0.215P = 0.265P = 0.93P = 0.620 First5(55.56%)79.8 ± 18.2150.2 ± 21.9962 ± 10.9560 ± 12.145 ± 20.9227.6 ± 28.4456.8 ± 25.1256.60 ± 14.71 Updated4(44.44%)66.75 ± 24.854.75 ± 10.6975 ± 20.8260.75 ± 14.9868.75 ± 31.4650.25 ± 27.1658.25 ± 21.5661.75 ± 14.93Founding sourcesP = 0.251P = 0.484P = 0.118P = 0.135P = 0.845P = 0.398P = 0.042P = 0.121 Yes6(66.67%)68 ± 23.8749.17 ± 9.1361.67 ± 11.6955.83 ± 8.1854.17 ± 24.5831.5 ± 30.5147.33 ± 19.5653.67 ± 9.81 No stated3(33.33%)86 ± 5.258.33 ± 29.4880 ± 2069.33 ± 16.7458.33 ± 38.1950 ± 2577.67 ± 9.2459.33 ± 17.62Stated conflicts of interestP = 0.143P = 0.009P = 0.225P = 0.154P = 0.071P = 0.027P = 0.18P = 0.005 Yes5(55.56%)83.4 ± 13.1363.8 ± 11.9974 ± 19.4965.8 ± 14.7970 ± 20.9255.2 ± 25.9966.6 ± 20.4268.80 ± 10.89 No stated4(44.44%)62.25 ± 24.9537.75 ± 8.9660 ± 8.1753.5 ± 4.0437.5 ± 2515.75 ± 11.9346 ± 20.9346.50 ± 2.08Country/region of originP = 0.153P = 0.779P = 0.092P = 0.307P = 0.273P = 0.272P = 0.908P = 0.209 America4(44.44%)62.50 ± 28.6050.25 ± 8.9657.50 ± 9.5755.25 ± 8.8143.75 ± 23.9425.25 ± 10.2158.50 ± 9.8264.40 ± 15.84 Other countries5(55.56%)83.20 ± 6.0253.80 ± 22.6776.00 ± 16.7364.40 ± 14.5065.00 ± 28.5047.60 ± 35.9456.60 ± 30.1152.00 ± 9.09YearP = 0.051P = 0.224P = 0.348P = 0.373P = 0.311P = 0.056P = 0.072P = 0.035 > 20176(66.67%)83.33 ± 11.7457.33 ± 19.1371.67 ± 16.3563.17 ± 14.7262.50 ± 26.2250.17 ± 26.3266.67 ± 18.2665.50 ± 12.66 < 20173(33.33%)55.33 ± 25.4242.00 ± 3.4660.00 ± 10.0054.67 ± 4.0441.67 ± 28.8712.67 ± 12.5039.00 ± 19.0545.67 ± 1.53AGREE-REXSubgroupsStatisticsClinical ApplicabilityValues and PreferencesImplementabilityAGREE-REX overall scoreVersionP = 0.619P = 0.527P = 0.692P = 0.501 First5(55.56%)34.20 ± 17.758.40 ± 8.0225.00 ± 5.6620.80 ± 9.65 Updated4(44.44%)29.25 ± 7.095.25 ± 5.5022.75 ± 10.5317.00 ± 4.97Founding sourcesP = 0.669P = 0.384P = 0.334P = 0.826 Yes6(66.67%)30.50 ± 16.748.50 ± 8.1222.17 ± 8.5918.67 ± 9.69 No stated3(33.33%)35.00 ± 3.464.00 ± 0.0027.67 ± 4.6220.00 ± 1.73Stated conflicts of interestP = 0.823P = 0.527P = 0.752P = 0.848 Yes5(55.56%)31.00 ± 6.368.40 ± 6.1924.8 ± 10.2119.60 ± 5.51 No stated4(44.44%)33.25 ± 20.765.25 ± 8.0623.00 ± 4.0018.50 ± 10.88Country/region of originP = 0.091P = 0.527P = 0.284P = 0.138 America4(44.44%)23.50 ± 9.475.25 ± 8.0620.75 ± 10.7222.60 ± 5.94 Other countries5(55.56%)38.80 ± 13.018.40 ± 6.1926.60 ± 3.5814.75 ± 8.18YearP = 0.850P = 0.706P = 0.676P = 0.847 > 20176(66.67%)31.33 ± 5.757.67 ± 5.8224.83 ± 9.1319.50 ± 4.93 < 20173(33.33%)33.33 ± 25.425.67 ± 9.8222.33 ± 4.6218.33 ± 13.32AGREE Appraisal of Guidelines for Research and Evaluation, CPG ID clinical practice guideline identifier, AGREE Appraisal of Guidelines for Research and Evaluation, RIGHT Reporting Items for Practice Guidelines in Healthcare, AGREE-REX Appraisal of Guidelines for Research and Evaluation Recommendation Excellence

Correlations among the scores of AGREE II, AGREE-REX, and RIGHT domains

A partial positive correlation existed between the AGREE II, AGREE-REX, and RIGHT domain scores. Specifically, there was a high positive correlation between “scope and purpose,” “overall score,” and “editorial independence” in AGREE II (r > 0.80). Moreover, there is a strong positive correlation between “stakeholder involvement” and “editorial independence” (r > 0.80). Furthermore, there was a strong positive correlation between “editorial independence” and the “overall score” and the overall AGREE II score (r > 0.80). There was also a positive correlation between the “overall score” of AGREE II and the “other information” in RIGHT (r > 0.80). In addition, there were positive correlations (r > 0.80) between “background” and “recommendations” of the RIGHT instrument, as well as between “overall score,” “background,” and “funding, declaration, and management of benefits.” In AGREE-REX, positive correlations were found between “overall score,” “clinical applicability,” and “values and preferences” (r > 0.80). All correlations were statistically significant (p < 0.05) (Fig. 2).Fig. 2. Correlations among the scores of AGREE II, RIGHT, and AGREE-REX domains. AGREE, Appraisal of Guidelines for Research and Evaluation; RIGHT, Reporting Items for Practice Guidelines in Healthcare; AGREE-REX, Appraisal of Guidelines for Research and Evaluation-Recommendation Excellence; *p < 0.05, **p < 0.001, ***p < 0.001; The depth of the color of the squares represents the magnitude of the Spearman correlation coefficient

Quality of evidence and strength of recommendations

Of the eight evidence-based CPGs, seven used the GRADE system for grading evidence, and one used the Oxford Centre for Evidence-based Medicine Levels of Evidence system. In total, 235 recommendations were identified (Table 6).Table 6. The grading system used and the distribution of the quality of evidence and the strength of recommendations among evidence-based CPGs before and after the synthesis of the GRADE systemCPG IDName of grading systemQuality of evidence, No. (%)Strength of recommendation, No. (%)Tang et al., 2012 [37]OCEBM Levels of Evidence systemIa:2Ib:0IIa:26IIb:0IIIa:0IIIb:16IV:11V:1A: 4B: 42C: 12Sharaf et al., 2013 [38]GRADE systemHigh quality: 1Moderate quality: 3Low quality: 28Very low quality: 1Strong recommendation: 4Weak recommendation: 29Evans et al., 2015 [39]GRADE systemHigh quality: 4Moderate quality: 4Low quality: 8Very low quality: 0Strong recommendation: 8Weak recommendation: 8Banks et al., 2019 [40]GRADE systemHigh quality: 3Moderate quality: 9Low quality: 19Strong recommendation: 24High recommendation: 1Weak recommendation: 6Lahner et al., 2019 [41]GRADE systemA: 2B: 4C: 5D: 0Strong recommendation: 6Conditional recommendation: 5Pimentel-Nunes et al., 2019 [4]GRADE systemHigh: 6Moderate: 11Low: 18Very low quality: 0Strong recommendation: 12Weak recommendation: 8Gupta et al., 2020 [42]GRADE systemHigh quality: 0Moderate quality: 2Low quality: 3Very low quality: 0Strong recommendation:3Conditional recommendation: 2Wang et al., 2022 [44]GRADE systemHigh: 11Moderate: 15Low: 22Strong recommendation: 39Weak recommendation: 9CPG ID clinical practice guideline identifier

After reevaluation of the CPGs, a discrepancy was observed between the distribution of the strength of recommendations and the quality of evidence in the CPGs. Among the 235 recommendations and the corresponding evidence, 64.4% were classified as strong recommendations, whereas only 12.4% were deemed to be high-quality evidence. In addition, only 17.5% of the strong recommendations were supported by high-quality evidence. Additionally, 38.3% of the evidence was moderate quality, 48.9% was considered low quality, and 0.4% was categorized as very low quality. Notably, among all CPGs, the CPG of the Spleen and Stomach Diseases Branch of the China Association of Traditional Chinese Medicine (SSDBCATCM) has the largest proportion of strong recommendations and one of the largest proportions of high-level evidence in its recommendations and evidence [44]. Specifically, high-level evidence accounted for 22.9% of all its evidence, while strong recommendations accounted for 81.3% of all its recommendations. In contrast, the CPG of the Standards of Practice Committee of the American Society for Gastrointestinal Endoscopy (SPCASGE) [38] had only 3.0% high-quality evidence, and only 12.1% of its recommendations were classified as strong (Table 7).Table 7. Distribution of quality of evidence and strength of recommendations among evidence-based CPGs after synthesis with the GRADE systemCPG IDNumber of recommendationsLevel of Evidence, No. (%)Strength of Recommendation, No. (%)HighModerateLowVery lowStrongWeakTang et al., 2012 [37]562 (3.6)42 (75)12 (21.4)046 (79.3)12 (20.7)Sharaf et al., 2013 [38]331 (3.0)3 (9.1)28 (84.8)1 (3.1)4 (12.1)29 (87.9)Evans et al., 2015 [39]164 (25.0)4 (25.0)8 (50.0)08 (50.0)8 (50.0)Banks et al., 2019 [40]313 (9.7)9 (29.0)19 (61.3)025 (80.6)6 (19.4)Lahner et al., 2019 [41]112 (18.2)4 (36.4)5 (45.5)06 (54.5)5 (45.5)Pimentel-Nunes et al., 2019 [4]356 (17.1)11 (35.5)18 (51.4)012 (34.3)8 (22.9)Gupta et al., 2020 [42]502 (40.0)3 (60.0)03 (60.0)2 (40.0)Wang et al., 2022 [44]4811 (22.9)15 (31.3)22 (45.8)039 (81.3)9 (18.7)Total, No. (%)23529 (12.3)90 (38.3)115 (48.9)1 (0.0)143 (60.9)79 (33.6)CPG ID clinical practice guideline identifier, GRADE Grading of Recommendation Assessment, Development, and Evaluation

Discussion

The overall quality of the methodology, recommendations, and reporting of the CPGs for GPL was low. Furthermore, only one of the nine CPGs was considered high quality. According to the evaluation results, the quality of the CPGs was highly heterogeneous, and the same CPGs had different scores in different domains. There was a significant correlation between the scores on some AGREE II, RIGHT, and AGREE-REX domains. Although the eight included CPGs were deemed evidence-based, the strong recommendations put forth by these CPGs were generally not supported by high-quality evidence.

After applying the screening criteria to identify relevant CPGs, nine qualified CPGs were identified, with only the CPG of the British Society of Gastroenterology [40] deemed of high quality. CPGs aim to assist doctors and patients in making informed healthcare decisions in specific clinical scenarios, ultimately enhancing the quality of care provided to patients [45]; however, low-quality CPGs’ invalid care recommendations may be confused by users [18]. Furthermore, the nine identified CPGs originated from five different countries or regions; however, owing to the complexity and changing global healthcare landscape [46], and this limited source may not fully address the challenge. Broader sources of CPGs will allow more professional medical experts from different countries and regions will be able to participate and share their experiences and perspectives to better adapt to the different medical needs of different regions.

Nine CPGs demonstrated a clear overall purpose and scope with concise expressions. However, the quality of domain 3 (rigour of development) was not sufficiently high and there is room for improvement. CPGs must rely on reliable evidence, and recommendations must be adopted to be truly beneficial [28]. Moreover, there is a need to clearly communicate the methodology used to make recommendations so that health professionals can understand the evidence on which the clinical recommendations they are considering implementing are based [47]. Among the six domains, domain 2 (stakeholder involvement) performed poorly, with item 5 (the patients’ views and preferences have been sought) having the lowest mean score among all items. Addressing patient and stakeholder engagement is often challenging, potentially impeding optimal CPG development and affecting future patient engagement [48]. However, there is a growing recognition of the importance of incorporating the perspectives of patient groups to which CPGs are applicable, and it is crucial to address these perspectives and involve stakeholders in future recommendations [49, 50]. Domain 5 (applicability) had the lowest mean score of all domains. This is a crucial domain, as not all healthcare settings can fulfill the checklist and may need to adapt recommendations to deliver care [51]. Consequently, a lack of applicability may affect the actual implementation of CPGs, which in turn may negatively affect patient outcomes and experiences [52]. Given the diversity of clinical practice and wide use of CPGs, it is critical to analyze potential barriers to recommendations in CPGs [53].

Among the seven RIGHT domains, the basic information domain had the best mean reporting rate, which could effectively provide basic information to help target users better understand the CPGs. It is worth noting that certain CPGs may encounter potential interference from external interests, consequently jeopardizing their objectivity and reliability and diminishing their overall quality [54]. Consequently, the domains of funding, benefit statements, and management performed the worst and were unfavorable for the quality of the CPGs. Developers may need to scrutinize CPGs more closely for potential conflicts of interest to maintain objectivity in medical practice. The absence of reporting on Item 8b in all CPGs suggests a lack of specific recommendations or details regarding the execution and implementation of CPGs. This is detrimental to helping clinicians implement CPGs and may have a negative impact on medical practice. To ensure effective implementation and assessment of implementation outcomes, enhancing the reporting rate in this domain for CPGs that require enforcement and implementation is advisable.

Among the three domains of the AGREE-REX, domain values and preferences had the lowest mean scores, with the three items displaying exceptionally low scores falling under this domain. Regarding values and preferences, a systematic review evaluating the incorporation of patient perspectives in the development of CPGs found that, while most institutions advocate for the inclusion of patients and their perspectives [55], there is a shortage of specific information on the methods employed to achieve this, which is also present in the CPGs we included. This can lead to CPG recommendations with insufficient consideration of patient needs, compromising patient experience. Therefore, CPG authors need to pay more attention to values and preferences, how to effectively incorporate patients’ views, further strengthen details, and consider more diverse needs.

After considering the evaluation results, we found that the overall scores of AGREE II and RIGHT in these two CPGs [4, 40] were the highest, and the scores in various domains of each tool were relatively high, and the scores in the domain of applicability were also high. Therefore, we believe that these two CPGs [4, 40] are the most suitable for clinicians among the CPGs we evaluated.

In the subgroup analysis of the scores of AGREE II, RIGHT, and AGREE-REX domains, we discovered that the declaration of member conflicts of interest during CPG development affected the scores of both the AGREE II editorial independence domain and the RIGHT “Funding and declaration and management of interests” domain. To ensure that external influences do not diminish the quality of CPGs, we recommend disclosing conflicts of interest among members as widely as possible during future CPG development. Our study also revealed that the publication time affects the RIGHT evaluation score. Newer CPGs had higher scores than older CPGs, indicating that CPGs enhanced reporting practices over time. In addition, the correlation analysis conducted on the AGREE II, RIGHT, and AGREE-REX domains indicated associations between some aspects of methodology, recommendations, and reporting quality.

Several CPGs used the GRADE system to develop recommendations, which helped provide consistent evidence evaluation and recommendations. Nevertheless, disparities were present in the dispersion of evidence quality and recommendation strength among different CPGs. For instance, the SSDBCATCM CPG [44] had stronger recommendations, whereas the SPCASGE CPG [38] had many weak recommendations. Among the CPGs examined, 235 recommendations were identified. Of these, 143 (64.4%) were classified as strong, whereas only 25 (17.5%) were supported by high-quality evidence. High-quality evidence is scarce, and this discrepancy in the proportions is noteworthy. We feel that this inconsistency may be related to the lack of rigor in the development of CPGs, which may be reflected in the scores of domain 3 of AGREE II (Rigour of development), so it may be of interest to investigate the correlation between the scores of domain 3 of AGREE II and the degree of inconsistency. Given the small number of CPGs included in our study, we plan to include more CPGs in the future to explore this association and to explore the reasons for the inconsistency between the strength of recommendations and the level of evidence. Besides, for some special patients, such as the elderly, children, pregnant women, patients with rare diseases, or responding to health emergencies, although the level of evidence is low, it still may provides important and effective clinical information, so developers of CPGs may give strong recommendations [56]. In addition, degree of some recommendations may be defined as strong even if they are not supported by high-level research evidence because of good observed clinical effects. However, it is important to note that strong recommendations necessitate a favorable balance between benefits, harms, and burdens, a criterion that is seldom fulfilled [56, 57]. Consequently, strong recommendations based on low-quality evidence are rarely appropriate. Expressing strong recommendations for recommendations that do not have high-quality evidence may lead healthcare practitioners to overemphasize certain treatments that do not work as expected. Therefore, in addition to focusing on the quality of the CPG domains, the production process of subsequent CPGs should consider the consistency of strong recommendations and high-quality evidence.

Three assessment tools were used to assess the methodological quality, recommendation quality, and reporting quality and to analyze the evidence and strength of CPG recommendations for GPL. Through a multidimensional comprehensive evaluation, we found some deficiencies in the development of current CPGs and identified room for improvement in the development of future CPGs. It can help improve the development process and reporting content for future CPGs, optimize CPG recommendations, and continuously improve the quality of CPGs. Moreover, the introduction of rigorously evaluated CPGs can help improve clinical practice [58].

A notable strength of this study lies in its utilization of the AGREE II, RIGHT, and AGREE-REX tools for evaluating CPGs. Incorporating multiple assessment tools allows for a more comprehensive examination, facilitating the identification of areas for improvement within CPGs. Moreover, the advantage of using multiple assessment tools is that they can check whether the aspects of concern of the CPG development team have received sufficient attention, thus helping to better grasp the full picture of CPG development. Simultaneously, the recommendations and evidence in the CPGs were analyzed, focusing on the consistency between the strength of the recommendations and the quality of evidence.

Regarding constraints, despite our comprehensive literature search of the database, it is possible that certain CPGs were not identified, and there may have been some CPGs that met the inclusion criteria but were inadvertently overlooked. Furthermore, our study was limited to CPGs published in English, which restricted the number of CPGs included in the analysis. Additionally, although AGREE II, RIGHT, and AGREE-REX can be employed to evaluate the methodology, reporting, and recommendation quality of CPGs, these assessment tools do not offer an exhaustive examination of the content within the CPGs. The sample size of this study was small, only 9 CPGs, which may lead to high randomness and low external validity of the results of subgroup analysis. The inclusion of only English literature should be an important reason for the small sample size. However, the results of this study still have certain reference significance for the development and research of related CPGs in the future.

Conclusion

Through this evaluation of the CPGs for GPL, we found that there are many deficiencies in the quality of the current CPGs for GPL, and the overall quality was low. The methodology, reporting, and recommendation quality of CPGs for GPL needs to be further improved. These two CPGs [4, 40] are the ones we think are the most applicable to clinicians. Future CPG development should adopt more rigorous and transparent criteria to ensure the creation and implementation of high-quality CPGs that effectively address the healthcare requirements of both patients and the general population. In terms of recommendations for CPGs, more attention should be paid to the consistency of the strength of recommendations and the quality of evidence in developing future CPGs to ensure that the strength of recommendations is appropriate.

Supplementary Information

Supplementary Material 1.

Bibliography1

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Institute of Medicine (US) Committee on Standards for Developing Trustworthy Clinical Practice Guidelines. Clinical Practice Guidelines We Can Trust. Graham R, Mancher M, Miller Wolman D, Greenfield S, Steinberg E, editors. Washington (DC): National Academies Press (US); 2011.24983061 · pubmed ↗