Effectiveness and mechanisms of interventions to reduce low-value thyroid function tests: a systematic review
Carolina Pioch, Meik Hildebrandt, Gregor Goetz, Verena Vogt

TL;DR
This review examines interventions to reduce unnecessary thyroid tests, finding that structural changes like decision support systems are most effective.
Contribution
The study systematically evaluates the effectiveness of interventions to reduce low-value thyroid testing and identifies key factors for success.
Findings
Structural interventions, especially clinical decision support systems, were most effective in reducing low-value thyroid testing.
Most interventions (40 out of 54) reduced test rates by 20% or more, though evidence certainty was low to moderate.
Only four interventions referenced theoretical foundations or contextual factors, indicating a gap in understanding success drivers.
Abstract
Thyroid function tests are frequently overused. This systematic review aims to summarise the effectiveness of behaviour change interventions to reduce low-value thyroid testing and to identify theoretical foundations and contextual factors associated with their success. We conducted a comprehensive search of Medline, Embase, Scopus, and the Cochrane Library for randomised and non-randomised controlled trials as well as before-and-after studies. We followed PRISMA guidelines, critically appraised study quality, and applied the GRADE approach to assess certainty of evidence. We categorised interventions as soft (education, reminders, feedback, guidelines) or structural (change in funding, clinical decision support systems). We included 47 studies (54 interventions) including five randomised trials. Structural interventions, particularly clinical decision support systems, were the most…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1- —http://dx.doi.org/10.13039/501100014840Gemeinsame Bundesausschuss
- —Technische Universität Berlin (3136)
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHealthcare cost, quality, practices · Thyroid Disorders and Treatments · Health Promotion and Cardiovascular Prevention
Introduction
Thyroid function tests (TFTs) rank among the most frequently ordered laboratory tests worldwide [1], particularly thyroid-stimulating hormone (TSH) tests for diagnosing and managing thyroid disorders [2]. Alongside TSH, free hormone tests (free T3, T4) are often included in thyroid panels, further increasing the number of TFTs conducted [3]. For instance, approximately 10 million TFTs are performed annually in the UK at an estimated cost of £30 million [4, 5]. In Germany, about 30% of the adult population undergo thyroid function testing each year [6].
Despite the high volume of TFTs, the prevalence of thyroid dysfunction in Europe is only 3.8%, with 259.1 new cases per 100,000 people per year [7]. This discrepancy has raised concerns about the potential overdiagnosis and unnecessary treatment of conditions such as subclinical, asymptomatic hypothyroidism, which often resolves without intervention [8]. Unnecessary testing for suspected thyroid diseases appears to be widespread, frequently leading to additional and potentially avoidable medical procedures [9].
Clinical practice guidelines recommend against routine TSH screening for asymptomatic adults without known thyroid disease (consensus-based evidence, [10, 11]). Additionally, Free T3 or T4 tests are not advised for screening hypothyroidism unless pituitary or hypothalamic disease is suspected or known (consensus-based evidence, [12, 13]). Instead, TSH measurement is the most reliable method for detecting common forms of hypothyroidism and hyperthyroidism [14–16]. Despite the Choosing Wisely (CW) initiative’s aims to implement these guidelines, significant challenges remain in effectively translating them into practice.
Recent research identified a considerable amount of low-value care in German claims data, with 24.8% to 35.5% of TFTs deemed inappropriate [17, 18]. A bi-national study from Canada and the UK showed that approximately one-third of adult patients without a clear indication for testing underwent at least one TSH test within a 2-year period [19]. Similarly, a study from France suggests that inappropriate TFT ordering remains common [20]. TFTs therefore represent an emblematic example of low-value diagnostic testing, characterised by high baseline use, limited alignment with disease prevalence, and the potential to trigger downstream testing cascades. De-implementation of such unnecessary tests is frequently discussed in international literature [21]. Several systematic reviews have shown that targeted de-implementation strategies can effectively reduce low-value care [6]. Various interventions, including educational programmes for healthcare providers [22, 23], reminders [24, 25], decision support systems [26, 27], and stricter regulatory policies [28, 29], have been proposed to reduce unnecessary TFTs in different healthcare settings. These interventions can optimise clinical laboratory stewardship, contribute to cost savings, and improve healthcare resource allocation [3, 30].
Theoretical frameworks are essential for informing interventions aimed at de-implementing low-value care by identifying key elements that need to be addressed [31]. For instance, the Theoretical Domains Framework (TDF) explores barriers and facilitators for behaviour change [32], while the Choosing Wisely De-Implementation Framework systematically reduces low-value care [33]. A systematic review published by Zhelev et al. in 2016 provided an overview of behaviour change interventions to reduce the volume of TFTs ordered [34]. However, due to the studies’ poor methodological quality and reporting, strong conclusions could not be drawn, nor could specific intervention types be recommended. Furthermore, the theoretical foundations and contextual factors that contribute to the successful implementation of the interventions have not yet been thoroughly investigated. Since the publication of that review, a substantial number of new research related to interventions aimed at reducing the ordering of TFTs has emerged. In particular, as the latest studies included in the earlier review were published in 2014, before the widespread adoption of digital ordering systems and the increasing use of digital interventions, technological developments also need to be considered. In light of both the limitations of the earlier review and the growing body of relevant research, we are conducting a new systematic review, building on the work of Zhelev et al. [34].
Our review aims to identify effective strategies and their contexts by examining a series of interventions targeting the reduction of TFT orders. We address the following research questions (RQ):
- What is the effectiveness of behaviour change interventions in reducing the ordering of TFTs?
- Which theoretical foundations are used to explain the mechanisms underlying the interventions and which contextual factors are associated with the success of interventions aimed at improving evidence-based thyroid diagnostics?
In doing so, we build on the scope of the previous review: RQ1 was retained for continuity, while RQ2 was newly introduced to enhance the understanding of how and why interventions may work in different settings, drawing on information available in the included studies.
Methods
We first manually searched for well-conducted systematic reviews on the same topic and identified one published in 2016 [34]. The review was assessed for methodological quality and risk of bias, showing moderate quality according to the AMSTAR 2 tool [35, 36] and a low risk of bias (RoB) evaluated by the ROBIS tool (RoB In systematic reviews, [37]). Assessments are available in Additional files 1 and 2.
Our review was guided by the Cochrane methodology and followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 statement [38–40]. The PRISMA checklist is provided in Additional file 3.
Registration
We prospectively registered the review in PROSPERO (CRD42023492441). Changes to the information provided at registration are reported in Additional file 4.
Data sources, searches, and selection
We conducted a systematic literature search in line with the previous review by Zhelev et al. [34], using the same databases: MEDLINE (Ovid), Embase (Ovid), and the Cochrane Central Register of Controlled Trials (CENTRAL, the Cochrane Library). Searches were performed on the 21 st of November 2023. To increase comprehensiveness, we extended the methodology by also searching Scopus, screening the first 300 references from Google Scholar, hand-searching the reference lists of included articles, and contacting experts in the field to identify any additional relevant studies. We limited the search period for MEDLINE, Embase, and CENTRAL to articles published between the 1 st of January 2014 and the 21 st of November 2023, as articles published before 2014 were already screened for eligibility in the earlier review [34]. For Scopus and Google Scholar, no publication date limits were applied, as these databases were not included in the earlier review. We performed an update of the search on the 7th of July 2024.
We applied the same search strategy as Zhelev et al. [34], consisting of search terms related to (1) thyroid function tests and (2) inappropriate testing (Additional file 5). We included studies involving adults receiving TFTs in inpatient, outpatient, or emergency care settings. Eligible interventions comprised behaviour change interventions targeting physicians. Studies were required to compare these interventions to usual care and report outcomes related to test volume, appropriateness, costs, or patient health. We included randomised controlled trials (RCTs), non-randomised controlled studies, and before-and-after studies providing comparative data. We excluded studies that lacked a comparator, did not provide relevant outcome data, focused only on thyroid tests as part of a broader panel without disaggregated results, or were cross-sectional studies, opinion pieces, dissertations, or meeting abstracts. Only studies published in English or German were considered. All inclusion and exclusion criteria for assessing the effectiveness of the interventions (RQ1) followed those of the prior review and are outlined in Table 1. Table 1. Inclusion and exclusion criteria of studies based on the PICO framework (Population, Intervention, Control, Outcome)AttributeInclusion criteriaExclusion criteriaPopulationAdults receiving thyroid function testsInterventionBehaviour change intervention types [41, 42]: Educational interventions Guideline and protocol development and implementation Changes to funding policy Reminders of existing guidelines and protocols Clinical decision support systems, including test request forms and computer-based decision support Audit and feedbackControlUsual careOutcomeChange in the total number of thyroid function testsNumber of inappropriately ordered thyroid function testsTest-related expenditureHealth benefits to individual patientsStudies encompassing ordered thyroid function tests along with other laboratory tests but reporting only the average effect (across all tests)Study typeRandomised controlled trialsNon-randomised controlled studiesBefore-and-after studies providing comparative data on at least one of the outcome measuresCross-sectional studiesEditorials and opinionsStudies without comparative dataDissertations and meeting abstracts SettingInpatient care, outpatient care, emergency departmentLanguageEnglish or GermanAll other languages
To address our additional RQ2, we supplemented the prior methodology by searching grey literature and conducting targeted searches for additional publications by the authors of the included interventions. This was done to determine whether potential contextual factors and theoretical backgrounds were described in the development and implementation of the interventions under review. We imported all search results into Endnote software and removed any duplicates before proceeding to the screening phase [43]. Based on the pre-defined eligibility criteria, two review authors (CP, MH) independently conducted title-abstract and full-text screening. Discrepancies were resolved by consensus and by consulting a third researcher (VV). No automation tools were used in the process.
Data extraction and analysis
Data from the included studies were extracted by means of piloted data-extraction tables by two independent researchers (CP, MH). A list of all data items and detailed descriptions can be found in Additional file 6. Any discrepancies between the researchers were discussed until consensus was reached. For all identified RCTs, we contacted study authors via email in case of missing study protocols.
We used relative change (improvement/deterioration) as the standardised outcome metric, chosen due to the heterogeneity in outcomes and reporting. Following the approach outlined by Zhelev et al. [34], we set a threshold of ± 20% to interpret a relative change as large, indicating a substantial change in testing behaviour rather than minor variation. The outcomes included (1) changes in the total number of thyroid function tests, (2) test-related expenditure, (3) the number of inappropriately ordered thyroid function tests (appropriateness), (4) the pattern of ordering, and (5) the coefficient of variation (CoV) among physicians. We extracted the direction of the effect (positive/negative subject to the desired outcome measure). Confidence intervals and differences in means were extracted when reported. All calculated values are indicated as such.
We refrained from conducting a meta-analysis due to anticipated clinical heterogeneity among the studies. Specifically, we expected variability in intervention types, timing, and outcome measures, such as shifts towards TSH ordering, test volumes per population unit, or counts of laboratory tests per provider. Given these anticipated differences, a quantitative synthesis was not planned. Instead, we employed a narrative synthesis method. The analysis framework relied on an existing typology of behaviour change intervention types [41, 42], encompassing the following categories: (1) clinical decision support systems (CDSS), (2) changes to funding policy, (3) educational interventions, (4) reminders of existing guidelines and protocols, (5) guideline and protocol development and implementation, and (6) audit and feedback [34]. Furthermore, we introduced a grouping for the interventions and outcomes to support descriptive analysis and improve clarity of reporting (visualised in Additional file 7). We grouped the six types of interventions into structural and soft interventions based on their influence on physician behaviour. While we were guided by terminologies used in previous literature (active and soft [44], strict and soft [45], structural [46]), we developed the final categorisation to best reflect the characteristics of the interventions identified. The five outcome measures were divided into two categories: volume reduction (test rates, expenditure) and improvement of care (appropriateness, pattern, CoV).
Structural interventions directly influence physicians at the point of care and are integrated into their routine workflow, making them difficult to bypass. These interventions include changes in funding, where modifications to financial structures or incentives directly affect decision-making, and CDSS, which are embedded tools that directly guide or restrict physicians' choices during patient care. CDSS include alerts, changes in the existing order form, cost displays and reflex testing, respectively, as automatic discharge of tests. Soft interventions are those that provide physicians with informational resources or guidance outside of the immediate care setting, i.e. reminders, education, guidelines/protocols, and audit/feedback. These interventions are less directly tied to the point of care and may require active engagement from physicians to be used effectively, such as through educational meetings or reminder messages. They may also include tools that are not necessarily part of the standard working routine, like memorandum pocket cards or guidelines.
For RQ2, we extracted information on theoretical foundations and contextual factors when explicitly reported in the included studies or related publications. However, reporting was sparse and inconsistent, leaving little scope for synthesis or categorisation. Therefore, findings are presented narratively, with reference to specific theoretical models or quality improvement frameworks where available.
Bias assessment and certainty of evidence
We used the Cochrane RoB 2.0 tool (RoB 2) for (cluster) RCTs and the ROBINS-I tool (RoB In Non-randomised Studies—of Interventions) for non-randomised studies of interventions (NRSI [47, 48]). RoB figures were created for each outcome domain separately using the robvis application [49].
The interventions in our review were evaluated using the Grading of Recommendations, Assessment, Development, and Evaluations (GRADE) approach, following Murad et al.’s guide for complex interventions [50, 51]. We graded the evidence both by outcome—assessing the overall effectiveness of intervention bundles from a policymaker/payer perspective—and by grouped interventions and outcomes. This dual grading approach aimed to inform policymakers on the efficacy of intervention bundles and to guide practitioners on the most effective components for patient care. To minimise potential publication bias, we expanded the databases included in the initial review, performed a grey literature search, and looked into registered trials (EU Clinical Trials Register clinicaltrialsregister.eu/ctr-search/search, WHO trialsearch.who.int/Default.aspx). All assessments were performed independently by two reviewers (CP and MH). Any disagreements were resolved by consensus involving a third reviewer (GG).
Results
Study selection
From 2782 records screened, we identified 21 additional studies beyond those reported by Zhelev et al. [34], leading to a total of 47 unique studies included in this review. The updated search revealed no additional studies (213 new records screened). All identified studies were completed. Most of the identified studies used a before-and-after design (n = 34, including time series analysis), followed by nine non-randomised controlled studies and five (cluster) RCTs, two of which had a registration record. We contacted the authors of the remaining three studies and received a response from one of them. The study selection process is presented in Fig. 1. A list of all articles excluded after full-text screening and reasons for exclusion can be found in Additional file 8.Fig. 1PRISMA flowchart of the systematic literature search. TFTs, thyroid function tests
Study characteristics
Most studies were conducted in the USA (n = 16), Canada (n = 8), and the UK (n = 6). They were published between 1979 and 2022. The majority of the studies were conducted in outpatient (n = 21) and inpatient care (n = 14). The remainder were conducted in both or in emergency departments (n = 12, Tables 2 and 3). Most of the studies were performed at a single medical site (n = 29, Table 2). Table 2. Characteristics of the included studiesStudy and countryYearStudy designSettingTarget testsThyroid tests **Studies identified in present review (n = 21)Bateman et al., Canada [52]2019Before and after, single siteInpatient rehabilitation centreTSH + Vitamin DTSHBejjanki et al., USA [53]2018Before and after, single siteAcademic medical centre17 laboratory testsTSH, FT4Bellodi et al., Italy [54]2017Controlled study, multiple sites3 hospitals (10 wards)8 laboratory testsFT4Bradshaw et al., USA [55]2021Before and after, single siteAcademic Medical CentreTFTs onlyTSH, FT3, FT4Caldarelli et al., Italy [56]2017Before and after, single siteHospital (clinical laboratory)TFTs onlyTSH, FT3, FT4Chami et al., Canada [45]2021Time series analysis with control group, multiple sitesOutpatient laboratories8 laboratory testsTSHDalal et al., USA [57]2017Before and after, single siteUrban teaching hospitalTFTs onlyTSH, FT3, FT4Delvaux et al., Belgium [58]2020Cluster RCT72 primary care centres17 laboratory testsTSHElrewini et al., Saudi Arabia [59]2022Before and after, single siteArmed Forces HospitalTFTs onlyTSHGilmour et al., Canada [30]2017Before and after, single siteAcademic ambulatory hospitalTFTs onlyTSH, FT3, FT4Janssens et al., Netherlands [44]2015Before and after, single siteGeneral care and teaching hospital82 laboratory testsTSH, FT4Krouss et al., USA [60]2022Interrupted time series, multiple sites11 hospitals and over 70 ambulatory centresTFTs onlyT3, FT3Leis et al., Canada [61]2019Before and after, single siteCoronary Care Unit (teaching hospital)TFTs onlyTSHLeung et al., USA [62]2017Before and after, single siteResident clinic in an outpatient clinic70 laboratory testsTSH, FT3, FT4MacPherson et al., Australia [63]2005Before and after, single sitePre-admission clinic8 pathology tests + various investigationsTFTs (unspecified)Muris et al., Netherlands [64]2021Before and after, multiple sites57 general practices22 laboratory testsTSH, FT4Notas et al., Greece [65]2018Before and after, single siteTertiary teaching hospitalTFTs onlyTSH, FT3, FT4, TGAb TPOAbSalinas et al., Spain [66]2016Before and after, multiple sitesPublic University Hospital + 9 primary care centres13 laboratory testsTSH, FT4Sue et al., USA [67]2019Before and after, multiple sitesAll outpatients within urban tertiary/quaternary care academic health systemTFTs onlyT3Taher et al., Canada [68]2020Before and after, single siteTertiary hospitalTFTs onlyFT3, FT4Wintemute et al., Canada [69]2019Controlled study, multiple sites6 family health teamsTFTs onlyTSHStudies included in previous review (n = 27)Adlan et al., UK [70]2011Before and after, single siteMedical Assessment Unit (hospital, acutely ill patients)TFTs onlyTFTs (TSH, FT4, TPOAb, TRAb)Baker et al., UK [71]2003Cluster RCT33 general practices5 laboratory test groupsTFTs (TSH, FT4)Berwick and Coltin, USA [23]1986Controlled cross-over, multiple sites3 ambulatory centres (same health maintenance organization)13 laboratory and imaging testsT4Chu et al., Australia [72]2013Before and after, single sitetertiary teaching hospital emergency department23 laboratory testsTFTs (unspecified)Cipullo and Mostoufizadeh, USA [73]1996Before and after, single siteCommunity hospital20 laboratory tests and preoperative testing (unspecified)TFTs (T3RU)Daucourt et al., France [25]2000Cluster RCTGeneral and psychiatric hospitalsTFTs onlyTFTs (TSH, FT3, FT4, TRH)Dowling et al., USA [74]1989Before and after, single siteInner city community health centreTSH + CBCTSHEmerson and Emerson, USA [26]2001Before and after, single siteUniversity medical centreAll laboratory testsTSH, T3, FT4, T4, FTI/T3RUFeldkamp and Carey, USA [75]1996Before and after, single siteMetropolitan hospital and 22 satellite clinicsTFTs onlyTSH, T3, T4, FTI/T3RUGama et al., UK [76]1991Controlled study, single siteDistrict general hospital5 laboratory test groups + othersTFTs (TSH, FT4)Grivell et al., Australia [77]1981Before and after, single siteTertiary-care community hospital55 laboratory testsT4Hardwick et al., Canada [29]1982Before and after, multiple sitesOutpatient laboratoriesTFTs onlyT3, T4, ETRHorn et al., USA [78]2013Interrupted time series with control group, multiple sitesAlliance of 5 multispecialty group practices27 laboratory testsTSHLarsson et al., Sweden [22]1999Before and after, multiple sites19 primary care centres14 laboratory test groupsTSH, T3, FT4Mindemark and Larsson, Sweden (follow up) [79]2009Before and after, multiple sites16 primary healthcare centres12 laboratory test groupsTSH, T3, T4, FT4Nightingale et al., UK [80]1994Before and after, single siteSupra-regional liver unit (teaching hospital)Various laboratory testsTSHRhyne and Gehlbach, USA [81]1979Before and after, single siteFamily Medicine group practiceTFTs onlyTFTs (T3RU and T4)Schectman et al., USA [82]1991Controlled study, single sitePrimary care health maintenance organization practiceTFTs onlyTFTs (TSH, T3-RIA, T3RU, T4)Stuart et al., Australia [83]2002Before and after, single sitePublic hospital emergency department14 laboratory and 10 imaging testsTFTs (unspecified)Thomas et al., UK [24]2006Cluster RCT85 primary-care practices9 laboratory testsTSHTierney et al., USA [27]1988RCT, single siteAcademic internal medicine practice8 laboratory testsTSHTomlin et al., New Zealand [84]2011Controlled study, multiple sitesNew Zealand primary care8 laboratory testsTSH, FT3, FT4Toubert et al., France [85]2000Before and after, single siteTeaching hospitalTFTs onlyTSH, FT3, FT4, TPOAb, TGAb, TRAbVan Walraven et al., Canada [28]1998Interrupted time series, multiple sitesAll clinical laboratories (not based in hospitals)7 laboratory testsTSH, T3RU, T4Vidal-Trecan et al., France [86]2003Before and after, multiple sites50 university hospitalsTFTs onlyTSH, T3, FT3, T4, FT4Willis and Datta, UK [87]2013Before and after, single siteMedical admissions unit in a district general hospital3 laboratory test groupsTFTs (unspecified)Wong et al., USA [88]1983Controlled study, single siteUniversity teaching hospital6 laboratory testsTSH, T3RU,T3-RIA, T4-RIACBC complete blood count, ETR effective thyroxine ratio, FT4 free thyroxine, FT3 free triiodothyronine, FTI free T4 index or free thyroxine index, RCT randomised controlled trial, RIA radioimmunoassay, TFTs thyroid function tests, TGAb thyroglobulin antibodies, TPOAb thyroid peroxidase antibodies, TSH thyroid stimulating hormone (thyrotropin), T3 triiodothyronine, T3RU triiodothyronine resin uptake, TRH TSH-releasing hormone, TRAb TSH-receptor antibodiesTable 3Interventions of the included studiesStudy and countrySettingSoft interventionsStructural interventionsAudit and feedbackEducational programmesGuidelines and protocolsRemindersChanges to fundingCDSSDescription of decision tool****Studies identified in present review (n = 21)**Bateman et al., Canada [52]InpatientXX-Bejjanki et al., USA [53]InpatientXAlertBellodi et al., Italy [54]InpatientXAlertBradshaw et al., USA [55]Inpatient + EDXXXAlertCaldarelli et al., Italy [56]OutpatientXReflex/DischargeChami et al., Canada [45]OutpatientXChange of order formDalal et al., USA [57]InpatientXReflex/DischargeDelvaux et al., Belgium [58]OutpatientXAlertElrewini et al., Saudi Arabia [59]UnspecifiedXXXAlertGilmour et al., Canada [30]OutpatientXXReflex/DischargeJanssens et al., Netherlands [44]Inpatient + outpatientX-Krouss et al., USA [60]Inpatient + outpatientXAlertLeis et al., Canada [61]InpatientXChange of order formLeung et al., USA [62]InpatientXX-MacPherson et al., Australia [63]InpatientXXChange of order formMuris et al., Netherlands [64]OutpatientXCost displayNotas et al., Greece [65]Inpatient + outpatientXReflex/Automatic dischargeSalinas et al., Spain [66]Inpatient + outpatientXReflex/Automatic dischargeSue et al., USA [67]OutpatientXAlertTaher et al., Canada [68]Inpatient + outpatientXChange of order form + Reflex/Automatic dischargeWintemute et al., Canada [69]OutpatientXXX-**Studies included in previous review (n = 27)*Adlan et al., UK [70]InpatientX-Baker et al., UK [71]OutpatientXX-Berwick and Coltin, USA PCF [[23](#CR23)]OutpatientX-Berwick and Coltin, USA PCFY [[23](#CR23)]OutpatientX-Berwick and Coltin, USA TSE [[23](#CR23)]OutpatientX-Chu et al., Australia [[72](#CR72)]EDXChange of order formCipullo and Mostoufizadeh, USA [[73](#CR73)]InpatientX-Daucourt et al., France MPC [[25](#CR25)]InpatientX-Daucourt et al., France TRF [[25](#CR25)]InpatientXChange of order formDaucourt et al., France Both [[25](#CR25)]InpatientXXChange of order formDowling et al., USA [[74](#CR74)]OutpatientXX-Emerson and Emerson, USA [[26](#CR26)]OutpatientXChange of order formFeldkamp and Carey, USA [[75](#CR75)]Inpatient + outpatientXReflex/Automatic dischargeGama et al., UK [[76](#CR76)]Inpatient + outpatientX-Grivell et al., Australia [[77](#CR77)]InpatientX-Hardwick et al., Canada [[29](#CR29)]OutpatientXX-Horn et al., USA [[78](#CR78)]OutpatientXCost displayLarsson et al., Sweden [[22](#CR22)]OutpatientX-Mindemark and Larsson, Sweden (follow up) [[79](#CR79)]OutpatientX-Nightingale et al., UK [[80](#CR80)]InpatientXXXChange of order formRhyne and Gehlbach, USA [[81](#CR81)]OutpatientXX-Schectman et al., USA Reminder + Feedback [[82](#CR82)]OutpatientXXX-Schectman et al., USA Reminder [[82](#CR82)]OutpatientXX-Stuart et al., Australia [[83](#CR83)]EDXXX-Thomas et al., UK Feedback [[24](#CR24)]OutpatientX-Thomas et al., UK Reminder [[24](#CR24)]OutpatientX-Thomas et al., UK Both [[24](#CR24)]OutpatientXX-Tierney et al., USA [[27](#CR27)]OutpatientXAlertTomlin et al., New Zealand [[84](#CR84)]OutpatientXXX-Toubert et al., France [[85](#CR85)]Inpatient + outpatientXX-Van Walraven et al., Canada [[28](#CR28)]OutpatientXXXChange of order formVidal-Trecan et al., France [[86](#CR86)]Inpatient + outpatientXXXChange of order formWillis and Datta, UK [[87](#CR87)]InpatientXX-Wong et al., USA [[88](#CR88)]InpatientXXChange of order formMultiple interventions per study (indicated in study column) or multiple types of intervention per intervention (multi-component interventions) possible. Alerts including best practice pop-ups, test suggestions or information (e.g. test rates, disease probabilities)*CDSS* clinical decision support system, *ED* emergency department, *MPC* memorandum pocket card, *TRF* test request form, *TSE* test-specific education, *PCFY* peer comparison feedback on yield of tests, *PCF peer comparison feedback on cost of test use
The average total observation period of the studies was 25 months, with an average of 12-month preintervention phase and a 14-month intervention/postintervention period (additional information on study characteristics is provided in Additional file 9). In total, 19 studies targeted TFTs only, while the remaining 28 studies aimed to reduce a broader number of laboratory and imaging tests. A total of 54 distinct interventions were performed within the 47 studies (Table 2).
While the previous review identified soft interventions (education, guidelines/protocols, reminders, and audit/feedback) as the most common types [34], our review revealed that most of the studies introduced structural interventions (CDSS and changes in funding, n = 30). Of these, only two studies (no newly identified studies) assessed changes to funding (Table 3). Therefore, CDSS is the most widely used type (n = 28), with 17 of the 21 newly identified studies employing CDSS. The most common CDSS involved changes of the order form (n = 12), alerts (n = 8), or reflex with automatic discharge (n = 7). Alerts included information about best practice, suggestions for testing and discharge, or probabilities of thyroid disease. Additionally, two studies displayed the costs of the tests being ordered. Among the soft interventions, guidelines/protocols (n = 18) and education (n = 16) were most commonly employed, followed by audit/feedback (n = 14) and reminders (n = 9). In total, 19 interventions consisted solely of soft components, 25 involved structural changes, and ten combined both components (Table 3).
Bias assessment
The majority of the studies did not have a control group (n = 33). Following an approach outlined by HTACG, we refrained from conducting a formal RoB assessment for these studies, as their inherent lack of internal validity is unlikely to be changed by a RoB assessment [89].
Two RCTs were of low RoB; the remaining three had some concerns due to potential selection bias. Primarily due to the risk of confounding, the RoB in controlled studies (n = 9) was critical in three studies (n = 4 outcomes), serious in six studies (n = 8 outcomes), and moderate in one study. The outcomes within a study showed the same RoB for each domain due to the relation of the outcome measures. The full RoB assessment can be found in Additional file 10.
Effectiveness of the interventions
Of the 54 interventions, the majority showed a positive direction of change (n = 52), a considerable number had effects ≥ 20% (n = 40), and many reported significant changes (n = 29, with 19 not reported; Table 4) (RQ1). Further synthesis was impracticable due to the heterogeneity of measures, tests, and reporting standards. Only 12 interventions reported confidence intervals, while the difference in means was even less common (n = 7). Relative change ranged from a 32% decrease to a 172% improvement (additional information on study results and reported effect measures is provided in Additional file 11). We refrained from computing means due to the variability of measures within the outcomes. The most frequently assessed outcome was the number or rates of tests (n = 43), with 41 showing a positive effect. A large proportion of these had effects of ≥ 20% (n = 30). Significant changes were reported in 22 of the 43 interventions, while 17 did not report significance. Expenditure outcomes were less common (n = 9), but all showed positive changes, with significant changes reported in two interventions (six not reported). Relative improvements ≥ 20% were shown by five interventions assessing the expenditure. A total of 49 interventions assessed volume-related outcomes (test numbers/rates, expenditure; Table 4). Table 4. Results of the interventionsStudy and countryType of StudyInterventionDirection of effectRelative change \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ge 20\boldsymbol{\%}$$\end{document} Significant effectNotes on outcome measure AppropriatenessDaucourt et al., France MPC [25]RCTReminder + Decision tool** + ****−−Proportion of TFTs ordered in accordance with guidelinesDaucourt et al., France TRF [25]RCTReminder + Decision tool + **** + ***** + See aboveDaucourt et al., France Both [25]RCTReminder + Decision tool + **** + *NRSee aboveDelvaux et al., Belgium [58]RCTDecision tool + − + Appropriate tests/total ordersSchectman et al., USA Reminder + Feedback [82]ControlledReminder + Feedback + **** + *NRCompliance rate with TFT protocolSchectman et al., USA Reminder [82]ControlledReminder + **** + ***** + See aboveCaldarelli et al., Italy [56]UncontrolledDecision tool + ****−*NRPrescriptive appropriateness through the ratios TSH/FT4, TSH/FT3 and the ratio “TSH Reflex”/TSHDowling et al., USA [74]UncontrolledEducation + Feedback + **** + ****−Indicated TSH/visitElrewini et al., Saudi Arabia [59]UncontrolledEducation + Guidelines + − + Unnecessary requests/total TSH requestsFeldkamp and Carey, USA [75]UncontrolledDecision tool + **** + NRShift towards TSHLeis et al., Canada [61]UncontrolledDecision tool + **** + ***** + Proportion of not indicated physician TSH ordersNightingale et al., UK [80]UncontrolledGuidelines + Decision tool + Feedback + **** + NRPatients requiring an investigation/patients testedRhyne and Gehlbach, USA [81]UncontrolledEducation + Guidelines + **** + −High/low indication test proportionToubert et al., France [85]UncontrolledGuidelines + Reminders + **** + ***** + Frequency of appropriate use of thyroid function testsCoefficient of variation (CoV)Berwick and Coltin, USA PCF [[23](#CR23)]ControlledFeedback** + **** + ****NR**CoV of rate of test use among physicians within centresBerwick and Coltin, USA PCFY [[23](#CR23)]ControlledFeedback** + **** + ****NR**See aboveBerwick and Coltin, USA TSE [[23](#CR23)]ControlledEducation** + ****−****NR**See above**Expenditure**Tierney et al., USA [[27](#CR27)]RCTDecision tool** + ****−****−**Charges per visitTomlin et al., New Zealand [[84](#CR84)]ControlledEducation + Guidelines + Feedback** + ****−****NR**Bejjanki et al., USA [[53](#CR53)]UncontrolledDecision tool** + **** + *****NR**Cost savings from reducing duplicatesCaldarelli et al., Italy [[56](#CR56)]UncontrolledDecision tool** + ****−*****NR**Elrewini et al., Saudi Arabia [[59](#CR59)]UncontrolledEducation + Guidelines** + **** + *****NR**Cost spent on the unnecessary requests of TSH testsHardwick et al., Canada [[29](#CR29)]UncontrolledGuidelines + Change in Funding** + **** + ****NR**Change in total expected costsJanssens et al., Netherlands [[44](#CR44)]UncontrolledGuidelines** + **** + *****NR**Leung et al., USA [[62](#CR62)]UncontrolledEducation + Reminder** + ****NR**** + **Change of laboratory costsStuart et al., Australia [[83](#CR83)]UncontrolledEducation + Guidelines + Feedback** + **** + **** + **Mean costs per patient**Pattern**Tomlin et al., New Zealand [[84](#CR84)]ControlledEducation + Guidelines + Feedback** + **** + **** + **Shift towards TSHWong et al., USA [[88](#CR88)]ControlledGuidelines + Decision tool** + **** + ***** + **Sought to reduce complete thyroid panelsEmerson and Emerson, USA [[26](#CR26)]UncontrolledDecision tool** + **** + ***** + **Shift towards FT4 and thyroid cascadeHardwick et al., Canada [[29](#CR29)]UncontrolledGuidelines + Change in Funding** + **** + *****NR**Sought to reduce proportion of T3 testsLarsson et al., Sweden [[22](#CR22)]UncontrolledEducation** + ****−*****TSH/TFTs: + ****T3/TSH: + ****T4/TSH****: ****−**Shift towards TSH (primary care centres)Larsson et al., Sweden [[22](#CR22)]UncontrolledEducation** + **** + *****NR**Shift towards TSH (individual physicians)Mindemark and Larsson, Sweden (follow up) [[79](#CR79)]UncontrolledEducation** + ****T3/TSH + *****T4 + FT4/****TSH −*****−**Median of physicians; shift towards TSHToubert et al., France [[85](#CR85)]UncontrolledGuidelines + Reminders** + **** + *****NR**Shift towards TSHVan Walraven et al., Canada [[28](#CR28)]UncontrolledGuidelines + Change in Funding + Decision tool** + **** + **** + **Shift towards TSHVidal-Trecan et al., France [[86](#CR86)]UncontrolledEducation + Guidelines + Reminders + Decision tool** + ****−****NR**Shift towards TSH**Test numbers or rates**Baker et al., UK [[71](#CR71)]RCTGuidelines + Feedback** + ****−*****−**Tests per 1,000 patientsThomas et al., UK Feedback [[24](#CR24)]RCTFeedback** + ****−***** + **Tests per 10,000 patientsThomas et al., UK Reminder [[24](#CR24)]RCTReminder** + ****−***** + **See aboveThomas et al., UK Both [[24](#CR24)]RCTReminder + Feedback** + ****NR****NR**See aboveBellodi et al., Italy [[54](#CR54)]ControlledDecision tool** + ****Delta****: ****−****Cento****: **** + ****Ferrara****: ****−****NR**Number of laboratory tests requested by wardsBerwick and Coltin, USA PCF [23]ControlledFeedback + ****−****NRTests per 1,000 encounters per physicianBerwick and Coltin, USA PCFY [23]ControlledFeedback− + ****NRSee aboveBerwick and Coltin, USA TSE [23]ControlledEducation + −NRSee aboveChami et al., Canada [45]ControlledDecision tool + −−Number of thyroid testsGama et al., UK [76]ControlledFeedback + **** + I: **** + C: −Tests per outpatient visitHorn et al., USA [78]ControlledDecision tool + NA−Monthly orders per 1,000 patientsSchectman et al., USA [82]ControlledReminder + Feedback + − + Number of TFTs per patient; Feedback and Non-Feedback group combinedTomlin et al., New Zealand [84]ControlledEducation + Guidelines + Feedback + TSH: ****−FT3/FT4: **** + **** + Tests per year per GPWintemute et al., Canada [69]ControlledGuidelines + Feedback + − + Wong et al., USA [88]ControlledGuidelines + Decision tool + TSH and T3RIA: ****+ T3RU and T4RIA: ****−****NRTests per monthAdlan et al., UK [70]UncontrolledGuidelines + **** + ***** + Proportion of admitted patients offered TFTsBateman et al., Canada [52]UncontrolledEducation + Feedback + **** + ****NRProportion of admitted patients offered TFTsBejjanki et al., USA [53]UncontrolledDecision tool + FT4: ****−****TSH: + FT4: ****−TSH: **** + Percentage change in the number of inpatient duplicate ordersBejjanki et al., USA [53]UncontrolledDecision tool + FT4: ****−TSH: **** + FT4: ****−TSH: **** + Odds of percentage duplicateBradshaw et al., USA [55]UncontrolledDecision tool + **** + TSH: ****−****FT4: + Number of inappropriate TSH tests ordered; FT3 excluded due to low baseline numbersCaldarelli et al., Italy [56]UncontrolledDecision tool + TSH: ****−FT4: ****−FT3: **** + ****NRNumber of thyroid testsChu et al., Australia [72]UncontrolledDecision tool + **** + ***** + Number of tests ordered per 100 ED presentationsCipullo and Mostoufizadeh, USA [73]UncontrolledGuidelines + ****−****NRTests/dischargeDalal et al., USA [57]UncontrolledDecision tool + **** + **** + Number of tests of fT3 and fT4 orders per total TSH ordersDowling et al., USA [74]UncontrolledEducation + Feedback + **** + ****−Rates of ordering TSH tests per visitEmerson and Emerson, USA [26]UncontrolledDecision tool + **** + ***** + Test sets ordered (significance for total TFTs)Feldkamp and Carey, USA [75]UncontrolledDecision tool + TSH: −T4: **** + T3RU: **** + NRTests per 1,000 patients (T3 not reported)Gilmour et al., Canada [30]UncontrolledEducation + Decision tool + **** + **** + Median number of tests performed (FT3 and FT4; TSH used for initial appropriateness)Grivell et al., Australia [77]UncontrolledFeedback− + *NRTests per 1,000 patientsHardwick et al., Canada [29]UncontrolledGuidelines + Change in Funding + **** + *NRJanssens et al., Netherlands [44]UncontrolledGuidelines + **** + *NRKrouss et al., USA [60]UncontrolledDecision tool + **** + ***** + Orders per 1,000 patient days (inpatient)/per 1,000 encounters (outpatient)Leis et al., Canada [61]UncontrolledDecision tool + **** + ***** + Patients with any TSH assay request/patients with physician-signed orderMacPherson et al., Australia [63]UncontrolledGuidelines + Decision tool + **** + ***** + Muris et al., Netherlands [64]UncontrolledDecision tool + **** + ***** + Mean test ordering rate per 1,000 patients per month per general practiceNotas et al., Greece [65]UncontrolledDecision tool + **** + **** + Number of TFTs per TSH ordered (FT4 and FT3) and per cent patients with TFT order, inpatientsNotas et al., Greece [65]UncontrolledDecision tool + **** + NRNumber of TFTs per TSH ordered (FT4 and FT3), outpatientsRhyne and Gehlbach, USA [81]UncontrolledEducation + Guidelines + − + TFTs per 100 patientsSalinas et al., Spain [66]UncontrolledDecision tool + **** + *NRRatio of FT4/TSHSue et al., USA [67]UncontrolledDecision tool + **** + **** + T3 laboratory tests/10,000 patients per weekTaher et al., Canada [68]UncontrolledDecision tool + **** + ****NRTotal number of fT4 and fT3 tests per monthToubert et al., France [85]UncontrolledGuidelines + Reminders + **** + *NRVan Walraven et al., Canada [28]UncontrolledGuidelines + Change in Funding + Decision tool + TSH: ****−T4: **** + **** + Tests per 100,000 patients per month; comparison with expected values (T3RU not reported)Vidal-Trecan et al., France [86]UncontrolledEducation + Guidelines + Reminders + Decision tool + ****−***NRWillis and Datta, UK [87]UncontrolledEducation + Guidelines + **** + ***** + **Tests per admissionInterventions sorted by outcome and type of study. Effects based on numerical results that can be found in Additional file 11. Deviations from standard outcome measure listed in last column^^Based on authors’ calculations (for values pre/postintervention see Additional file 11)ED emergency department, FT4 free thyroxine, FT3 free triiodothyronine, GP general practitioner, MPC memorandum pocket card, NR not reported, TFTs thyroid function tests, TRF test request form, TSE test-specific education, TSH thyroid stimulating hormone (thyrotropin), T3 triiodothyronine, T3RU triiodothyronine resin uptake, PCFY peer comparison feedback on yield of tests, PCF$ peer comparison feedback on cost of test use, RCT randomised controlled trial, RIA radioimmunoassay
Improvement-related outcomes (appropriateness, pattern, and CoV) were assessed in 24 interventions. Appropriateness was frequently studied (n = 14), with all showing positive direction, many with effects ≥ 20% (n = 10), and significant changes in six studies (five not reported). Pattern changes were less common (n = 8), but all showed positive effects, with significant changes in some (n = 4, with three not reported). Effects of ≥ 20% were shown in three interventions. CoV outcomes were least assessed (n = 3), with no significant changes reported (two interventions showed relative improvements ≥ 20%). Over all outcomes and interventions, the results of structural interventions were slightly more positive, with 100% showing positive effects (61% significant) and 74% with large effect sizes, compared to combined and soft interventions (combined 100% positive (40% significant), 60% large effect; soft, 94% positive (44% significant), 47% large effects; Additional file 11).
To contextualise these findings, we next evaluated the certainty of evidence using GRADE. For structural interventions (CDSS, changes in funding), we found a significantly positive effect on the outcomes that measure improvement of care based on two cluster RCTs (n = 1 effect ≥ 20%, high certainty of evidence (CoE)). Four uncontrolled studies supported these findings (positive direction, n = 3 effects ≥ 20%, two significant). For volume reduction, one RCT indicated a trend towards reducing test rates (positive, not significant (NS), low CoE), with 16 non-randomised studies pointing in the same direction (positive direction, n = 12 effects ≥ 20%, n = 9 significant). For soft interventions, one cluster RCT indicated positive improvement of care (NS, moderate CoE), as well as ten non-randomised interventions (n = 9 effects ≥ 20%, four significant). Two cluster RCTs featuring four soft interventions indicated they could achieve volume reduction (two significant, moderate CoE). Of 19 non-randomised interventions, 17 showed a positive direction as well (n = 9 effects ≥ 20%, nine significant). Similarly, combined interventions showed positive effects on the improvement of care based on one cluster RCT* (*positive, effect ≥ 20%, NS, moderate CoE) and six non-randomised interventions (n = 3 effects ≥ 20%, n = 3 significant). No RCT assessed volume reduction, but eight non-randomised interventions showed a positive trend (n = 5 effects ≥ 20%, n = 3 significant). The full GRADE evidence profiles can be found in Additional file 7, organised by intervention type (Table 7.1) and outcome category (Table 7.2).
Most of the studies had unreported funding (n = 21), some were non-profit (n = 14), lacked a specific grant (n = 12), or had unclear funding (n = 1). Twenty-two studies reported ethics committee approval (n = 18) or stated that approval was not required (n = 4). Twenty-five studies did not report on ethics approval (Additional file 12 includes information on ethics approval, funding, and conflict of interest). Although we cannot entirely rule out an overestimation of the predominantly positive findings, there is no indication of significant publication bias. No relevant trials were found in the extended registry search. All evidence of non-randomised trials was rated to be of very low certainty (Additional file 7).
Theoretical foundations and contextual factors
Information on theoretical foundations and contextual factors provided by the included studies was sparse (RQ2). Four interventions reported on the theoretical foundations of their interventions, going further than conventional references to systematic reviews or guidelines (relevant text passages included in Additional file 9). We did not find additional literature reporting on theoretical foundations or contextual factors. Consideration of contextual factors beyond theoretical models was not explicitly mentioned in any included study.
Elrewini et al. (education + guidelines + retest alert) performed a root cause analysis to develop a corresponding action plan that implements the identified root causes [59]. Leis et al. (change of order form) used a simulated setting to assess unnecessary test ordering through a hypothetical patient scenario with a quasi-randomised sample of participants. They concluded that the presence of a checkbox influences ordering behaviour [61, 90]. Stuart et al. (feedback + education + guidelines) based the components of their intervention on the core elements of the PRECEDE framework (Predisposing, Reinforcing, and Enabling Causes in Educational Diagnosis and Evaluation, [41, 83]). Wintemute et al. (guidelines + feedback + reminder) based their choice of intervention on Rogers’ theory of diffusion of innovations, supplementing evidence-based recommendations with active reminders and local feedback ([69, 91], Additional file 9).
Additionally, three interventions performed Plan-Do-Study-Act (PDSA) cycles based on different approaches. Bateman et al. (feedback + education) applied a systematic approach using process measures and evaluation based on quality improvement literature [52, 92, 93]. Gilmour et al. (education + reflex testing) and Taher et al. (reflex testing) developed their interventions based on PDSA cycles using the model for improvement framework for continuous quality improvement by Provost et al. ( [30, 68, 94, 95], Additional file 9).
Discussion
Our review sought to evaluate interventions aimed at reducing unnecessary TFTs by reviewing and synthesising recent studies, building on the review by Zhelev et al. from 2016 [34]. We identified 21 new studies, contributing to a total of 47 unique studies included in our review. The interventions comprised soft (education, guidelines/protocols, reminders, and audit/feedback) and structural (CDSS and changes in funding) interventions. The synthesis of 54 interventions across the included studies revealed predominantly positive outcomes, with 52 interventions associated with reductions or improvements in at least one outcome. Most studies reported relative reductions of at least 20%. Restricting the evidence to the five included (cluster) RCTs reaffirmed this pattern, as all five trials reported some degree of beneficial impact. In this review, we applied the Cochrane methodology and the GRADE approach, enhancing the methodological rigour [38, 50]. Nevertheless, the overall certainty of evidence was rated as low, indicating that the observed effects should be interpreted with caution and primarily viewed as indicative of promising trends rather than definitive evidence of effectiveness.
We observed a shift towards structural interventions, particularly CDSS. Of the 21 newly identified studies, 17 employed some form of CDSS, with alerts being the most commonly reported. The clustered GRADE assessment suggested potential effectiveness of de-implementation interventions, where structural interventions showed slightly more compelling results (RQ1). RCT evidence for structural interventions indicated improvement of care (moderate CoE, n = 2 RCTs) and volume reduction (low CoE, n = 1 RCT). Though RCT evidence for soft interventions shows similar results (volume reduction n = 2 RCTs, improvement of care n = 1 RCT, both moderate CoE), observational evidence suggests a higher success rate for structural interventions (structural interventions: 100% positive, 61% significant; soft interventions: 94% positive, 44% significant). The increased use of structural interventions aligns with the expectation that direct approaches at the point of care may be more likely to yield a reduction in test orders compared to soft interventions [45]. A systematic review performed by Cliff et al. on the effectiveness of CW interventions concluded that structural interventions are more effective than soft approaches [21]. Similarly, a CDSS was found to be the most effective intervention in de-implementing low-value cancer care [96]. Further research is needed to evaluate the effectiveness of structural interventions, under various conditions such as system usability, integration with existing practices, and user engagement, which are often underreported [97]. In particular, it remains unclear how differences in CDSS design, implementation context, and integration into clinical workflows shape both the magnitude and durability of observed effects.
Next to potentially promising structural interventions, our study also identified soft interventions that, while less impactful, showed compelling effects in some settings. These findings suggest that both structural and soft interventions may be suitable options for reducing TFTs. Similarly, research by Kobewka et al., which examined interventions aimed at reducing all sorts of laboratory test utilisation, found positive results across all types of interventions [98]. In particular, soft interventions can be considered in settings where profound structural interventions are not feasible. Regardless of the specific intervention type, a recent overview of reviews by Kien et al. indicates that de-implementation strategies are effective across various low-value services [6]. This overview complements our review and can inform decision-makers about the range of interventions available for reducing low-value care, as well as the potential for de-implementation strategies beyond individual indications.
Further, our review sought to identify theoretical foundations considered during implementation and contextual factors that are associated with the effectiveness of the interventions (RQ2). However, reporting on these aspects was sparse in the identified studies and grey literature. Only a few studies reported using frameworks to guide their interventions. This limits our understanding of the mechanisms driving the observed effects and reduces the ability to replicate successful interventions in different settings, as the lack of theory-driven design makes it difficult to explain how contextual and behavioural factors influence effectiveness. For example, Stuart et al. based the components of their intervention on the core elements of the PRECEDE framework thereby aligning components with identified determinants of behaviour ([83], significant positive effect with RD ≥ 20%). Similarly, Wintemute et al. drew on Rogers’ theory of diffusion of innovations to supplement evidence-based recommendations with active reminders and local feedback ([69], significant positive effect). These examples illustrate how theoretical models can strengthen intervention design by making explicit the mechanisms through which behaviour change is expected to occur. However, despite the availability of several de-implementation frameworks, they appear to be rarely applied in practice [99]. A more consistent use of such frameworks across interventions would not only facilitate comparability but also strengthen the theoretical grounding of de-implementation strategies. Integrating examples such as digital readiness or organisational culture within such frameworks could help clarify how contextual factors interact with intervention mechanisms. Policymakers should support evidence-based interventions built on robust theoretical foundations and evaluation frameworks to ensure effectiveness and lasting impact by avoiding inefficient components. Once solid evidence has been generated through primary studies, a realist review may help to fully understand how and why different components of the intervention(s) work in what contexts and for whom [100]. The success of such system-level changes depends heavily on contextual conditions, such as the availability and interoperability of local infrastructure, prevailing funding mechanisms, and regulatory frameworks. These factors determine whether an intervention can be feasibly implemented and sustained, and they should be carefully considered when transferring findings to other settings. Although some studies reported larger effects for changes in the clinical ordering system compared to soft interventions, it cannot be assumed that such approaches are simultaneously less expensive, despite the absence of recurring training sessions and evaluations [98]. While CDSS may prove cost-efficient over time through automation and scalability, they often demand substantial initial investments in digital infrastructure. In contrast, soft interventions are typically less costly to implement initially but may require ongoing efforts to sustain their effects. This trade-off should be carefully considered when designing de-implementation strategies. Cost-effectiveness analysis is necessary to evaluate the costs of interventions relative to their savings, as interventions aimed at reducing low-value care can themselves be resource-intensive.
Limitations
There are several limitations to this review that should be acknowledged. First, we included observational studies due to the complexity of the interventions. Observational studies are generally more prone to internal validity concerns compared to RCTs [38]. Second, we did not pool the effects of the interventions quantitatively because of the heterogeneity of the interventions and outcome measures. Instead, we focused on the evaluation of clustered results in order to give a concise overview. While the classification into structural or soft interventions is partly based on literature, the distinction is not always clear-cut, as some interventions span both categories or include borderline elements, and several interventions in this review explicitly combined soft and structural components. Alternative ways of grouping interventions and outcomes may therefore lead to different interpretations of the findings. In addition, conducting an extended mixed-methods synthesis integrating qualitative and quantitative evidence was beyond the scope of this review, given the sparse and inconsistent reporting of theoretical foundations and contextual factors. Third, we frequently observed small effects, which may limit the strength of the conclusions, though the overall trends were generally positive. Still, only 12 of the included interventions reported confidence intervals, limiting the ability to assess the reliability of effect estimates. The lack of confidence intervals makes it more difficult to determine the statistical robustness of reported changes, and increases the uncertainty surrounding the true effectiveness of the interventions. Furthermore, the majority of studies reported outcomes over relatively short timeframes, with few studies providing extensive follow-up. Consequently, it remains unclear whether reductions in TFT ordering were sustained once interventions ended or whether rebound effects occurred, for example due to alert fatigue, CDSS-related workflow integration issues, or system changes. Future research should address the long-term sustainability of de-implementation efforts. Fourth, most of the included studies were conducted in North America and Europe, particularly the USA, Canada, and the UK. This may limit the generalisability to health systems in other regions, especially those with low resources or limited digital infrastructure. In such settings, soft interventions may be more readily feasible than CDSS-based approaches that require specific digital and regulatory prerequisites. In addition, the interventions under study may not be readily transferable to other settings in the countries of interest. For example, in the German healthcare system, there are various electronic health data management systems and providers of practice management software, each with varying levels of interoperability. The context in which interventions are deployed could produce different outcomes based on the technical and regulatory landscape. Regulatory efforts are necessary in order to incorporate customised solutions across multiple institutions simultaneously. Fifth, the literature search was restricted to English and German publications. While no German-language studies were identified and several included studies originated from non-English-speaking countries, the exclusion of other languages may have resulted in the omission of a small number of relevant studies.
Regarding publication bias, research on TFT reduction is predominantly publicly funded, with fewer incentives for researchers to withhold information in comparison to pharmaceutical trials and related reviews. Any intervention aiming to reduce low-value TFT tests is likely to lead to positive results compared to usual care. Meanwhile, significant outcomes (31 interventions, 57%) were not reported in excessive frequency in relation to non-significant outcomes or outcomes with no reported level of significance. Thus, we concluded that publication bias is not a major concern. However, the predominance of non-RCTs led to a generally low CoE, limiting generalisability. Most controlled studies had a serious or critical RoB due to potential confounding, while observational studies were categorically classified as having a critical RoB. Last, we cannot rule out the possibility that our findings are influenced by selective reporting of outcomes. However, similar to the issue of potential publication bias, selective reporting is unlikely to pose a significant problem, as the body of evidence includes enough compelling and consistently positive results. The identified limitations coincide with those found in similar reviews and research on related low-value care topics [21, 96, 101]. Thus, the implications for policy should be interpreted with the appropriate caution, taking into account the GRADE results (Additional file 7).
Conclusion
The evidence on de-implementation strategies for TFT ordering suggests that behaviour change interventions have the potential to significantly reduce excessive thyroid function testing. Particularly CDSS appear to be associated with promising results, though most studies are of high or critical risk of bias. If these findings hold in more rigorous trials, this would strengthen the evidence base for feasible workflow modifications to improve care. Policy and practice could then consider implementing controlled TFT reduction as a means to enhance appropriateness and possibly reduce costs. Continued research on cost-effectiveness will be essential to inform large-scale implementation.
Future research should focus on developing well-designed interventions based on a solid theoretical foundation and higher methodological rigour. In particular, RCTs and the use of hybrid effectiveness-implementation designs would allow more reliable evaluation and applicability. By systematically reporting contextual factors and mechanisms of change, future studies can strengthen the evidence base and support the replication of potentially effective de-implementation strategies across settings. In the short term, standardised cluster RCTs across diverse contexts could test the effectiveness of CDSS, while medium-term studies should assess the sustainability of their effects. In the longer term, theory-based realist syntheses may help clarify what works, for whom, and why. This review can help inform the design of such interventions by identifying which specific interventions or components may be associated with greater effectiveness. Ongoing evaluation of these studies can identify the mechanisms of change and facilitate the replication of successful de-implementation interventions across various settings.
Supplementary Information
Additional file 1. Additional file 1 includes the AMSTAR assessment of the review by Zhelev et al [34].Additional file 2. Additional file 2 includes the ROBIS assessment of the review by Zhelev et al. [34].Additional file 3. Additional file 3 includes the PRISMA 2020 Checklist.Additional file 4. Additional file 4 includes changes made to the information provided at registration.Additional file 5. Additional file 5 includes the search strategies in Embase, Medline, Scopus, Cochrane, and Google Scholar.Additional file 6. Additional file 6 includes the list of data items extracted in the review.Additional file 7. Additional file 7 includes the full GRADE assessment of the interventions.Additional file 8. Additional file 8 includes the information on all articles excluded after full-text screening, including reason for exclusion.Additional file 9. Additional file 9 includes additional information on study characteristics, i.e. reported outcomes, reporting on theoretical foundations, and study period.Additional file 10. Additional file 10 includes the visualisation of the Risk of Bias assessment for the (cluster) RCTs and controlled studies.Additional file 11. Additional file 11 includes additional information on study results, i.e. the outcome values (pre/postintervention), notes on outcome measures and statistical indicators (confidence interval, p-value, relative reduction, difference in means).Additional file 12. Additional file 12 includes additional information on study characteristics, i.e. funding and reported conflict of interest.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Premawardhana LD. Thyroid testing in acutely ill patients may be an expensive distraction. Biochemia medica. 2017;27(2):300–7.10.11613/BM.2017.033PMC 549317028694722 · doi ↗ · pubmed ↗
- 2Kien C, Daxenbichler J, Titscher V, Baenziger J, Klingenstein P, Naef R et al. Effectiveness of de-implementation of low-value healthcare practices: an overview of systematic reviews. Implement sci: IS. 2024;19(1):56.10.1186/s 13012-024-01384-6PMC 1129941639103927 · doi ↗ · pubmed ↗
- 3Schübel J, Voigt K, Uebel T. Elevated TSH values in primary care: DEGAM-Guideline. AWMF-Register-Nr. 053–046; 2023 [cited 2024 Jul 24]. Available from: https://register.awmf.org/assets/guidelines/053_D_Ges_fuer_Allgemeinmedizin_und_Familienmedizin/053-046k-eng_S 2k_Erhoehter-TSH-Wert-in-der-Hausarztpraxis_2023-07.pdf.
- 4Cure C. Screening for thyroid dysfunction: do not routinely order TSH in all patients: Canadian Task Force on Preventive Health Care; 2024 [cited 2024 Jul 24]. Available from: URL: https://canadiantaskforce.ca/screening-for-thyroid-dysfunction-do-not-routinely-order-tsh-in-all-patients/.
- 5Canadian Society of Endocrinology and Metabolism. Five things patients and physicians should question: endocrinology and metabolism - Choosing wisely Canada; 2020 [cited 2024 Jul 24]. Available from: URL: https://choosingwiselycanada.org/recommendation/endocrinology-and-metabolism/.
- 6Choosing Wisely Australia. The Endocrine Society of Australia: recommendations; 2024 [cited 2024 Jul 1]. Available from: URL: https://www.choosingwisely.org.au/recommendations/esa 5.
- 7Canadian Society of Endocrinology and Metabolism. CSEM review and response: thyroid testing and management; 2024 [cited 2024 Jul 24]. Available from: URL: https://www.endo-metab.ca/cpgs-qi/thyroid-testing.
- 8Baranek H, Lee J. Less is more with T and T 4: Choosing Wisely Canada; 2018 [cited 2024 Jul 24]. Available from: URL: https://choosingwiselycanada.org/less-t 3-t 4/.
