Concordance in assessments between investigators and blinded independent central review (BICR) in hematology oncology clinical trials: a meta-analysis

Xiaoyu Tang; Yang Dang; Siying Han; Bohan Cui; Yi Kang; Xiaoyu Luo; Hui Zhang

PMC · DOI:10.1093/oncolo/oyaf375·November 9, 2025


Open Access

TL;DR

This study finds that in hematology cancer trials, investigators and blinded reviewers agree closely on results, suggesting BICR may not be as necessary as in other cancer types.

Contribution

The study is the first meta-analysis to evaluate BICR-investigator concordance specifically in hematology trials.

Findings

01

Pooled hazard ratio ratio for PFS was 0.96, indicating strong agreement between investigators and BICR.

02

Pooled odds ratio ratio for ORR was 0.99, showing minimal difference in response rate assessments.

03

Subgroup analyses across cancer types and trial sizes consistently showed high concordance.

Abstract

Blinded independent central review (BICR) mitigates assessment bias in oncology trials but imposes significant operational burdens. Its value in hematologic malignancies—where multimodal response criteria reduce reliance on subjective imaging assessments compared to solid tumors—remains unestablished. This meta-analysis evaluates BICR-investigator concordance specifically in hematology trials. We systematically identified Phase II/III hematology trials (2014-2024) reporting progression-free survival (PFS) and/or objective response rate (ORR) assessments by both investigators and BICR from PubMed. Agreement was quantified using Pearson/Spearman correlation, pooled hazard ratio ratio (HRR, HRINV/HRBICR) for PFS, and odds ratio ratio for ORR (OddsRR, ORINV/ORBICR). We also analyzed the odds ratio for ORR for single arms (OddsINV/OddsBICR). Subgroup analyses assessed the impact of masking,…

Figures (2)

HR estimates and 95% confidence intervals by investigator and BICR assessments. Note: The 2 points without error bars have reported significant p-values but no confidence intervals.

Tables (3)

Table 1. Descriptive summaries of characteristics.

Characteristics	PFS comparisons (n = 37)	ORR comparisons (n = 23)	Single-arm comparisons (n = 29)
Masking
Open-label	29 (78.4%)	21 (91.3%)	29 (100%)
Blinded	8 (21.6%)	2 (8.7%)	0
Phase
Phase II	2 (5.4%)	2 (8.7%)	27 (93.1%)
Phase III	35 (94.6%)	21 (91.3%)	2 (6.9%)
Sample size
≤250	6 (16.2%)	5 (21.7%)	29 (100%)
250-350	9 (24.3%)	3 (13.0%)	0
350-450	9 (24.3%)	8 (34.8%)	0
>450	13 (35.1%)	7 (30.4%)	0
Cancer type
Lymphoma
cHL	2 (5.4%)	0	5 (17.2%)
iNHL	1 (2.7%)	1 (4.3%)	1 (3.4%)
FL	1 (2.7%)	2 (8.7%)	5 (17.2%)
MCL	2 (5.4%)	2 (8.7%)	1 (3.4%)
CTCL	1 (2.7%)	1 (4.3%)	0
PTCL	2 (5.4%)	0	4 (13.8%)
DLBCL	3 (8.1%)	1 (4.3%)	2 (6.9%)
ENKTL	0	0	1 (3.4%)
LBCL	0	0	2 (6.9%)
WM	1 (2.7%)	0	0
MZL	0	0	2 (6.9%)
TFHL	1 (2.7%)	0	0
FL/MZL	1 (2.7%)	1 (4.3%)	0
DLBCL/FL	1 (2.7%)	0	0
Leukemia
CLL	9 (24.3%)	7 (30.4%)	0
ALL	0	1 (4.3%)	0
Myeloma
MM	6 (16.2%)	3 (13.0%)	3 (10.3%)
Leukemia/Lymphoma
CLL/SLL	5 (13.5%)	3 (13.0%)	2 (6.9%)
T-cell Leu/Lym	0	0	1 (3.4%)
Myeloid
MDS	1 (2.7%)	1 (4.3%)	0

Table 2. Agreement assessment of PFS between BICR and investigators.

Characteristics		Correlation r (95%CI)		HRR (95%CI)
Characteristics		n	Value	n	Value
Overall		37	0.97* (0.91, 0.99)	35	0.96 (0.89, 1.03)
Masking
	Open label	29	0.96* (0.87, 0.99)	27	0.95 (0.87, 1.03)
	Blinded	8	0.99 (0.97, 1.00)	8	1.01 (0.87, 1.18)
Cancer Types
	Imaging-required for all patients^a	30	0.97* (0.89, 0.99)	29	0.95 (0.87, 1.03)
	Imaging-required for partial patients^b	7	0.98 (0.89, 1.00)	6	1.00 (0.86, 1.16)
Sample Size
	Sample size ≤ 350	15	0.95 (0.86, 0.98)	13	0.95 (0.81, 1.11)
	Sample size > 350	22	0.96* (0.78, 0.99)	22	0.96 (0.89, 1.05)

Table 3. Agreement assessment of ORR between BICR and investigators.

Characteristics		Correlation r (95%CI)		OddsRR (95%CI)
Characteristics		n	Value	n	Value
Overall		23	0.92 (0.83, 0.97)	23	0.99 (0.85, 1.14)
Masking
	Open label	21	0.92 (0.81, 0.97)	21	1.01 (0.87, 1.18)
	Blinded	2	NA	2	NA
Cancer Types
	Imaging-required for all patients ^a	18	0.90 (0.76, 0.96)	18	0.99 (0.84, 1.18)
	Imaging-required for partial patients ^b	5	0.98 (0.72, 1.00)	5	0.97 (0.73, 1.30)
Sample Size
	Sample size ≤ 350	8	0.91 (0.55, 0.98)	8	0.91 (0.67, 1.24)
	Sample size > 350	15	0.94 (0.82, 0.98)	15	1.01 (0.86, 1.20)

Funding (2)

Research Development Fund
Xi’an Jiaotong-Liverpool University (funder ID 10.13039/501100006683)

Keywords

hematology; blinded independent central review; progression free survival; objective response rates; clinical trials; tumor assessments

Taxonomy

Topics: Meta-analysis and systematic reviews · Statistical Methods in Clinical Trials · Ethics in Clinical Research

Full text

Introduction

The evaluation of treatment response in oncology clinical trials relies on standardized criteria to ensure objective assessment of therapeutic efficacy. Blinded independent central review (BICR), a process wherein experts masked to treatment assignment evaluate radiographic or clinical data, was developed to mitigate potential bias in investigator-based assessments, particularly in open-label trials where treatment knowledge may influence progression determinations.1,2 Regulatory agencies frequently mandate BICR for pivotal trials to support endpoint reliability,3,4 though its added value remains contentious given substantial resource demands and operational complexities.5–7

Recent meta-analyses have systematically evaluated BICR-investigator concordance patterns, primarily in solid tumors or mixed cohorts including both solid tumors and hematologic malignancies. Foundational work by Amit et al. demonstrated negligible difference in PFS assessments across 36 trials.8 This high concordance was confirmed by Russo et al. in 28 Phase III trials,4 where no significant hazard ratio differences emerged, and further supported by Jacobs et al. in metastatic breast cancer.9 While Zhang et al. observed no systematic bias in Phase III trials, they noted statistically discordant inferences in a subset of assessments.10 Even though D’Ambrosio et al. showed systematic overestimation of PFS by investigators in immunotherapy trials and Lian et al. observed this in open-label trials for the mixed cohort, the discrepancies were numerically small.5,11 Zettler et al. found no evidence of evaluation bias in the assessment of ORR among pivotal trials supporting recent FDA approvals of anticancer agents for solid tumor indications.12

In solid tumors, response assessments rely exclusively on predefined quantitative imaging criteria such as Response Evaluation Criteria in Solid Tumors (RECIST v1.1).13 These evaluations require subjective interpretation during lesion selection, measurement, and identification of new lesions, which introduces inherent variability and potential assessment bias.5 Unlike solid tumors, hematologic malignancies employ multimodal assessment frameworks with reduced reliance on imaging for some indications, potentially reducing evaluation bias. For example, lymphoma follows the Lugano criteria, which combine PET-CT imaging and histopathology,14 whereas multiple myeloma applies IMWG standards under which most response assessments derive from serum/urine parameters, especially in patients with no measurable extramedullary disease (EMD) at baseline.15 This reliance on non-imaging data sources, particularly for acute leukemias and myeloma where clinical and laboratory parameters dominate, substantially reduces the subjectivity inherent in radiographic interpretation. Despite this potential bias reduction, hematology trials are predominantly open-label due to distinctive toxicity profiles and special administration routes. In such designs, knowledge of treatment assignment may introduce investigator bias.

To our knowledge, the concordance has not been systematically evaluated for hematologic malignancies, but only in individual trial comparisons.16 To address this gap, we conducted the first systematic evaluation of BICR-investigator concordance across Phase II/III hematology trials. We quantified agreement for progression-free survival (PFS) and objective response rate (ORR)—common primary endpoints in Phase II/III oncology trials17—assessing both treatment effect estimates and statistical inferences to evaluate BICR’s added value in hematologic malignancies. The protocol was registered in the PROSPERO database (ID CRD420251104087).

Methods

Search strategy and selection criteria

Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidance, we searched PubMed for publications of phase II and phase III hematology clinical trials from January 1, 2014, to November 27, 2024, using both Medical Subject Headings (MeSH) terms and free-text words (Text S1). Eligible studies included phase II or III hematology clinical trials that directly evaluated therapeutic efficacy of anticancer treatments for hematologic malignancies, with tumor response and progression assessments conducted by both investigators and BICR, and summary statistics from both assessors reported (eg, hazard ratios for PFS, proportions/cases of overall responses). While single-arm trials (SATs) and randomized controlled trials (RCTs) have distinct objectives, SATs were included in our study due to their established role in supporting regulatory accelerated approvals in hematologic malignancies, particularly in relapsed/refractory settings.17,18 Similarly, ORR was included because it is a common primary endpoint in SATs and may also be used as an early endpoint in RCTs to support accelerated approval.17,18 We excluded subgroup analyses, follow-up analyses, and studies with fewer than 10 participants. In addition, some multi-arm RCTs compared multiple treatments (A and B) against a common control (C); we considered A vs. C and B vs. C as independent comparisons. Three authors (BC, XL, and YK) independently screened titles/abstracts to assess eligibility. Full-text articles were obtained for studies meeting initial inclusion criteria. Final inclusion was determined through full-text review against eligibility criteria. Disagreements were resolved through discussion with the other team members (XT, DY, SH, YK, and HZ).

Data extraction

Data extraction was conducted independently by 3 authors (BC, XL, and YK) using a standardized form and subsequently validated by 2 additional authors (SH and XT). Any discrepancies were resolved through team consensus (XT, DY, and HZ). For each included study, the following variables were extracted: study identifiers (title, first author, publication year, National Clinical Trial [NCT] number); trial characteristics (phase, masking [open-label vs. double-blind], total sample size, sample size per treatment group [in randomized trials], cancer type, medical therapy setting, response assessment criteria, primary endpoint[s], names of treatments); and summary statistics for outcomes data, including HRs for PFSs with 95% confidence intervals (CIs) as assessed by both investigators and BICR, ORR with 95% CIs, and the number of responses (for both treatment and control groups in randomized trials, and for the single group in single-arm trials). For trials where PFS was the primary endpoint (assessed by investigators or BICR), the corresponding HR P-values and statistical significance were also extracted. Study characteristics are detailed in Table S1.

Statistical analysis

PFS analysis

The correlation between the logarithm of the hazard ratio for PFS assessed by investigators, $\log(\mathrm{HR}_{\mathrm{INV}})$, and the logarithm of the hazard ratio assessed by BICR, $\log(\mathrm{HR}_{\mathrm{BICR}})$, was evaluated using Pearson’s correlation coefficient when normality assumptions were met based on the Shapiro-Wilk test19; otherwise, Spearman’s correlation was employed. Subsequently, we performed a meta-analysis using a fixed-effects model with inverse-variance weighting to estimate the pooled hazard ratio ratio (HRR), defined as:

$$\mathrm{HRR} = \frac{\mathrm{HR}_{\mathrm{INV}}}{\mathrm{HR}_{\mathrm{BICR}}}$$

An HRR < 1 indicates that $\mathrm{HR}_{\mathrm{INV}}$ is smaller than $\mathrm{HR}_{\mathrm{BICR}}$, suggesting a more favorable outcome for the experimental group compared to the control group based on investigator assessments. The meta-analysis was conducted on $\log(\mathrm{HRR})$, and the corresponding standard error was derived from the standard errors of $\log(\mathrm{HR}_{\mathrm{INV}})$ and $\log(\mathrm{HR}_{\mathrm{BICR}})$.20 If significant statistical heterogeneity was found (the P value for Cochran’s Q test was less than 0.05 or the $I^2$ statistic was over 50%), a random-effects model would be adopted to take the heterogeneity into account. Subgroup analyses were conducted based on masking status (open-label vs. blinded), sample size, and cancer type. For cancer type, subgroups comprised indications requiring imaging for response assessment (eg, lymphoma and chronic lymphocytic leukemia) and indications not primarily reliant on imaging (eg, acute leukemia and multiple myeloma). Correlations were not calculated if the number of studies was less than 5, and meta-analyses were not conducted if the number of studies was less than 3.
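As a rough illustration of the pooling step described above, the following Python sketch recovers standard errors from reported 95% CIs, forms log(HRR) for each comparison, and pools the values with inverse-variance fixed-effects weights, also reporting Cochran's Q and I². It is not the authors' code: the function names and input values are hypothetical, and the standard error of log(HRR) is computed here under an independence assumption (sum of the two variances). Because investigators and BICR assess the same patients, the two estimates are positively correlated, so this assumption likely overstates the variance; the paper derives its standard error following its reference 20, which is not reproduced here.

```python
import math

Z95 = 1.959963984540054  # two-sided 95% standard normal quantile


def se_from_ci(lower, upper):
    """Standard error of log(HR) recovered from a reported 95% CI."""
    return (math.log(upper) - math.log(lower)) / (2 * Z95)


def log_hrr(hr_inv, ci_inv, hr_bicr, ci_bicr):
    """Return (log HRR, SE) for one comparison.

    ASSUMPTION: the two log(HR) estimates are treated as independent,
    so their variances simply add. This is an illustrative shortcut.
    """
    est = math.log(hr_inv) - math.log(hr_bicr)
    se = math.sqrt(se_from_ci(*ci_inv) ** 2 + se_from_ci(*ci_bicr) ** 2)
    return est, se


def pool_fixed(estimates):
    """Inverse-variance fixed-effects pooling of (estimate, SE) pairs.

    Returns the pooled estimate, its SE, Cochran's Q, and I^2 in percent.
    """
    weights = [1 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * y for w, (y, _) in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1 / total)
    q = sum(w * (y - pooled) ** 2 for w, (y, _) in zip(weights, estimates))
    df = len(estimates) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, pooled_se, q, i2


# Hypothetical comparisons: (HR_INV, CI_INV, HR_BICR, CI_BICR)
trials = [
    (0.52, (0.40, 0.68), 0.55, (0.42, 0.72)),
    (0.71, (0.58, 0.87), 0.70, (0.57, 0.86)),
    (0.35, (0.24, 0.51), 0.38, (0.26, 0.55)),
]

pooled, se, q, i2 = pool_fixed([log_hrr(*t) for t in trials])
low, high = math.exp(pooled - Z95 * se), math.exp(pooled + Z95 * se)
print(f"Pooled HRR {math.exp(pooled):.2f} (95% CI {low:.2f}, {high:.2f}); "
      f"Q = {q:.2f}, I2 = {i2:.0f}%")
```

Pooling on the log scale and exponentiating at the end keeps the confidence interval symmetric around log(HRR), which is why the sketch only converts back to the ratio scale when printing.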

Additionally, for trials where PFS (assessed by either investigators or BICR) was the primary endpoint and the treatment comparison was formally tested against a pre-defined $\alpha$ level, we evaluated the agreement in statistical significance between investigator and BICR assessments. The consistency in statistical inference (significant vs. non-significant) was quantified using Cohen’s kappa coefficient.

ORR analysis

To account for the different trial designs, separate analytical approaches were used for RCTs and SATs. For 2-arm trials, we analyzed the correlation between the logarithm of the odds ratio for objective response assessed by investigators, $\log(\mathrm{OR}_{\mathrm{INV}})$, and the logarithm of the odds ratio assessed by BICR, $\log(\mathrm{OR}_{\mathrm{BICR}})$. We performed a meta-analysis to estimate the pooled odds ratio ratio (OddsRR), defined as:

$$\mathrm{OddsRR} = \frac{\mathrm{OR}_{\mathrm{INV}}}{\mathrm{OR}_{\mathrm{BICR}}}$$

where the odds of response equal ORR/(1-ORR). Hence, an $\mathrm{OddsRR}$ > 1 indicates that $\mathrm{OR}_{\mathrm{INV}}$ is larger than $\mathrm{OR}_{\mathrm{BICR}}$, suggesting a more favorable outcome for the experimental group compared to the control group based on investigator assessment. The methods for conducting correlation analyses, meta-analyses and subgroup analyses are analogous to those for PFS described in “PFS analysis.”

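For single-arm trials with ORR as an endpoint (where OddsRR is not estimable without a control group), we calculated the odds ratio of response between investigator and BICR assessments within the treatment group, which is defined as:

$$\mathrm{OR}_{\mathrm{trt}} = \frac{\mathrm{Odds}_{\mathrm{INV}}}{\mathrm{Odds}_{\mathrm{BICR}}}$$

We synthesized the logarithm of the treatment-group odds ratio, $\log(\mathrm{OR}_{\mathrm{trt}})$, in single-arm trials. In addition, we separately pooled $\log(\mathrm{OR}_{\mathrm{trt}})$ and its control-group counterpart $\log(\mathrm{OR}_{\mathrm{ctrl}})$ for 2-arm trials.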
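The ORR quantities defined in this section can be sketched in a few lines of Python. This is an illustrative example rather than the authors' implementation: the function names and response counts are hypothetical, and no continuity correction is applied, so an arm with a 0% or 100% response rate would need separate handling.

```python
def odds(responders, n):
    """Odds of response, ORR / (1 - ORR), from raw counts."""
    orr = responders / n
    return orr / (1 - orr)


def odds_ratio(resp_trt, n_trt, resp_ctrl, n_ctrl):
    """Treatment-vs-control odds ratio for one assessor."""
    return odds(resp_trt, n_trt) / odds(resp_ctrl, n_ctrl)


def odds_rr(inv, bicr):
    """OddsRR = OR_INV / OR_BICR.

    Each argument is a tuple (resp_trt, n_trt, resp_ctrl, n_ctrl).
    """
    return odds_ratio(*inv) / odds_ratio(*bicr)


def single_arm_or(resp_inv, resp_bicr, n):
    """Investigator-vs-BICR odds ratio within a single arm."""
    return odds(resp_inv, n) / odds(resp_bicr, n)


# Hypothetical 2-arm trial with 100 patients per arm.
# Investigators report 5 more responders than BICR in BOTH arms.
inv = (60, 100, 40, 100)
bicr = (55, 100, 35, 100)

print(f"OR_INV  = {odds_ratio(*inv):.3f}")
print(f"OR_BICR = {odds_ratio(*bicr):.3f}")
print(f"OddsRR  = {odds_rr(inv, bicr):.3f}")
print(f"Treatment-arm INV/BICR OR = {single_arm_or(60, 55, 100):.3f}")
print(f"Control-arm INV/BICR OR   = {single_arm_or(40, 35, 100):.3f}")
```

In this made-up example the within-arm odds ratios are both above 1, yet the OddsRR stays near 1 because a similar upward shift in both arms largely cancels in the treatment-effect comparison.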

Reporting bias and risk of bias

To assess reporting bias, we summarized the number and proportion of studies that stated they performed both investigator and BICR assessments but failed to report data from one of these assessments. In such cases, the absence of BICR or investigator-assessed data may be due to discordance between assessments, potentially introducing reporting bias.

The risk of bias for included randomized studies was assessed independently by 2 authors (BC and SH) using the Cochrane Risk of Bias tool version 2 (RoB-2).21 Discrepancies were resolved through discussion with other team members. Assessments were classified as “Low risk,” “High risk,” or “Some concerns” across the following domains: randomization process, deviations from intended interventions, missing outcome data, measurement of the outcome, and selection of the reported result.

Results

We identified 70 eligible studies in our systematic search (Figure 1, Table S1). Because one trial had 2 experimental arms and a control arm, we treated the comparison between each experimental arm and the control arm as an independent comparison. In addition, we treated one trial that reported results only for 2 subgroups as 2 independent comparisons. The summary statistics of eligible comparisons for the PFS and ORR analyses are reported in Table 1. We included 37 comparisons in the PFS analysis and 23 comparisons in the 2-arm ORR analysis. Most comparisons came from open-label studies (78% of PFS comparisons and 91% of ORR comparisons) and from Phase III clinical trials (95% of PFS comparisons and 91% of ORR comparisons). In addition, 29 studies were included in the single-arm ORR analysis.

Figure 1. Flow chart of study selection.

Agreements in HR for PFS

The correlations and meta-analysis results are summarized in Table 2. Among 37 PFS comparisons, we observed strong correlations between $\log(\mathrm{HR}_{\mathrm{INV}})$ and $\log(\mathrm{HR}_{\mathrm{BICR}})$ in both the overall analysis and all subgroups, with all point estimates at or above 0.95 and all lower confidence bounds at or above 0.78. A fixed-effects model was used for the meta-analyses. The pooled $\mathrm{HRR}$ was 0.96 (95% CI: 0.89, 1.03), which was close to 1 and indicated no statistically significant difference between $\mathrm{HR}_{\mathrm{INV}}$ and $\mathrm{HR}_{\mathrm{BICR}}$ (Figure S1). In subgroup analyses, although open-label trials showed a pooled $\mathrm{HRR}$ slightly below 1 and blinded trials showed a point estimate slightly above 1, neither reached statistical significance (Figure S2). In indications where imaging is required for only some patients (those with EMD), the subgroup analysis yielded an HRR point estimate of 1.00 (95% CI: 0.86, 1.16). The HRR estimate in indications where imaging is required for all patients was 0.95 (95% CI: 0.87, 1.03). Both estimates approximated 1 and were non-significant (Figure S3). Similarly, $\mathrm{HRR}$ estimates were comparable between trials with sample sizes $\leq 350$ versus those with $> 350$ patients, with both estimates being non-significant (Figure S4).

Among 33 comparisons with PFS as the primary endpoint, investigator and BICR assessments showed perfect concordance in statistical significance determinations: 26 comparisons were statistically significant by both assessments and 7 were non-significant by both assessments (Table S2, Figure 2), resulting in a Cohen’s kappa coefficient of 1. While no discordance in statistical significance was observed, small differences in point estimates could theoretically alter trial conclusions when the treatment effect is marginal (ie, HR close to 1) or the sample size is small (where the CI might be wide due to large variation). This could occur if such differences shift the upper confidence limit sufficiently to lead one assessment method’s CI to cross the significance threshold while the other’s does not.
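The kappa value reported above can be checked directly from the 2×2 agreement counts given in this paragraph (26 significant by both assessments, 7 non-significant by both, no discordant pairs). The Python sketch below is a minimal illustration with a hypothetical function name; the second call uses made-up counts to show how a single discordant comparison would lower kappa.

```python
def cohens_kappa(both_sig, inv_only, bicr_only, both_nonsig):
    """Cohen's kappa for two raters and a binary outcome.

    Arguments are the four cells of the 2x2 agreement table:
    significant by both, by investigators only, by BICR only, by neither.
    """
    n = both_sig + inv_only + bicr_only + both_nonsig
    p_obs = (both_sig + both_nonsig) / n
    p_inv_sig = (both_sig + inv_only) / n
    p_bicr_sig = (both_sig + bicr_only) / n
    # Agreement expected by chance from the marginal proportions.
    p_exp = p_inv_sig * p_bicr_sig + (1 - p_inv_sig) * (1 - p_bicr_sig)
    if p_exp == 1:
        raise ValueError("kappa is undefined when every rating is identical")
    return (p_obs - p_exp) / (1 - p_exp)


# Counts reported in the text: 26 concordant significant, 7 concordant non-significant.
print(cohens_kappa(26, 0, 0, 7))

# Hypothetical counts: one comparison significant by investigators only.
print(round(cohens_kappa(25, 1, 0, 7), 3))
```

Because kappa corrects for chance agreement, it depends on the marginal split between significant and non-significant results, not just on the raw proportion of concordant comparisons.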

Figure 2. HR estimates and 95% confidence intervals by investigator and BICR assessments. Note: The 2 points without error bars have reported significant p-values but no confidence intervals.

Agreements in ORR

Table 3 presents the meta-analysis results and correlations for the $\mathrm{OddsRR}$ analysis. Among 23 ORR comparisons, strong correlations between $\log(\mathrm{OR}_{\mathrm{INV}})$ and $\log(\mathrm{OR}_{\mathrm{BICR}})$ were observed in both the overall analysis and all subgroups. However, in the subgroup with a sample size $\leq$ 350, the confidence interval for the correlation was wide due to the small number of studies available. Among 23 two-arm trials, both the experimental group and the control group demonstrated significantly higher response rates when assessed by investigators versus BICR (investigator-versus-BICR odds ratio $\mathrm{OR}_{\mathrm{trt}}$ = 1.23, 95% CI: 1.02, 1.48; $\mathrm{OR}_{\mathrm{ctrl}}$ = 1.30, 95% CI: 1.09, 1.55) (Table S3, Figures S5 and S6). However, the pooled $\mathrm{OddsRR}$ was 0.99 (95% CI: 0.85, 1.14), which was close to 1 and indicated no statistically significant difference between assessors for the treatment effect estimates (Figure S7). The combined $\mathrm{OddsRR}$ estimates in all other subgroups were also close to 1 and not significant (Figures S8-S10). Among 29 single-arm trials, the pooled $\mathrm{OR}_{\mathrm{trt}}$ was 1.02 (95% CI: 0.90, 1.17), indicating a minimal and non-significant difference between odds of response assessed by investigators and BICR (Figure S11).

Assessments of bias

In our analysis, 6 of 76 trials (7.9%) were excluded due to missing BICR or investigator assessment data, despite reporting that both evaluations were conducted. Given this small proportion, the exclusions are unlikely to have a substantial impact on the results.

A total of 42 randomized studies were included. Using the RoB-2 tool for risk of bias assessment, we found low risk of bias in 15 (36%) studies and some concerns in 27 (64%) studies; none were rated high risk (Table S4, Figure S12). Most concerns arose in open-label studies, where knowledge of treatment assignment could introduce bias in outcome assessments.

Discussion

This meta-analysis represents the first systematic evaluation of concordance between investigator and BICR assessments in hematology oncology clinical trials. Our findings demonstrated high agreement in PFS and ORR assessments, with negligible differences in hazard ratio ratios (HRR) for PFS and OddsRR for ORR. Specifically, the pooled HRR of 0.96 (95% CI: 0.89, 1.03) indicates minimal systematic bias in PFS evaluations, and the perfect agreement in statistical inferences (Cohen’s kappa = 1) further supports the reliability of investigator assessments. Despite perfect agreement in statistical significance, the small differences in point estimates we observed could theoretically lead to different trial conclusions for treatments with a marginal treatment effect or in small studies with large variation. For ORR in 2-arm trials, although investigators tended to report more responses than BICR in both the treatment and control arms, the impact on treatment effect estimates was negligible, as shown by the pooled OddsRR of 0.99 (95% CI: 0.85, 1.14). In addition, the analysis of single-arm trials did not reveal the same tendency toward more favorable investigator assessments: the pooled investigator-versus-BICR odds ratio of 1.02 (95% CI: 0.90, 1.17) confirms the strong concordance.

Unlike solid tumors, which are assessed solely on imaging (eg, RECIST criteria),13 hematologic malignancies employ multimodal frameworks integrating laboratory parameters, histopathology, and clinical findings. Moreover, when confirming progressive disease, laboratory parameters or histopathology results often deteriorate before EMD progression, as observed in multiple myeloma or acute leukemias. Therefore, progressive disease is typically assessed through objective parameters rather than imaging. This feature likely reduces subjectivity in disease assessments compared to those relying solely on imaging evaluation, which may explain the high BICR-investigator concordance observed in hematologic malignancies. The study results support this explanation, showing negligible discrepancies in both PFS and ORR assessments. Furthermore, although disease assessments in hematologic malignancies all rely on multimodal frameworks, the degree of dependency on imaging differs. The estimates for indications where imaging is required for only some patients suggest that bias may be even smaller than in indications where imaging is required for all patients. This may further indicate that concern about subjective bias diminishes as dependency on imaging evaluation decreases. Although most hematology trials were open-label, investigator and BICR assessments were highly consistent, suggesting that standardized response criteria mitigate bias even when treatment assignments are known. In addition, the agreement was maintained across different sample sizes.

The high concordance demonstrated in this meta-analysis challenges the necessity of BICR for all patients in hematology oncology trials. The conventional rationale for BICR rests on its high reliability, achieved through blinding to mitigate assessment bias. However, this specific strength is counterbalanced by significant limitations: informative censoring occurs when disease assessments cease or new anti-cancer therapies are initiated following investigator-reported progression, potentially skewing BICR-assessed survival estimates.1,8,22–25 Furthermore, in some cases, BICR evaluations lack the comprehensive clinical context available to investigators.5 Investigators typically account for clinical factors and conduct comprehensive assessments. For instance, when imaging findings are equivocal but a patient demonstrates clinically stable or improved status relative to baseline, investigators are more likely to assess stable disease; conversely, clinical deterioration may justify progressive disease despite ambiguous imaging. In such cases, this integrated approach provides a more accurate reflection of treatment efficacy, whereas BICR lacks access to patients’ clinical information. This clinical context is often essential for accurate assessment, meaning that investigator evaluations can offer a superior level of clinical validity. In addition, our findings demonstrate that the reliability of investigator assessments in this field is higher than traditionally assumed, as evidenced by the minimal discrepancy in treatment effect estimates between review methods. Consequently, the marginal benefit of BICR's reliability is offset by its operational burdens and methodological constraints, while the high validity and demonstrated concordance of investigator assessments support their use as the primary source for endpoint validation.

Hence, we recommend a more risk-based and resource-efficient approach for central review. For most hematology trials, particularly those using well-established and multimodal response criteria, BICR could be omitted without compromising validity. However, for high-risk scenarios where the potential for assessment bias or unconventional response patterns increases, such as trials of novel therapeutic classes with the risk of atypical response patterns (eg, immunotherapy with potential for pseudo-progression), trials where the primary endpoint remains highly subjective, or trials in controversial or high-stakes therapeutic areas where regulatory scrutiny is anticipated to be intense, a random sample-based BICR auditing approach could be used for clinical trial quality control, as indicated by FDA guidance.17 Implementation needs a prespecified auditing plan detailing strategies for detecting potential assessment bias and mitigation plans, a process requiring discussions among health authority, sponsors, and investigators. A full BICR may be reserved to provide a supplementary analysis of the treatment effect if a significant discrepancy rate is found in the audit. The net effect of this change is a substantial overall reduction in resource expenditure across the clinical trial portfolio.

While similar conclusions were drawn from previous solid tumor meta-analyses, the practice of BICR has persisted, likely due to regulatory conservatism and sponsor risk aversion. However, the rationale for BICR is fundamentally weaker in hematologic malignancies due to the reduced reliance on subjective imaging. Moreover, unlike in solid tumors, the operational infrastructure for BICR is less mature in hematologic malignancies. Our study provides the first comprehensive and disease-specific evidence to justify a re-evaluation of this practice within hematology oncology, providing a foundation for refining clinical trial standards within this specialty.

Our study has several limitations. First, subgroup analyses—particularly for ORR comparisons in blinded studies and indications depending on imaging for partial patients—were constrained by small sample sizes. This limited our ability to robustly explore assessment patterns across different masking methodologies or specific hematologic malignancies. The limited number of blinded studies is due to distinctive drug toxicities and administration routes. Future meta-analyses should prioritize expanding cancer types to clarify disease-specific patterns. Furthermore, the independence of central review may be compromised in trials where BICR only confirms investigator-assessed progression, potentially inflating concordance. Finally, our analysis focused on aggregate trial-level treatment effects because the lack of individual patient data precluded evaluation of concordance at the patient level, such as the exact timing of progression or response status for each patient. The individual-level data can provide deeper insights into the specific causes of the small differences we observed at the trial level.

In conclusion, BICR offers limited added value in hematology oncology trials given the high concordance with investigator assessments. To optimize resource allocation, we recommend conducting central review for pivotal trials in high-risk contexts as a quality control tool or supplementary analysis, storing baseline and progression images alongside clinical data to enable ad hoc review, and optimizing investigator training and response criteria standardization to further minimize assessment discrepancies. Future efforts should refine these strategies to accelerate therapeutic development while maintaining rigorous endpoint validity.

Supplementary Material

oyaf375_Supplementary_Data

Bibliography (25)

1. Dodd LE, Korn EL, Freidlin B, et al. Blinded independent central review of progression-free survival in phase III clinical trials: important design element or unnecessary expense? J Clin Oncol. 2008;26:3791-3796. doi:10.1200/JCO.2008.16.1711. PMID: 18669467. PMCID: PMC2654812.
2. Zhang JJ, Chen H, He K, et al. Evaluation of blinded independent central review of tumor progression in oncology clinical trials: a meta-analysis. Ther Innov Regul Sci. 2013;47:167-174. doi:10.1177/0092861512459733. PMID: 30227523.
3. U.S. Department of Health and Human Services, Food and Drug Administration. Clinical Trial Imaging Endpoint Process Standards. Guidance for Industry; 2018.
4. European Medicines Agency. Guideline on the Evaluation of Anticancer Medicinal Products. 2017.
5. Lian Q, Fredrickson J, Boudier K, et al. Meta-analysis of 49 Roche oncology trials comparing blinded independent central review (BICR) and local evaluation to assess the value of BICR. Oncologist. 2024;29:e1073-e1081. doi:10.1093/oncolo/oyad012. PMID: 36905580. PMCID: PMC11299942.
6. Stone AM, Bushnell W, Denne J, et al. Research outcomes and recommendations for the assessment of progression in cancer clinical trials from a PhRMA working group. Eur J Cancer. 2011;47:1763-1771. doi:10.1016/j.ejca.2011.02.011. PMID: 21435858.
7. Pignatti F, Hemmings R, Jonsson B. Is it time to abandon complete blinded independent central radiological evaluation of progression in registration trials? Eur J Cancer. 2011;47:1759-1762. doi:10.1016/j.ejca.2011.05.009. PMID: 21641204.
8. Amit O, Mannino F, Stone AM, et al. Blinded independent central review of progression in cancer clinical trials: results from a meta-analysis. Eur J Cancer. 2011;47:1772-1778. doi:10.1016/j.ejca.2011.02.013. PMID: 21429737.