Directing stem cell differentiation by chromatin state approximation

Luis F Montano-Gutierrez; Sophie Müller; Ana P Kutschat; Igor Adameyko; Davide Seruggia; Florian Halbritter

PMC · DOI:10.1093/nar/gkag124·February 25, 2026

Directing stem cell differentiation by chromatin state approximation

Luis F Montano-Gutierrez, Sophie Müller, Ana P Kutschat, Igor Adameyko, Davide Seruggia, Florian Halbritter

PDF

Open Access

TL;DR

This paper introduces a new method to guide stem cell differentiation by comparing chromatin states, improving the efficiency of generating specific cell types like erythroblasts.

Contribution

The novel approach uses chromatin state approximation to iteratively optimize differentiation protocols.

Findings

01

Greedy selection based on chromatin approximation improves erythroblast differentiation in vitro.

02

Analysis of chromatin regions reveals transcriptional regulators that hinder differentiation.

03

Data-driven ligand selection enhances erythropoiesis efficiency.

Abstract

A prime goal of regenerative medicine is to replace dysfunctional cells in the body. To design protocols for producing target cells in the laboratory, one may need to consider exponentially large combinations of culture components. Here, we investigated the potential of iteratively approximating the target phenotype by quantifying the distance between chromatin profiles (ATAC-seq) of differentiating cells in vitro and their in vivo counterparts. We tested this approach on the well-studied generation of erythroblasts from haematopoietic stem cells, evaluating a fixed number of components over two sequential differentiation rounds (8 × 8 protocols). We found that the most erythroblast-like cells upon the first round yielded the most erythroblast-like cells at the second round, suggesting that greedy selection by chromatin approximation can be a viable optimisation strategy. Furthermore,…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Figures5

Click any figure to enlarge with its caption.

An exhaustive comparison of erythroid differentiation protocols confirms validity of greedy prioritisation by chromatin similarity to approximate optimal solution. (a) Top: Blood cells originate from CD34-positive HSCs, which branch into lymphoid and myeloid lineages. Erythroblasts originate from common myeloid progenitors (CMP) and are marked by CD36, CD71, and CD235a expression. NK, natural killer; MPP, multipotent progenitor; LMPP, lymphoid-primed multipotent progenitors; CMP, common myeloid progenitor; mono, monocyte; gran, granulocyte; mega, megakaryocyte; GMP, granulocyte-monocyte progenitor; MEP, megakaryocyte-erythroid progenitor. Bottom: Two strategies to find protocols that approximate a target cell type. Exhaustive strategies evaluate all options but are costly and their complexity is exponential. We test whether a stepwise, greedy selection can result in optimal outcomes with lower experimental burden. (b) Sorted region heatmap (DeepTools) depicting the opening of erythroblast (Ery)-linked chromatin along differentiation in haematopoietic cells. The regions shown were selected by differential accessibility analysis in erythroblasts compared to all other samples in the reference (two-sided moderated t-tests, DESeq2; P-value ≤ 0.05). (c) UMAP of the cell-type chromatin reference using cell-type-informative PCA components shown in Supplementary Fig. S1. Each point represents one independent sample. Density hulls indicate all samples from in vitro protocols (both rounds) generated in this study using different erythroid differentiation protocols (red) and the erythroblast target cells (purple). (d) Ingredient combinations (eight total) tested for Round 1 of differentiation, as a guide for data plots in panels (e) and (f). Circle size indicates the multiplication factor of a base concentration of each ingredient added (see ‘Materials and methods’). Absence of a circle indicates 0. (e) Euclidean distance of all Round-1 samples to the centroid of the Ery samples (‘erythroid distance’) using the cell-type-informative PCs (Supplementary Fig. S1). Each square represents one ATAC-seq datasets (two technical replicates from two independent experiments each). Red colour indicates the treatment with samples closest to the Ery centroid, i.e. smallest distance. The distance values are available in Supplementary Table S1. (f) Fraction of CD36 and CD71 double-positive cells from all treatments in panel (c) assessed by flow cytometry (n = 2 independent experiments). Red colour indicates the sample with the highest overall number of double-positive cells. The flow cytometry results are available in Supplementary Table S2. (g) For an exhaustive evaluation of treatments, each sample from panel (c) (treatments 1–8) was subjected to a second round of the same eight treatment combinations (treatments A–H), resulting in 64 treatment combinations (1A, 1B, etc.). (h) Erythroid distance of Round-2 samples, analogous to panel (d) (n = 2 independent experiments). Red colour indicates the treatment closest to the Ery centroid. Black dashed lines indicate the average distance for all samples coming from the same Round-1 treatment. (i) Flow cytometry analysis of Round-2 samples, analogous to panel (e) (n = 2 independent experiments). Fill colour indicates the CD71/CD36 ratio, which differs even between samples with similar amount of CD71+ CD36+ cells.

Statistical mixed-effects modelling identifies the impact of treatment ingredients on erythroid differentiation. (a) We used a mixed-effects model to quantify the relative impact of treatments on the PCA distance to the Ery centroid (cp. Fig. 1e and h). Fixed effects are EPO units at both Round 1 and Round 2 of in vitro differentiation. Negative and positive coefficients (left) indicate the estimated effect on erythroid distance per unit of treatment, respectively. The parameters of the model are available in Supplementary Table S3. (b) Intercepts for variables included as random effects in the model described in panel (a): insulin plus heparin (INS + HEP) and hydrocortisone (HYDRO) at Round 1 and Round 2. (c) Boxplot showing the median and interquartile range (IQR) of the distance to the erythroblast centroid for samples of differentiation Round 2 (n = 119 samples), divided by whether they received INS + HEP during Round 1. Whiskers extend to 1.5*IQR. All P-values correspond to a two-sided Wilcoxon rank sum test adjusted for multiple hypothesis testing using the Benjamini–Hochberg method. (d) Boxplots as in panel (c), showing samples divided by INS + HEP at Round 2.

Effects of treatment ingredients on cell-type-specific chromatin regions. (a) Heatmap showing depth-normalised counts (row-scaled) of cell-type-specific chromatin regions. Cell-type-specific chromatin modules were defined by the AUC of the ROC to be strongly predictive of each cell type in the haematopoietic reference [31] (see Supplementary Fig. S7a). Ery, erythroblast; MEP, megakaryocyte-erythroid progenitor; CMP, common myeloid progenitor; LMPP, lymphoid-primed multipotent progenitor cells; MPP, multipotent progenitor; HSC, haematopoietic stem cell. Annotated peaks are provided in Supplementary Table S4. n(peaks) = 595 523, n(samples) = 78. (b) Heatmap analogous to panel (a), showing the accessibility of samples from in vitro differentiation generated in this study. The ingredient concentrations used for each sample are indicated on top. Selected blocks of coordinates chromatin changes have been highlighted manually with black boxes, as described in the text. n(samples) = 151. (c) Enriched TF binding motifs of each cell-type-specific chromatin module relative to all consensus peaks found in the reference dataset. Regions with >0 motif hit for a TF were counted for all regions in each module (circle size), compared to the background (fold change indicated by colour). The top five enriched TFs were selected for visualisation and all enrichment results (hypergeometric test) are listed in Supplementary Table S6. (d) Barplots of relative module scores of the cell identity modules in panel (a). Scores were calculated across all samples using single-sample gene set variation analysis [45] (GSVA). All scores in the barplots were scaled such that a score of 0 matched the average of HSCs and 1 matched the average score for the original cell type (see the ‘Materials and methods’ section). Error bars correspond to the mean ± the standard deviation for each subgroup. (e) Biplots showing contrasting the effect of hydrocortisone (HYDRO) in Round 1 on Ery module score and the CMP module score. Plotting symbols depict a different subgroup of samples, with symbols indicating the cell type and colour indicating the treatment status (for in vitro samples only). n(samples) = 183. (f) Biplot analogous to panel (e) showing the effect of insulin + heparin (INS + HEP) in Round 2 on the Ery and MEP module scores. (g) Biplot showing the effect of insulin + heparin (INS + HEP) in Round 2 on the Ery and B-cell module scores.

Mismatches persist in chromatin adaptation to erythroblasts even for best-performing protocols. (a) Sorted region heatmaps comparing accessibility between erythroblast samples [Ery; one representative sample from [31] and two replicate samples after differentiation with the best available treatment in vitro (treatment 5E = high EPO over two rounds)]. Each block corresponds to cell-type-specific chromatin modules that have been arranged by likeness to the Ery reference into concordantly open regions (top; identical to Ery-specific chromatin module), discordantly open (middle; MEP- and non-Ery regions), and concordantly closed (bottom; HSPC and lymphocyte-specific regions; see Supplementary Fig. S8a). Concordant regions (top and bottom) match between Ery and 5E, whereas discordant regions (middle) remain erroneously accessible in treatment 5E. (b) Direct comparison of the percentage of regions with >0 TF motifs from the HOCOMOCO [44] v11 database in the concordant (x axis) versus the discordant (y axis) regions. Colour represents the log10P-value of a two-sided Fisher test calculated on a two-dimensional contingency table for that motif (number of motif hits in 4872 discordant regions versus 4218 concordant regions). We label TFs with motifs absolute percentage changes ≥0.15, or those uniquely enriched in the discordant regions (percentage ≥40%), which were used for the enrichments below. The motif frequencies in concordant and discordant regions are provided in Supplementary Table S7.

Data-driven selection of protocol modifications to correct chromatin mismatches between in vitro and in vivo cells. (a) Gene set overlap analysis comparing TFs associated with the discordant-enriched motifs in Fig. 4b with genes affected by perturbations ligand perturbations from the Enrichr database (collections: ‘Ligand_Perturbations_from_GEO_down/up’) using hypeR (background: all 676 motifs in the HOCOMOCO database; all tested gene sets were restricted to genes in this background). Gene sets were ranked by the Jaccard index. Overlap results are provided in Supplementary Table S8. (b) Schematic overview of validation experiments for data-driven improvements to Ery differentiation based on the enrichment results in panel (a). We tested the original protocol (8G), the best-performing protocol (5E), in comparison to a modified version of protocol 5E with the addition of IL-4 or TGF-β in both rounds of differentiation (IL-4_1, TGF-β_1) or only in Round 2 (IL-4_2, TGF-β_2). (c, d) Flow cytometry results comparing Ery markers at Round 1 (day 7, panel c) and Round 2 (day 11, panel d) of differentiation. Each panel shows, in order: the percentage of CD71/CD36-double-positive cells, of CD36-positive cells, of CD71-positive cells, and of CD71/CD235a-double-positive cells. Statistical analyses using ordinary one-way ANOVA. *P < .05; P < .01; *P < .001; P < .0001. (e) RT-qPCR analysis of HBB expression at day 11 and 14. Bars represent means and standard deviations of three biological replicates. Statistical analyses using ordinary one-way ANOVA. P < .01. (f) Quantification of enucleated cells as Hoechst/PI double-negative cells by flow cytometry. Bars represent means and standard deviations of three biological replicates. Statistical analyses using ordinary one-way ANOVA. *P < .001. (g) Wright–Giemsa of cytospin preparations of cells at day 11 of erythroid differentiation. Late erythroblast cells are marked with an arrow. (h) Quantification of late erythroblast cells upon treatment with TGF-β or IL-4. Bars represent means and standard deviation of three biological replicates. One thousand cells were counted and staged per replicate. Statistical analyses using ordinary one-way ANOVA. *P < .05. (i) Distribution of log2-fold changes per peak group (compared to all other peaks) in each adapted protocol compared to treatment 5E. Green and blue tones correspond to TGF-β and IL-4 treatment, respectively. Filled and dotted lines correspond to the treatment applied in both rounds (TGF-β_1, IL-4_1), or only at Round 2 (TGF-β_2, IL-4_2), respectively.

Funding1

—FWF10.13039/501100002428

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGenomics and Chromatin Dynamics · Pluripotent Stem Cells Research · Epigenetics and DNA Methylation

Full text

Introduction

The overarching goal of regenerative medicine is to replace diseased or damaged tissues with cells generated in a laboratory [1, 2]. To minimise risks and maximise benefits, it is important to produce cells that are phenotypically equivalent to their in vivo counterparts [1, 3]. Moreover, cell production should be reproducible across laboratories and have a high yield [2, 4]. A promising strategy is to direct the differentiation of stem cells using different types of external stimuli (chemical, physical) or genetic perturbations [5, 6]. However, some cell types remain difficult to engineer in sufficient quality or quantity due to incomplete acquisition of the target phenotype, persistence of progenitor traits, or heterogeneity in the generated cell populations [3, 7]. Moreover, cell culture components can be expensive and differentiation yields low, rendering cell production economically burdensome [8]. Therefore, strategies to optimise, standardise, and reduce the cost of cell differentiation are highly sought after.

There are at least two technical challenges to the design of differentiation protocols. First, the number of culture components, concentrations, and application times is large and with each additional variation considered the number of possible protocols becomes exponentially larger. Without automation and multiplexed readouts, exhaustive assessment of all alternatives quickly becomes infeasible for most laboratories [9–13]. Alternatively, the number of combinations to be evaluated may be reduced by experimental design, e.g. by applying mathematical modelling or engineering-inspired methodologies such as the Taguchi method for robust parameter design [13–17]. These complex designs can be difficult to implement without bespoke hardware[18] and expertise, thus leading us to favour simple approaches. Borrowing a concept from machine learning [19], a ‘greedy’ optimisation strategy may be an appealing solution to this problem. Greedy optimization evaluates a fixed number of alternatives at each step and picks the option that is closest to the target as a basis for subsequent iterations. Implementing this approach to protocol optimisation could decrease the experimental burden by reducing the complexity of the search problem from exponential to a linear one, which has been useful to refine differentiation strategies in primate stem cells [20]. However, if intermediate measurements are misleading, greedy selections may yield suboptimal solutions. The second technical challenge is how to assess the acquisition of the desired target cell identity. The typical approach is to evaluate assorted molecular markers via imaging, flow cytometry, or real-time quantitative PCR (RT-qPCR). These assays are inexpensive and readily accessible at most institutions but provide only an incomplete assessment of the underlying phenotype, especially at intermediary differentiation stages. Transcriptomics assays have been widely adopted to assess phenotypes during differentiation [21–24]. Combined with systems biology and network analysis, this allows not only to accurately assess the fidelity of differentiated cell types [3, 7, 25] but also to identify lineage-determinant transcription factors (TFs) and signalling events [26–29] and thus a data-driven preselection of potential treatment targets for differentiation protocols [30]. In practice, gene expression may be affected by circumstantial effects like proliferation state or environment, thus distracting from the key phenotype to be examined and contributing to so-called ‘batch effects’. In contrast, epigenomic features like DNA methylation, chromatin accessibility, and chromatin conformation are cell-type-specific and can be measured effectively using genome-wide assays (e.g. ATAC-seq) [31–33]. Previous work demonstrated that chromatin accessibility profiles are highly informative about cell types [31], that changes may even be detected earlier than at the RNA level while being stable against transient fluctuations, and that analysis of chromatin allows identifying cell-state-defining regulatory elements [34–38]. The global epigenomic similarity may thus be a good proxy for the convergence towards a desired target cell type.

Here, we apply greedy optimisation logic to cell differentiation, using the haematopoietic system as a proof of principle. We take advantage of the iterative, multi-step nature of differentiation protocols, and use flow cytometry and ATAC-seq to quantify the phenotypic distance between the differentiating cells and the target, before the next iteration. By subjecting haematopoietic stem cells (HSCs) to two rounds of eight treatments each to produce erythroblasts (Ery), we show that the winning condition in the first round predicted the winners of the second round. Our results suggest that erythropoietin (EPO) alone produces the best-matching erythroblasts, and that insulin (INS), heparin (HEP), and hydrocortisone (HYDRO), used in several protocols, may have an adverse, time-sensitive impact. Using machine learning, we reveal spurious accessibility of chromatin at binding sites of TFs such as STAT1, ZNF394, and IRF1/3, which are normally closed in reference erythroblasts. Finally, we find data-driven perturbations to target the activity of such TFs in future experiments. To facilitate the optimisation of other differentiation protocols, we provide a computational pipeline that performs the analysis and generates a report for test samples in comparison to a defined target cell type.

Materials and methods

Human erythroid cell culture

G-CSF-mobilized human peripheral blood CD34+ cells (Stem Cell Technologies) were differentiated into immature erythroid cells utilizing a three-phase culture protocol. For standardisation, we used the same batch of cells for all experiments described in this study. Briefly, cells were thawed and cultured overnight at a density of 10^4^ cells/ml in X-VIVO 10 serum-free haematopoietic cell medium (Lonza) supplemented with 100 ng/ml recombinant human SCF (Biotechne), 100 ng/ml recombinant human TPO, and 50 ng/ml recombinant human FLT3 Ligand (PeproTech). The next day, cells were washed and resuspended at a density of 1–2.5 × 10^4^ cells/ml into erythroid differentiation medium 1 (EDM1) consisting of Iscove's Modified Dulbecco's Medium (IMDM) supplemented with 100 ng/ml hSCF, 330 μg/ml human holo-transferrin, 5 ng/ml recombinant human IL-3 (PeproTech), 1% penicillin/streptomycin, 5% heat-inactivated human serum (Sigma), and 1 or 3 IU human recombinant EPO (Stem Cell Technologies). When it is indicated, 1 μM hydrocortisone, 10 μg/ml recombinant human insulin, and 2 IU/ml heparin (Sigma) were added as well. From day 7 to day 11, cells were cultured at a concentration of 10^5^ cells/ml in EDM2 (EDM1 without hIL-3). From day 11 to day 14, cells were cultured in EDM3 (EDM2 without hSCF) at a concentration of 0.75–1.5 × 10^6^ cells/ml in the presence of 1 mg/ml holo-transferrin. For validation experiments, CD34+ cells were differentiated into immature erythroid cells as described above, with the addition of 0.1 ng/ml human TGF-β1 recombinant protein (PeproTech) or 10 ng/ml human recombinant IL-4 (PeproTech) in EDM1 and EMD2 when indicated. Cells were cultured at 37°C and 5% CO_2_. These materials are listed in Supplementary Table S10.

Flow cytometry

Cells were washed in phosphate buffered saline (PBS) supplemented with 1% fetal bovine serum (FBS) and incubated with antibodies for 30 min at 4°C. The following antibodies were used: FITC-conjugated CD71 antibody (clone OKT9, eBioscience, 1:40 dilution), APC-conjugated CD36 antibody (clone 5-71, Biolegend, 1:40 dilution), and BV421-conjugated CD235a antibody (clone HIR2 GA-R2, BD, 1:20 dilution). Cells were stained using 1 µg/ml Hoechst 33342 (#H3570, Thermo) for 20 min at 37°C, followed by staining with 4 µg/ml Propodium Iodide (#81845, Sigma) for 5 min at 4°C to assess enucleation. Flow cytometry was conducted on a BD Bioscience Fortessa instrument and analysed using FlowJo software v10.10.

RNA isolation and RT-qPCR

RNA was harvested from 1–2 million cells using the RNeasy Plus Mini Kit (Qiagen, #74134). Five hundred nanograms of RNA was used for complementary DNA (cDNA) synthesis using the iScript cDNA Synthesis Kit (Bio-Rad, #1708891). qPCR was performed using the iQ5 SYBR Green Supermix (Bio-Rad, #1708885) and ran in the CFX Connect Real Time System (Bio-Rad). An initial denaturation at 95°C for 3 min, followed by 40 cycles of 10-s denaturation at 95°C and 30-s annealing and extension at 60°C was performed using the following oligonucleotides for HBB (CTGAGGAGAAGTCTGCCGTTA; AGCATCAGGAGTGGACAGAT) and RPLP0 (GATTGGCTACCCAACTGTTG; CAGGGGCAGCAGCCACAAA).

Wright–Giemsa staining

Approximately 0.1 million cells were immobilised onto microscope slides using a low-speed centrifuge (Hettich Rotofix) and a funnel at 260 rpm for 8 min. Next, the funnel and media were removed, and slides were spun for another 4 min at 330 rpm. The cells were then fixed in methanol for 1 min, immerged in modified Wright–Giemsa (WG16, Sigma) for 2 min, followed by incubation in phosphate buffer (pH 6.8; #1113740100, Merck) for 4 min. Finally, slides were rinsed in ddH20 and air-dried. Slides were imaged using a ZEISS Axio Imager Z2 using 20× and 40× objectives. One thousand cells for each slide were manually counted and scored according to morphology.

Chromatin accessibility analysis (ATAC-seq)

Library preparation and sequencing

For ATAC-seq, 50 000 cells were collected at day 7 and day 11, in biological duplicates. Cells were washed using cold PBS and subsequently lysed in cold lysis buffer (10 mM Tris–HCl, pH 7.4, 10 mM NaCl, 3 mM MgCl2, 0.1% IGEPAL CA-360). Immediately after lysis, nuclei were collected by centrifugation, and the pellet was resuspended in 50 μl transposase reaction mix (Illumina Nextera DNA Preparation kit). The transposition reaction was carried out at 37°C for 30 min. Following transposition, the samples were purified and eluted in 10 μl Elution Buffer (Qiagen Minelute PCR purification kit). Libraries were amplified by polymerase chain reaction (PCR) for a minimum of five cycles. Additional cycles were added to reach exponential amplification according to qPCR. Oligonucleotides are provided in Supplementary Table S11. Libraries were purified and quantified using a Qubit dsDNA HS Assay kit (Invitrogen). Before pooling, the quality of the library was determined by fragment analysis on a TapeStation 4200 (Agilent Technologies) using a High Sensitivity D1000 ScreenTape. Samples were sequenced on a NovaSeq X Plus to obtain at least 30M PE150 reads per library. Supplementary Table S1 provides an overview of all sequence data and associated quality metrics.

Data preprocessing and quality control

Bioinformatic pipelines were obtained using nextflow (v22.10.6) nf-core (v2.14.1) [39]. Raw chromatin accessibility data from purified cell types in the bone marrow was obtained from a previously published work [31], GEO (GSE74912) using the nexftlow (v23.10.1) nf-core pipeline nf-core/fetchngs (v1.10.0). For alignment, peak calling, quality control, and accessibility count matrices, the reference and test datasets were processed separately using the nf-core/atacseq pipeline (v2.1.2) (directives: --broadpeak, --aligner bwa, --macs_gsize 2749859687, iGenomes hg38). To obtain accessibility counts in the reference regions for both datasets, test reads were recounted on the reference consensus peak set using a custom workflow employing the nf-core module subread/featureCounts. Samples with a fraction of reads in peaks (FRiP) score ≥0.05 were kept for downstream analysis. The coordinates and annotations of peaks used in this study are provided in Supplementary Table S4.

Normalisation and selection of cell-type-informative PCA components and calculation of distances to target

All subsequent analyses were implemented in R (v4.1.3) and Bioconductor (v3.14). Depth-normalised counts on 595 523 regions for all samples were obtained for one reference dataset and two test datasets using DEseq2 [40] (v1.34.0).

Normalised counts for the reference cell types were subjected to principal component analysis (PCA). Loadings for the reference samples on 80 principal components (PCs) were then used to project the test samples onto this PCA space. Reference sample PCA embeddings were used to train a random forest multiple classification model using the R library RandomForest (v4.7-1, ntrees = 500). Classification error estimations, variable importance, and cross-validation were performed by performing out-of-bag analysis while building the decision trees. Components with a mean decrease in Gini index of ≥2 (high-importance PCs) were kept for a second round of random forest to quantify the decrease in classification error. High-importance PCs were used to calculate the Euclidean distance between each sample and the centroid (multidimensional mean) of erythroblasts.

Computational down-sampling analysis

To simulate the effects of varying sequencing depth in the test and reference dataset, we calculated the average sequencing depth of the test dataset (our ATAC-seq data) and the reference haematopoietic dataset and randomly down-sampled read counts assigned to each peak to defined read depths (0.5, 1, 5, 10, 15, 20, 25 million reads for the test data; 1, 10, 20 million reads for the reference) as a fraction of the original read depth. We performed 250 random trials and evaluated the chromatin distances to the erythroid sample in all pairwise comparisons of test and reference read depth after selecting informative PCs as described above.

Evaluation of chromatin distance in other haematopoietic lineages

Sequencing read datasets comprising in vitro samples from the respective publications [41–43] were downloaded and processed using the nextflow nfcore/atacseq pipeline, and subsequently reads in the reference peaks were counted using our custom nextflow workflow with the same parameters as described above. Instead of projecting each dataset to the reference PCA, a joint PCA with all the datasets was performed, which mitigated issues due to dataset quality control differences. For the calculation of the chromatin distance, high importance PCs (see above) with mean decrease Gini index above 1.2 were used (PC2, PC13, PC7, PC4, PC16, PC10, PC17, PC5, PC6, PC14, PC11, PC8, PC3, PC9, and PC12) to calculate the Euclidean distance between each sample and the PCA centroid of the samples the target cell type.

Selection of chromatin cell identity modules using the receiver-operating characteristic

Chromatin regions with the highest loadings were collected from the first 10 PCs of the dataset and the high-importance PCs (see previous section). For each PC, regions were ranked a lower limit in the loading was set by a heuristic elbow metric (Supplementary Fig. S7a). Within this subset, up to 8000 peaks per PC were scrutinised exhaustively for cell-type signals using receiver-operating characteristic (ROC) analysis. ROC curves and their respective area under the curve (AUC) were then calculated for each high-loading region and cell type using the normalised accessibility counts using the R package pROC (v1.18.0). Regions with an AUC ≥0.95 are used as cell identity chromatin modules and regions with AUC ≤0.05 are also kept as distinctively closed for that cell type. Test samples from in vitro differentiation were left out of the ROC computation. AUC values for all peaks are provided in Supplementary Table S4.

Mixed-effects modelling of treatment effects

Mixed-effects modelling was performed in R using the function lmer from the package lme4 (v1.1-29). The model evaluated the contributions of fixed effect predictors (1.EPO and 2.EPO) and random effect predictors (1.HYDRO, 2.HYDRO, 1.INS + HEP, 2.INS + HEP) on the response (the distance to the Ery centroid). The intercepts and coefficients of the model are provided in Supplementary Table S3.

Transcription factor binding motif counting and enrichment analysis

TF motifs in all regions were counted R package motifmatchr (v1.16.0) using position-specific scoring matrices from the HOCOMOCO database [44] (v11). A region with >0 motif hits for a TF was considered positive. Then the number of positive regions in a selected groups was compared to the background in question. In the case of TFs enriched in cell-type-specific regions, the background was all the regions. In the case of the discordant versus concordant regions, the background was the concordant regions. Fold-enrichments were calculated using the formula log_2_ (number of positive regions in sample/ number of positive regions in background). Statistical assessment of the enrichments was performed via a right tailed hypergeometric test (finding the probability of sampling X or more positive regions without replacement in a sample of n regions given an expected number of p(n) positive regions) and a Fisher test on the contingency matrix. Probabilities were then adjusted for multiple testing using the Benjamini–Hochberg procedure. Motif enrichment results are provided in Supplementary Tables S6 and S7.

Chromatin cell identity module scoring and analysis

Single sample gene set variation analysis scoring was performed using the method ‘ssgsva’ in from the R package GSVA [45] (v1.42.0). For each cell type, cell identity regions were treated as genes for this analysis and scores were calculated from the normalised count matrix for all samples including the test (Fig. 3e). To obtain a relative score (Fig. 3d), the average score for HSC samples was subtracted and the value was divided by the average score of the query cell type. For example, to calculate the B cell module score for an Ery sample, the average B cell score for the HSCs was first subtracted from all samples and then divided by the B cell score of B cells. Negative scores were set to zero. The HSC module score is not defined for all samples using this formula.

Gene set enrichment analysis of closest transcription start sites

The closest transcription start site for the regions in the analysis was found using the GenomicRanges package (v1.46.1), specifically the function distanceToNearest (ignore.strand = T). The resulting genes per group were submitted to a gene set enrichment analysis using the package hypeR [46] (v1.10.0, test=”hypergeometric”) versus different gene sets and were adjusted for multiple testing using the Benjamini–Hochberg procedure, then filtered on P-value (P ≤0.05) and sorted by overlap using the Jaccard index (size of the gene list intersection/size of the gene list union). Gene set enrichment results are provided in Supplementary Tables S5 and S8.

Generation of sorted region heatmaps

Sorted region heatmaps were generated with the Deeptools package [47] (v3.5.4), specifically the tools computeMatrix (reference-point, -a 3000, -b 3000, --sortRegions descend, -bs 25, --missingDataAsZero, --referencePoint = center) and plotHeatmap (--zMin 0, --zMax 2, --yMin 0, --yMax 1). The reference sample for sorting (parameter: -S) was always the leftmost one in the plots.

Concordant and discordant region analysis

This analysis compared the opening and closing of peaks in the best protocol samples (those of 5E) relative to a true erythroblast sample (SRX1427827), as shown in Fig. 5a. To separate concordant and discordant regions, all regions that could be informative of cell fidelity were used. This involved cell identity regions for all cell types (ROC AUC ≥ 0.95) and the negative cell identity regions for Ery (ROC AUC for Ery ≤0.05). To simplify clustering (which performed poorly with all the peaks together), blocks of peaks were visually split into mostly concordant open (all Ery peaks), concordant closed and mostly discordant [Ery negative peaks plus megakaryocyte-erythroid progenitor (MEP) peaks]. Then, to further refine the discordant peak block, discordant regions were re-sorted and clustered using DeepTools plotHeatmap (--hclust 4) using the aforementioned Ery sample as a sorting and clustering reference. This yielded three small clusters with concordant patterns and one large discordant cluster, which was then used as the discordant peaks. Similar treatment of the concordant or discordant groups yielded no obvious improvement on the group quality and thus both groups were kept according to the visual split.

Analysis of chromatin changes in adapted differentiation protocols

To compare the adapted protocols (TGF-β_1, TGF-β_2, IL-4_1, IL-4_2) versus the previously winning protocol (5E), we used DESeq2 (v1.34.0). Each condition was compared individually versus the samples of 5E. To call differentially accessible peaks, we tested only concordance peaks (concordant open/closed and discordant) with five or more reads in at least two samples of at least one condition. Peaks with an adjusted P-value ≤0.05 (using Benjamini–Hochberg correction) were reported as significant (Supplementary Table S9).

Results

We sought to evaluate the potential of greedy optimisation for cell differentiation protocols by chromatin approximation using the example of erythroblast production from HSCs. Published protocols for erythroid differentiation include hormones and cytokines such as EPO, interleukin 3 (IL-3), insulin (INS), heparin (HEP), and hydrocortisone (HYDRO) [48, 49]. Differentiation can be monitored based on loss of stem cell marker CD34, a gain and subsequent decline of CD36 (thrombospondin receptor) and CD71 (transferrin receptor), and finally a gain of CD235a (glycophorin A). As a baseline for treatment exploration, we took a well-established protocol [50] that consists of three rounds of treatments (hereafter referred to as Round 1–3; Fig. 1a). We focused on variations of 3 culture components in the first 2 rounds (followed my maturation in Round 3): the concentration of EPO (1×, 3×), the combined presence/absence of INS + HEP, and the presence/absence of HYDRO, yielding 8 ingredient combinations per round, and 72 treatments in total (8 at end of Round 1, plus 64 at Round 2). Thus, by evaluating all treatments, we sought to learn whether an optimum at Round 1 (the greedy choice) on its own would lead to the global optimum (the best overall) at Round 2. Throughout in this paper, we indicate the round of each treatment with the prefixes ‘1.’ and ‘2.’. For example, addition of insulin and heparin to Pound 2 is indicated as 2.INS + HEP.

An exhaustive comparison of erythroid differentiation protocols confirms validity of greedy prioritisation by chromatin similarity to approximate optimal solution. (a) Top: Blood cells originate from CD34-positive HSCs, which branch into lymphoid and myeloid lineages. Erythroblasts originate from common myeloid progenitors (CMP) and are marked by CD36, CD71, and CD235a expression. NK, natural killer; MPP, multipotent progenitor; LMPP, lymphoid-primed multipotent progenitors; CMP, common myeloid progenitor; mono, monocyte; gran, granulocyte; mega, megakaryocyte; GMP, granulocyte-monocyte progenitor; MEP, megakaryocyte-erythroid progenitor. Bottom: Two strategies to find protocols that approximate a target cell type. Exhaustive strategies evaluate all options but are costly and their complexity is exponential. We test whether a stepwise, greedy selection can result in optimal outcomes with lower experimental burden. (b) Sorted region heatmap (DeepTools) depicting the opening of erythroblast (Ery)-linked chromatin along differentiation in haematopoietic cells. The regions shown were selected by differential accessibility analysis in erythroblasts compared to all other samples in the reference (two-sided moderated t-tests, DESeq2; P-value ≤ 0.05). (c) UMAP of the cell-type chromatin reference using cell-type-informative PCA components shown in Supplementary Fig. S1. Each point represents one independent sample. Density hulls indicate all samples from in vitro protocols (both rounds) generated in this study using different erythroid differentiation protocols (red) and the erythroblast target cells (purple). (d) Ingredient combinations (eight total) tested for Round 1 of differentiation, as a guide for data plots in panels (e) and (f). Circle size indicates the multiplication factor of a base concentration of each ingredient added (see ‘Materials and methods’). Absence of a circle indicates 0. (e) Euclidean distance of all Round-1 samples to the centroid of the Ery samples (‘erythroid distance’) using the cell-type-informative PCs (Supplementary Fig. S1). Each square represents one ATAC-seq datasets (two technical replicates from two independent experiments each). Red colour indicates the treatment with samples closest to the Ery centroid, i.e. smallest distance. The distance values are available in Supplementary Table S1. (f) Fraction of CD36 and CD71 double-positive cells from all treatments in panel (c) assessed by flow cytometry (n = 2 independent experiments). Red colour indicates the sample with the highest overall number of double-positive cells. The flow cytometry results are available in Supplementary Table S2. (g) For an exhaustive evaluation of treatments, each sample from panel (c) (treatments 1–8) was subjected to a second round of the same eight treatment combinations (treatments A–H), resulting in 64 treatment combinations (1A, 1B, etc.). (h) Erythroid distance of Round-2 samples, analogous to panel (d) (n = 2 independent experiments). Red colour indicates the treatment closest to the Ery centroid. Black dashed lines indicate the average distance for all samples coming from the same Round-1 treatment. (i) Flow cytometry analysis of Round-2 samples, analogous to panel (e) (n = 2 independent experiments). Fill colour indicates the CD71/CD36 ratio, which differs even between samples with similar amount of CD71+ CD36+ cells.

Projection to a haematopoietic cell-type reference reveals sequential approximation of the erythroblast phenotype and allows selection of high-performing protocols

We started by defining the target cell type to be approximated at the molecular level as a baseline to quantify the fidelity of cells differentiated in vitro. Previous research provided chromatin accessibility maps that distinguish major haematopoietic cell types, thus providing a suitable ground-truth reference for this purpose [31] (Fig. 1b). A PCA of the consensus chromatin peaks among all reference samples revealed that information about cell identity was not evenly distributed across all PCs of the variance (Supplementary Fig. S1a). For example, PC1 separated cell types less clearly than PC3. To focus on features with a high cell-type signal-to-noise ratio, we trained a multiclass random forest classifier on all PCs to classify all reference cell types and selected the most important PCs for the classification task (Supplementary Fig. S1b). Performing the same classification with this shortlist of cell-type-informative components reduced the classification error rate from ~25.6% to ~11.5%. Most misclassifications came from developmentally related cell types such as HSCs and multipotent progenitors, and different T cell subtypes. Using cell-type-informative PCs, we built a UMAP to visually summarise the cell-type information in these cell-type-informative PCs (Fig. 1c).

Having defined the haematopoietic PCA space, we aimed to find treatments that produced cells with the closest resemblance to the reference erythroblasts. To find a high-performing differentiation strategy, we evaluated chromatin accessibility using ATAC-seq in duplicates (Supplementary Fig. S2a–d; overview and qual of all genomics data: Supplementary Table S1) and the number of CD36+ CD71+ double-positive cells using flow cytometry (Supplementary Fig. S3a and b, and Supplementary Table S2). After confirming data were of satisfactory quality (average FRiP score = 0.24; Supplementary Table S1 and Supplementary Fig. S2a) and reproduced expected chromatin accessibility patterns at the well-described HBB, but not the foetal and embryonal beta globin genes HBG1/2 and HBE (Supplementary Fig. S2b and c), we visually inspected the results of the differentiation experiments by projecting the ATAC-seq data to the haematopoietic reference and observed that most treatments appear globally close to the reference erythroblasts (Fig. 1c and Supplementary Fig. S1a), with erythroblast-linked chromatin regions opening (Supplementary Fig. S2d). To evaluate treatment performance more objectively, we calculated the Euclidean distance in PCA space between each sample and the centroid of all erythroid samples. We reasoned that this distance provided a unidimensional, interpretable objective function to minimise as the samples converged towards the erythroid distribution. PCA distance comparisons revealed treatments that helped cells approach the average erythroid chromatin profile (i.e. they reduced the PCA distance), whereas others failed to match some Ery chromatin features and increased the distance (Fig. 1e). At the end of Round 1, we identified samples from treatment 5 as the closest to the erythroblast centroid. This treatment, which consisted of only high EPO, also yielded the highest number of CD36+ CD71+ double-positive cells (Fig. 1f and Supplementary Fig. S3b). This consistency suggests high concordance between conventional metrics of cell fidelity and our chromatin distance metric. Thus, according to the greedy search principles, treatment 5 would be the one elected forward for Round 2.

To assess whether the greedy choice of treatment 5 would eventually yield the best erythroblast-like samples, we then evaluated all samples corresponding to differentiation Round 2 (Fig. 1g–i). Treatments building on top of treatments 5 (5A–H) and 7 (7A–H) from Round 1 produced the highest percentage of CD71+ CD36+ cells (~90%), however, the former had a higher CD71/CD36 ratio (Fig. 1i and Supplementary Fig. S3c and d), which suggests further maturation [51]. Concordantly, treatments 5A–H also had a smaller PCA distance to the centroid of all erythroid samples (called ‘erythroid distance’ henceforth) than 7A–H (mean distance 473.9 versus 489.1; Fig. 1h, black dashed lines). Treatment 5E (Fig. 1h and i, and Supplementary Fig. S3e and f; highlighted in red), produced samples with a smaller erythroid distance than all others, thus representing the overall best treatment for erythroblast production in our hands.

We sought to test the robustness of our greedy selection to technical variation. To this end, we performed a computational simulation in which we varied the quality of the test and reference samples by down-sampling read counts assigned to each peak thus emulating effects of different sequencing depths. We found that erythroid distances calculated on down-sampled datasets were highly correlated with the original metric down to a sequencing depth of only ~5M reads in the test dataset, if the reference was of a sufficient quality (≥10M reads; Supplementary Fig. S4a and b). With low depth in the reference dataset, results became unstable with a high variation between random trials (Supplementary Fig. S4b). This is due to a decreasing margin between the erythroid distance of the winning sample (= lowest erythroid distance) and all other samples (Supplementary Fig. S4c), resulting in a variable assignment of winning treatment across trials (Supplementary Fig. S4d). However, at ≥15M average depth of the test data, treatment 5 was consistently picked as the winner in all trials (Supplementary Fig. S4e). Treatment 5E was most frequently chosen as the winner of Round 2, but at low to medium sequencing depths (<20M), other protocols were sometimes favoured including 7A (the second-best treatment in our main analysis).

Furthermore, we tested whether chromatin distances were informative in other differentiation lineages by reanalysing public ATAC-seq data from three in vitro differentiation studies. First, we confirmed that the lowest erythroid distance correlated with grades of ongoing maturity defined by FACS markers in independently performed experiments based on independent protocols and other biological source materials [43] (Supplementary Fig. S5a). Second, we found that chromatin distance to monocytes in our reference atlas (‘monocyte distance’) was lowest in HL60 cells differentiated using a monocytic differentiation protocol compared to those subjected to macrophage or neutrophil maturation [42], supporting that our concept for protocol selection might be generalisable to other lineages (Supplementary Fig. S5b). Third, we explored distance to NK cells in our references based on in vitro-differentiated HSPCs [41]. In this analysis, chromatin-based ‘NK cell distance’ varied strongly with only a marginal decrease throughout differentiation (Supplementary Fig. S5c). NK, CD4+, and CD8+ T cells are similar at chromatin level (Supplementary Fig. S1), which might be a reason why chromatin distance is less informative in this instance.

Our results confirm that a greedy choice during the first round of differentiation could effectively lead to the overall best-performing treatment, and thus that greedy treatment searches are a viable way to optimise differentiation. However, caution is advised when applying the approach in other lineages, where the distinguishability of cell types in the reference might be one indicator of cases where chromatin distance is not a sufficient indicator of differentiation progress.

Combined insulin and heparin induce time-dependent effects on chromatin during erythroid differentiation

To explain why some samples were closer to the erythroid centroid than others, we further scrutinised how the PCA distance was affected by the treatments by quantifying the effects of each ingredient on the distance metric via a mixed-effects model. Briefly, this approach explains variance in the chromatin distance as a sum of biases, each of them brought by one ingredient. These effects are allowed add onto each other or cancel each other out. Thus, the model estimates which linear coefficients for every ingredient would create a best fit for the observed distances and therefore explains treatment diversity in our data. Having all possible combinations of treatment components allows to interrogate the effects of each ingredient.

Upon modelling the effects of individual treatment components over 64 protocols, we found that EPO had the strongest positive effect on differentiation outcome (Fig. 2a and Supplementary Fig. S6a). According to the model, each unit of EPO reduced the distance to the erythroid centroid by a factor of ~7.8 on average (Supplementary Table S3). In contrast, we found that incorporation of INS + HEP had a stark hindering effect, particularly in Round 2 (2.INS.HEP); its addition increased the PCA distance by ~30 points on average (Fig. 2b). Inspection of the data confirmed a significant association of 2.INS + HEP with large erythroid distance (P = 1.5 * 10^−15^, two-sided Wilcoxon rank sum test, P-values adjusted for multiple hypothesis testing using the Benjamini–Hochberg method), while 1.INS + HEP had only a marginal effect (Fig. 2c and d). That is, the effect of INS + HEP during the first differentiation round was eventually overwritten by the treatment in round two. In contrast, the subtle effect of high EPO in Round 1 prevailed independent of ensuing treatment choices (Supplementary Fig. S6a and b). HYDRO had diverging effects at early and late stages, but neither of those effects were significant (Fig. 2b and Supplementary Fig. S6c).

Statistical mixed-effects modelling identifies the impact of treatment ingredients on erythroid differentiation. (a) We used a mixed-effects model to quantify the relative impact of treatments on the PCA distance to the Ery centroid (cp. Fig. 1e and h). Fixed effects are EPO units at both Round 1 and Round 2 of in vitro differentiation. Negative and positive coefficients (left) indicate the estimated effect on erythroid distance per unit of treatment, respectively. The parameters of the model are available in Supplementary Table S3. (b) Intercepts for variables included as random effects in the model described in panel (a): insulin plus heparin (INS + HEP) and hydrocortisone (HYDRO) at Round 1 and Round 2. (c) Boxplot showing the median and interquartile range (IQR) of the distance to the erythroblast centroid for samples of differentiation Round 2 (n = 119 samples), divided by whether they received INS + HEP during Round 1. Whiskers extend to 1.5IQR. All P-values correspond to a two-sided Wilcoxon rank sum test adjusted for multiple hypothesis testing using the Benjamini–Hochberg method. (d) Boxplots as in panel (c), showing samples divided by INS + HEP at Round 2.*

Altogether, this suggests that the timing of some of ingredients may impact differentiation performance and, more generally, that the distance to the target in the cell-type-informative PCA can capture large systematic effects of treatments.

In vitro-differentiated erythroblast-like cells show residual accessibility of progenitor chromatin modules

Our preceding analysis provided a bird’s-eye view of effects of treatments on the phenotype of differentiating cells. We wondered which genomic regions where most affected by the treatments at the chromatin level and how this influenced the acquisition of erythroblast identity. To investigate this question, we searched for chromatin regions whose accessibility would be highly predictive of a cell type, irrespective of how large its fold change was. To this end, we employed ROC analysis to evaluate 314 431 high-priority regions (see the ‘Materials and methods’ section) on their ability to discriminate each cell type based on accessibility (Supplementary Fig. S7a). For instance, an ROC AUC of 1 would indicate that all samples of a given cell type (and no other samples) had the highest accessibility and thus the region was unequivocally predictive for that cell type. An AUC ≤0.05 would indicate that the region was uniquely inaccessible in that cell type and thus was also (negatively) predictive. Lastly, an AUC of ~0.5 is equivalent to a random guess between two groups.

In total, we found 35 533 peaks with an ROC AUC greater or equal than 0.95 for any cell type (Fig. 3a, Supplementary Fig. S7b and c, and Supplementary Table S4). The selected regions were often proximal of genes expressed in the respective cell types (Supplementary Fig. S7d and Supplementary Table S5) and of DNA binding motifs of TFs with a known association to the queried cell types (Fig. 3c and Supplementary Table S6). For example, motifs of the KLF-family TFs were exclusively enriched in erythroblast-specific regions, consistent with the key role of KLF1 in erythroid specification [52–54]. This confirms that the ROC analysis identified biologically relevant regions, which we refer to as cell-type-specific chromatin modules.

Effects of treatment ingredients on cell-type-specific chromatin regions. (a) Heatmap showing depth-normalised counts (row-scaled) of cell-type-specific chromatin regions. Cell-type-specific chromatin modules were defined by the AUC of the ROC to be strongly predictive of each cell type in the haematopoietic reference [31] (see Supplementary Fig. S7a). Ery, erythroblast; MEP, megakaryocyte-erythroid progenitor; CMP, common myeloid progenitor; LMPP, lymphoid-primed multipotent progenitor cells; MPP, multipotent progenitor; HSC, haematopoietic stem cell. Annotated peaks are provided in Supplementary Table S4. n(peaks) = 595 523, n(samples) = 78. (b) Heatmap analogous to panel (a), showing the accessibility of samples from in vitro differentiation generated in this study. The ingredient concentrations used for each sample are indicated on top. Selected blocks of coordinates chromatin changes have been highlighted manually with black boxes, as described in the text. n(samples) = 151. (c) Enriched TF binding motifs of each cell-type-specific chromatin module relative to all consensus peaks found in the reference dataset. Regions with >0 motif hit for a TF were counted for all regions in each module (circle size), compared to the background (fold change indicated by colour). The top five enriched TFs were selected for visualisation and all enrichment results (hypergeometric test) are listed in Supplementary Table S6. (d) Barplots of relative module scores of the cell identity modules in panel (a). Scores were calculated across all samples using single-sample gene set variation analysis [45] (GSVA). All scores in the barplots were scaled such that a score of 0 matched the average of HSCs and 1 matched the average score for the original cell type (see the ‘Materials and methods’ section). Error bars correspond to the mean ± the standard deviation for each subgroup. (e) Biplots showing contrasting the effect of hydrocortisone (HYDRO) in Round 1 on Ery module score and the CMP module score. Plotting symbols depict a different subgroup of samples, with symbols indicating the cell type and colour indicating the treatment status (for in vitro samples only). n(samples) = 183. (f) Biplot analogous to panel (e) showing the effect of insulin + heparin (INS + HEP) in Round 2 on the Ery and MEP module scores. (g) Biplot showing the effect of insulin + heparin (INS + HEP) in Round 2 on the Ery and B-cell module scores.

Having confirmed that our collection of cell identity peaks constituted functional chromatin modules of each cell type, we then interrogated the activity of each module in our in vitro samples (Fig. 3b). We found that treatments systematically influenced the accessibility of the modules. In line with our observation that INS + HEP in Round 2 had a hindering effect on erythroid differentiation (cp. Fig. 2), samples with this treatment showed increased accessibility of non-Ery modules, including the HSC module, which suggests incomplete differentiation (Fig. 3b). Additionally, samples resulting from treatments using HYDRO at Round 1 displayed low accessibility in the erythroblast-specific module (Fig. 3b).

These observations suggest that different treatments impacted chromatin beyond the cell-type modules of erythroid cells. To grasp the relative magnitude of these effects, we used the accessibility of all regions in each cell-type module to quantify chromatin-based ‘cell-type module scores’, using single-sample gene set variation analysis [45] (Fig. 3d). All reference cell-type samples scored 1.0 on average in their own cell-type score, while cell types of the same lineage (e.g. CMP, MEP, erythroblasts) also had non-zero scores for related modules (Fig. 3d). The score approached ~0 with more distance in the lineage (e.g. B cells scored ~0 on the Ery cell-type score). All in vitro samples had high Ery cell-type scores, yet lower than that of Ery samples in the reference. They also scored high for both MEP and CMP modules.

For deeper insight on the specificity of the effects of treatments, we juxtaposed pairs of modules. We observed that samples receiving HYDRO in Round 1 had an increased Ery module score but left the MEP score unaffected (Fig. 3e). On the other hand, treatment with INS + HEP in the second round decreased the MEP module score—a desired effect—yet this came at the cost of a reduced Ery score (Fig. 3f) and an increased B cell module score (Fig. 3g).

Altogether, these observations link the previously observed effects of different treatments to concrete changes in the accessibility of specific chromatin regions and show that the quantification chromatin module scores provide a measure of the mixed identity of cells as a function of other cell types. The Ery module of cells generated by in vitro differentiation is not as strongly accessible as in reference erythroblasts and coexists with residual progenitor chromatin programs, which hold them from fully achieving the erythroid cell state.

Transcription factor analysis prioritises candidate culture components to correct the discrepancies between in vitro and in vivo cells

As established in the previous sections, acquisition of a full erythroblast identity consists of both opening of Ery-specific chromatin and closure of non-Ery chromatin. While our samples progressed towards such erythroblast identity (cp. Fig. 1), they did not fully acquire the erythroblast chromatin state even with the best-performing treatment (5E) (cp. Fig. 3d–f). A close inspection of the accessibility of chromatin modules after differentiation confirmed that many Ery-specific regions had gained accessibility, while many HSC-, CMP-, and lymphocyte-specific regions had lost accessibility (Supplementary Fig. S8a). However, regions in the MEP module and other non-Ery regions (AUC ≤ 0.05) remained accessible at the end of the differentiation protocol. We summarised these sets of regions as ‘concordant open’, ‘concordant closed’, and ‘discordant open’, respectively (Fig. 4a). The culture ingredients tested as part of our treatments did not sufficiently address the discordant regions, and we hypothesised that rectifying these discrepancies to the in vivo target cells would likely improve fidelity of differentiation.

Mismatches persist in chromatin adaptation to erythroblasts even for best-performing protocols. (a) Sorted region heatmaps comparing accessibility between erythroblast samples [Ery; one representative sample from [31] and two replicate samples after differentiation with the best available treatment in vitro (treatment 5E = high EPO over two rounds)]. Each block corresponds to cell-type-specific chromatin modules that have been arranged by likeness to the Ery reference into concordantly open regions (top; identical to Ery-specific chromatin module), discordantly open (middle; MEP- and non-Ery regions), and concordantly closed (bottom; HSPC and lymphocyte-specific regions; see Supplementary Fig. S8a). Concordant regions (top and bottom) match between Ery and 5E, whereas discordant regions (middle) remain erroneously accessible in treatment 5E. (b) Direct comparison of the percentage of regions with >0 TF motifs from the HOCOMOCO [44] v11 database in the concordant (x axis) versus the discordant (y axis) regions. Colour represents the log10P-value of a two-sided Fisher test calculated on a two-dimensional contingency table for that motif (number of motif hits in 4872 discordant regions versus 4218 concordant regions). We label TFs with motifs absolute percentage changes ≥0.15, or those uniquely enriched in the discordant regions (percentage ≥40%), which were used for the enrichments below. The motif frequencies in concordant and discordant regions are provided in Supplementary Table S7.

We reasoned that targeting TFs that bind to the discordantly open regions was a promising strategy to rationally enhance treatment protocols. We therefore searched for DNA binding motifs that occurred more frequently in discordant regions compared to Ery regions (Fig. 4b). In agreement with our earlier analyses (cp. Fig. 3c), KLF motifs were abundant in the Ery regions and less pervasive in discordant regions (Fig. 4b). In contrast, we found motifs of interferon response factors (IRF1/2/3/5/8), STAT TFs (STAT1/2), and key regulators of HSPCs (ERG, FLI1) enriched in discordant regions (Supplementary Table S7). Many of these TFs are regulators of other (non-erythroid) haematopoietic lineages (Supplementary Fig. S8b and c). We ranked TFs by the enrichment of their motifs and kept only those TFs with motifs in at least 40% of discordant regions as ‘priority TFs’ (Fig. 4b, top region). No single motif was found in more 60% of the regions, suggesting that the discordant regions might be a collection of different functional programmes.

Next, we searched for ligand perturbations that could inhibit the expression of the selected TFs. To do this, we performed gene set enrichment analysis of the priority TFs against ligand perturbation databases using hypeR [55, 56] (Fig. 5a and Supplementary Table S8). Among the top hits resulting in down-regulation of the TFs in other cell types were multiple interleukins (TGF-β, IL-4, IL-10, IL-15) and steroid hormones (estradiol, testosterone), previously used by others for haematopoietic differentiation in vitro [57–64]. These observations support the relevance of our selection. To validate the computational predictions, we tested the inclusion of TGF-β1 and IL-4 on top of the best-performing treatment 5E. Previous research showed that TGF-β impairs the proliferation of stem and erythroid cell cultures [59]. To mitigate this, we opted for a low-dose and treated cells with 0.1 ng/ml TGF-β from day 1 or day 7. We analysed the conventional erythroid-associated markers CD71, CD36, and CD235a by flow cytometry to evaluate the degree of erythroid differentiation at day 7 (Round 1) and day 11 (Round 2) (Fig. 5c and d, and Supplementary Fig. S9a–c). Already at day 7, we observed that the addition of 0.1 ng/ml TGF-β to the previously top-performing protocol led to a significant increase of CD71^+^/CD235a^+^ cells (immature erythroblasts) and a reduction of CD71^−^/CD36^+^ cells compared to the control set (Fig. 5b and Supplementary Fig. S9b). The addition of IL-4 also modestly enhanced the proportion of CD71^+^/CD36^+^ double-positive erythroid progenitors. However, the effect of IL-4 on erythropoiesis was not evident anymore after Round 2, when assessing the marker expression by flow cytometry (Fig. 5c and Supplementary Fig. S9c). In contrast, the effects of TGF-β were even more pronounced at the end of the second round of differentiation (after 11 days), as shown by a significantly higher proportion of cells in an advanced maturation stage (CD71^+^/CD235a^+^). These effects were even stronger if TGF-β was administered from the beginning of differentiation (Fig. 5d), showing that differentiating cells had already advanced further on the erythropoietic trajectory at day 7 and consolidated in the CD71^+^/CD235a^+^ state at day 11 (Supplementary Fig. S9d). The improved differentiation outcome was not accompanied by a measurable loss in yield under using the low doses of TGF-β applied in our experiments (Supplementary Fig. S9e). To functionally test the effect of TGF-β and IL-4 on erythroid maturation, we measured the levels of HBB by RT-qPCR and observed that TGF-β increased HBB levels at day 11 and day 14 of differentiation (Fig. 5e). In agreement, pellets form TGF-β treatment appeared qualitatively redder, suggesting a higher haemoglobin content (Supplementary Fig. S10a). We then used Propidium Iodide (PI) and Hoechst staining to quantify the fraction of enucleated cells (indicating reticulocytes) and observed a mild, yet significant increase in double-negative cells at day 14 upon TGF-β treatment (Fig. 5f and Supplementary Fig. S10b); similarly, we observed a higher proportion of CD235a^+^ cells (Supplementary Fig. S10c). Finally, we performed Wright–Giemsa staining to measure cell morphology (Fig. 5g). In agreement with our flow cytometry analyses, cytospins showed a higher proportion of orthochromatic and polychromatic cells (late-stage erythroblasts) when cells were treated with TGF-β (Fig. 5h). These functional assays support the notion that our data-driven approach resulted in functionally more mature, differentiated cells.

*Data-driven selection of protocol modifications to correct chromatin mismatches between in vitro and in vivo cells. (a) Gene set overlap analysis comparing TFs associated with the discordant-enriched motifs in Fig. 4b with genes affected by perturbations ligand perturbations from the Enrichr database (collections: ‘Ligand_Perturbations_from_GEO_down/up’) using hypeR (background: all 676 motifs in the HOCOMOCO database; all tested gene sets were restricted to genes in this background). Gene sets were ranked by the Jaccard index. Overlap results are provided in Supplementary Table S8. (b) Schematic overview of validation experiments for data-driven improvements to Ery differentiation based on the enrichment results in panel (a). We tested the original protocol (8G), the best-performing protocol (5E), in comparison to a modified version of protocol 5E with the addition of IL-4 or TGF-β in both rounds of differentiation (IL-4_1, TGF-β_1) or only in Round 2 (IL-4_2, TGF-β_2). (c, d) Flow cytometry results comparing Ery markers at Round 1 (day 7, panel c) and Round 2 (day 11, panel d) of differentiation. Each panel shows, in order: the percentage of CD71/CD36-double-positive cells, of CD36-positive cells, of CD71-positive cells, and of CD71/CD235a-double-positive cells. Statistical analyses using ordinary one-way ANOVA. *P < .05; **P < .01; ***P < .001; ****P < .0001. (e) RT-qPCR analysis of HBB expression at day 11 and 14. Bars represent means and standard deviations of three biological replicates. Statistical analyses using ordinary one-way ANOVA. **P < .01. (f) Quantification of enucleated cells as Hoechst/PI double-negative cells by flow cytometry. Bars represent means and standard deviations of three biological replicates. Statistical analyses using ordinary one-way ANOVA. ***P < .001. (g) Wright–Giemsa of cytospin preparations of cells at day 11 of erythroid differentiation. Late erythroblast cells are marked with an arrow. (h) Quantification of late erythroblast cells upon treatment with TGF-β or IL-4. Bars represent means and standard deviation of three biological replicates. One thousand cells were counted and staged per replicate. Statistical analyses using ordinary one-way ANOVA. P < .05. (i) Distribution of log2-fold changes per peak group (compared to all other peaks) in each adapted protocol compared to treatment 5E. Green and blue tones correspond to TGF-β and IL-4 treatment, respectively. Filled and dotted lines correspond to the treatment applied in both rounds (TGF-β_1, IL-4_1), or only at Round 2 (TGF-β_2, IL-4_2), respectively.

To assess whether the observed advancement of erythropoiesis was linked to changes at chromatin level, we performed ATAC-seq of at the end of Round 2 of the modified differentiation protocols in comparison to treatment 5E (data overview and quality metrics in Supplementary Table S1). We found that chromatin accessibility was globally very similar to the base protocol 5E (Supplementary Fig. S10d). In comparison to 5E, TGF-β supplementation throughout differentiation (TGF-β_1) yielded a further increase in the accessibility of concordantly open and a further decrease of accessibility in concordantly closed regions, respectively (Fig. 5i). Regions that previously retained accessibility (discordantly open), were less accessible. However, these changes were small, indicating that only a fraction of cells was affected (e.g. median log_2_ fold change TGF-β_1 versus 5E in discordant peaks = −0.2). Only 42 peaks were significantly affected in at least one of the adapted treatments (DESeq2 [40], FDR-adjusted P-value ≤.05; Supplementary Table S9). While most of these peaks had matches to motifs of priority TFs (Supplementary Fig. S10e), it is unclear from our data whether this had a functional relevance since the same motifs are common among peaks that are not significantly affected (FDR-adjusted P-value >0.05). Rather, the primary mode of action of the TGF-β on erythropoiesis might be not via directly altering chromatin but by modulation of the activities of TFs binding in those regions. Changes in chromatin would likely be a secondary effect resulting from advanced differentiation leading the progressive closing down of regions, like for instance for the promoter of Rho GTPase activating protein 10 (ARHGAP10), which is gradually closing throughout erythropoiesis in vivo and which is among the significantly affected discordant peaks in our data (Supplementary Fig. S10f).

Altogether, our results confirm that supplementation of our tested differentiation protocols with TGF-β improves erythroblast production. This example showcases how in-depth analysis of chromatin during in vitro differentiation not only allows to rank treatments by similarity to the reference, but can also provide a rational, data-driven shortlist of additional treatments tailored to overcome shortcomings in chromatin programming in differentiation protocols.

Discussion

In this study, we introduced an approach for optimizing directed differentiation of stem cells by iteratively approximating the chromatin state of target cells. Applied to the differentiation of HSCs into erythroblasts, we ranked differentiating cells using chromatin distance as a simple, objective measure of similarity. Our results indicate that a greedy strategy, which iteratively selects the treatment closest to the target at each step, yielded the optimal protocol, while simplifying the experimental design and reducing workload. The consistency between chromatin distance and conventional cell fidelity markers (CD36, CD71) at the end of differentiation validates this approach.

Moreover, with access to a full chromatin accessibility profile, we exposed residual discrepancies to the desired target cells even in the best-performing protocols. The observed deviations might indicate meta-stable intermediate cell states in which competing lineage programmes have not yet been inactivated completely [65–67]. Lacking continuous interactions with the surrounding niche to refine and stabilise signalling, the precision of differentiation trajectories diminishes in vitro. Additional rounds of differentiation or alternative treatments may be necessary to close this gap. Our analysis identified specific chromatin regions associated with myeloid progenitors that were incompletely closed in differentiated cells. Harnessing discrepancies in the footprints of specific TFs (e.g. STAT1/2 and IRF1/2/8), allowed evidence-based prioritisation of candidate perturbations that may help to achieve the desired erythroid cell state, amongst them IL-4 and TGF-β. Consistent with these bioinformatic predictions, low doses of TGF-β have previously been reported to promote erythropoiesis [58]. In contrast, it was also shown that high concentrations of TGF-β might downregulate erythrocyte production by inhibiting erythroblast proliferation [59]. Our validation experiments proved that supplementation of the best-performing protocol 5E (SCF, 3IU EPO) with low-dose TGF-β (0.1 ng/ml), especially if administered from the beginning of differentiation, fast-tracked erythroid differentiation with a significantly increased yield of CD71^+^/ CD235a^+^ cells already after 11 days of differentiation. This confirms the validity of our approach for data-driven prioritization of modifications to existing differentiation protocols.

Our work demonstrates the potential of chromatin-guided, greedy selection for simplifying and enhancing differentiation protocols. The methodology can be applied even at low sequencing depth lending itself to high multiplexing (Supplementary Fig. S4). We found chromatin distance anti-correlated with proximity to target cell type in multiple, but not all haemopoietic cell types (Supplementary Fig. S5), arguing the approach may be extended to other cell types that are challenging to produce in vitro, if appropriate care is taken. Moreover, the approach can conceptually also be extended to transdifferentiation and reprogramming—although further work is required to confirm validity in these settings. The main challenge to be solved for any new target cell type is to define the target chromatin state and a ‘neighbourhood’ of alternative (undesirable) cell fates, as a reference for chromatin distance. With the increasing availability of multimodal single-cell reference atlases, defining such cell states will soon be feasible even for cell types that were previously poorly described and without existing differentiation protocols [32, 35, 68–71]. A selection of treatment options (including culture components as tested here, but perhaps also including other chemical and mechanical stimulants, co-cultures, etc.) may be selected from prior knowledge or using data-driven selection between starting and target cells, as we have done here in the final steps of our protocols. The computational pipeline we developed facilitates these analyses and comparison of test samples to defined target cell types.

Despite the advantages presented here, our approach has limitations: first, we cannot exclude that the greedy selection may in some cases result in solutions which are not the global optimum. We have chosen a chromatin-based distance as an objective function because it captures the genome-wide meta-stable epigenetic state of cells and deemphasises temporary fluctuations [compared to messenger RNA (mRNA) expression]. While prior research argues for the superiority of chromatin accessibility measurements over mRNA expression for distinguishing haematopoietic cell types [31], we have not demonstrated this superiority in our experiments. However, no one-dimensional function can weigh all aspects of the underlying biology thoroughly; hence, complementary functional assays are important to evaluate the final cell products. Relatedly, we defined chromatin distance as a linear metric in Euclidean PCA space. We are aware that this is an oversimplification. In vivo, differentiation often proceeds via multiple progenitor stages that may not immediately resemble the final target cell. In such cases, chromatin distance can be used stepwise to approximate intermediates first before proceeding to the next differentiation process. In future, more complex, non-linear chromatin distance models could help directly factorise differentiation stages and technical covariates into a compound distance metric. Second, we did not directly address the heterogeneity of differentiated cells but rather opted for a bulk assay as a straightforward, cost-effective solution. This does not impair the search for homogeneous cell products, since the average genome-wide profiles of mixed cell populations will deviate from the specific target profile, which would lead to protocols resulting in mixed populations being deselected by our greedy optimisation strategy. In practice, we recommend pairing ATAC-seq with flow cytometric or imaging-based assays at single-cell resolution to confirm homogeneity of differentiated cells. Finally, scalability and practicability of the generation of products for cell therapies depends on technical factors beyond the phenotypic approximation.

As a proof of principle, we have decided to work here with an exhaustive setup testing all combinations of 8 × 8 treatments in two rounds of sequential treatments, and to trace back the data generated during the process to retrospectively evaluate what result the greedy selection would have yielded. In an actual application, this optimisation may be done in cycles for an arbitrary number of treatment rounds, evaluating the result after Round 1 first, restarting with the best treatment and varying treatments at Round 2, again picking the best solution, and so on. This works well with a limited number of rounds but remains a time-consuming procedure. Conceptually, treatment evaluation and readjustment may be done live throughout an ongoing experiment: our computational simulations (Supplementary Fig. S4) suggest that even low-coverage ATAC-seq (e.g. with current benchtop sequencing machines) would be sufficient to estimate chromatin distance, potentially allowing best-treatment selection in between steps of an ongoing differentiation protocol (typically 2–5 days). However, while this methodology allows prioritisation of candidate treatments, chromatin distance does not necessarily translate directly into physiological relevance. It is therefore crucial to test all prioritised protocols by validating protein markers and by examining functional characteristics of the generated cell products.

Supplementary Material

gkag124_Supplemental_Files

Bibliography71

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Mao J, Saiding Q, Qian S et al. Reprogramming stem cells in regenerative medicine. Smart Med. 2022;1:e 20220005. 10.1002/SMMD.20220005.39188749 PMC 11235200 · doi ↗ · pubmed ↗
2Cohen DE, Melton D. Turning straw into gold: directing cell fate for regenerative medicine. Nat Rev Genet. 2011;12:243–52. 10.1038/nrg 2938.21386864 · doi ↗ · pubmed ↗
3Cahan P, Li H, Morris SA et al. Cell Net: network biology applied to stem cell engineering. Cell. 2014;158:903–15. 10.1016/j.cell.2014.07.020.25126793 PMC 4233680 · doi ↗ · pubmed ↗
4Kirouac DC, Zandstra PW. The systematic production of cells for cell therapies. Cell Stem Cell. 2008;3:369–81. 10.1016/j.stem.2008.09.001.18940729 · doi ↗ · pubmed ↗
5Murry CE, Keller G. Differentiation of embryonic stem cells to clinically relevant populations: lessons from embryonic development. Cell. 2008;132:661–80. 10.1016/j.cell.2008.02.008.18295582 · doi ↗ · pubmed ↗
6Liu S, Striebel J, Pasquini G et al. Neuronal cell-type engineering by transcriptional activation. Front Genome Ed. 2021;3:715697. 10.3389/fgeed.2021.715697.34713262 PMC 8525383 · doi ↗ · pubmed ↗
7Morris SA, Cahan P, Li H et al. Dissecting engineered cell types and enhancing cell fate conversion via Cellnet. Cell. 2014;158:889–902. 10.1016/j.cell.2014.07.021.25126792 PMC 4291075 · doi ↗ · pubmed ↗
8Poldee S, Metheetrairut C, Nugoolsuksiri S et al. Optimization of an erythroid culture system to reduce the cost of in vitro production of red blood cells. Methods X. 2018;5:1626–32. 10.1016/j.mex.2018.11.018.30560064 PMC 6290122 · doi ↗ · pubmed ↗