An integrated study combining network toxicology, machine learning and molecular simulation reveals the molecular mechanisms of permanent hair dyes in breast cancer
Xiaolu Yang, Yilun Li, Tianqi Zhang, Binglu He, Jingyan Wang, Shiyu Zhang, Li Ma

TL;DR
This study combines machine learning and molecular simulations to uncover how permanent hair dye ingredients may contribute to breast cancer development.
Contribution
The study introduces an integrated approach using network toxicology and machine learning to identify key molecular targets linking hair dyes to breast cancer.
Findings
Eight key targets (e.g., HSP90AA1, CDK1) were identified as regulators of breast cancer progression linked to hair dye ingredients.
Disperse Yellow 3 showed the strongest binding affinity to key targets, indicating a strong association with breast cancer risk.
Machine learning confirmed the prognostic importance of SRC, HSP90AB1, HSP90AA1, and CDK1 in breast cancer.
Abstract
Permanent hair dyes have been linked to an increased risk of breast cancer (BC), though the underlying mechanisms remain unclear. To address this knowledge gap, our investigation employed an integrated approach combining network toxicology, molecular docking, molecular dynamics simulations, and machine learning to decipher the molecular mechanisms by which permanent hair dyes might promote BC pathogenesis. Five permanent hair dye ingredients classified by IARC as carcinogenic were included in this study: p-phenylenediamine, resorcinol, pyridine, Disperse Yellow 3, and HC Blue No. 2. These chemicals can regulate BC progression through various signaling pathways, with key core targets identified as HSP90AA1, HSP90AB1, ESR1, CDK1, STAT3, MAPK8, HDAC1, and SRC. A machine learning model comprising 128 algorithms confirmed that these eight targets possess strong prognostic predictive…
Taxonomy
Topics: Dyeing and Modifying Textile Fibers · Biological Stains and Phytochemicals · Skin Protection and Aging
Introduction
Individuals are exposed to a variety of chemicals in daily life, many of which are carcinogenic. However, a large number of harmful substances are hidden in daily necessities, and users are often unaware of them. Permanent hair dye is a typical example: this common product contains a variety of carcinogens [1, 2]. Permanent hair dyes, also known as oxidative hair dyes, rely on an oxidation process for coloring. Typically, they consist of three components: (1) intermediate agents like p-phenylenediamine (PPD); (2) coupling agents such as resorcinol (REN) and m-phenylenediamine; and (3) oxidizing agents like hydrogen peroxide. The intermediate and coupling agents together form dye precursors. During application, these precursors undergo oxidation in the presence of an oxidizing agent, resulting in the formation of colored macromolecules that are encapsulated within the hair, thereby altering its color [3]. Variations in the types and proportions of intermediates and coupling agents produce different shades.
More than 30% of women in Western countries use hair dyes [4], and with their widespread use, concerns about their potential health effects are growing. Some studies show that the use of permanent hair dye may be associated with breast cancer (BC), the most prevalent cancer in women globally [5]. One study reported that women who use hair dyes have a 23% higher risk of BC than non-users [6]. In addition, a prospective cohort study revealed that the use of permanent hair dyes increases the risk of BC in black women by 45% and in white women by 7% [7].
Although epidemiological studies have confirmed that permanent hair dyes are associated with an increased risk of BC, their potential mechanism remains unclear. The progress in the field of toxicology, especially the development of network toxicology, enables researchers to comprehensively analyse the impact of chemicals on the human body from a holistic perspective [8]. However, applying network toxicology alone can only identify the targets at which compounds act on diseases, without further determining the impact of these targets on disease prognosis. To better assess the impact of compounds on diseases and even disease prognosis, we present for the first time an integrated computational framework that combines network toxicology, machine learning, molecular docking, and molecular dynamics simulations. This integrated approach not only identifies compounds' potential core targets and signaling pathways in diseases but also validates their prognostic relevance in disease contexts. Overall, this multidisciplinary methodology will yield new insights into the safety of permanent hair dyes.
Methods
Identification of targets for carcinogenic chemicals in permanent hair dyes
First, we identified the chemical components contained in permanent hair dyes through a comprehensive literature review and evaluated their toxicity using the PubChem database. All compounds classified as carcinogens in the IARC registry were included in the analysis. Ultimately, five compounds were selected for further study: PPD, REN, pyridine (PYD), Disperse Yellow 3 (DY3), and HC Blue No. 2 (HB2). These chemicals are summarized in Table 1. Potential targets for these chemicals were retrieved from three databases: SEA (https://sea.bkslab.org/), STP (http://swisstargetprediction.ch/), and STITCH (http://stitch.embl.de/) [9]. All three databases provide target information for compounds.
Table 1. Carcinogenic chemicals in permanent hair dyes
Chemical              Molecular formula    Molecular weight (g/mol)
p-phenylenediamine    C6H8N2               108.14
Resorcinol            C6H6O2               110.11
Pyridine              C5H5N                79.10
Disperse Yellow 3     C15H15N3O2           269.3
HC Blue No. 2         C12H19N3O5           285.3
Confirmation of a common target for permanent hair dyes and BC
To identify BC-related targets, the GeneCards (https://www.genecards.org/; relevance score of 10 or more), TTD (https://db.idrblab.net/ttd/), and OMIM (https://www.omim.org/) databases were searched using the keyword "breast cancer" [10]. The VennDiagram package in R (version 4.2.1) was used to find overlapping targets between permanent hair dye chemicals and BC.
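The overlap step amounts to a set intersection of gene symbols. The following is a minimal sketch in Python rather than the R VennDiagram workflow described above; the function name `shared_targets` and the toy gene lists are hypothetical and serve only as illustration.

```python
def shared_targets(compound_targets, disease_targets):
    """Return gene symbols found in both collections, sorted for stable output.

    Symbols are stripped and upper-cased before comparison. This is a
    simplifying assumption so that entries exported from different
    databases match despite formatting differences.
    """
    def normalize(genes):
        return {g.strip().upper() for g in genes if g.strip()}

    return sorted(normalize(compound_targets) & normalize(disease_targets))


# Hypothetical toy lists, not the 418 and 3,508 targets from the study.
dye_targets = ["HSP90AA1", "cdk1", "SRC", "CYP1A1 "]
bc_targets = ["SRC", "ESR1", "CDK1", "HSP90AA1", "BRCA1"]

overlap = shared_targets(dye_targets, bc_targets)
print(overlap)  # ['CDK1', 'HSP90AA1', 'SRC']
```

Sorting the result keeps downstream lists reproducible between runs, which helps when the intersection is passed on to enrichment and network tools.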
Enrichment analysis
Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were performed on the intersecting targets to explore the biological implications of permanent hair dyes in BC.
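Over-representation of a GO term or KEGG pathway is commonly scored with a one-sided hypergeometric test. The source does not state which tool or test was used, so the sketch below is an assumption about the general technique, not the authors' pipeline; the function name and all numbers in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(background, in_term, selected, overlap):
    """One-sided hypergeometric p-value P(X >= overlap).

    background: number of genes in the reference set
    in_term:    background genes annotated to the term or pathway
    selected:   size of the query gene list
    overlap:    query genes annotated to the term or pathway
    """
    upper = min(in_term, selected)
    total = comb(background, selected)
    # math.comb returns 0 when k > n, so impossible draws contribute nothing.
    tail = sum(
        comb(in_term, k) * comb(background - in_term, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 203 query genes, 15 of them in a 300-gene pathway,
# against an assumed background of 20,000 genes.
p = enrichment_p_value(background=20000, in_term=300, selected=203, overlap=15)
print(f"p = {p:.3e}")
```

In practice the resulting p-values are corrected for multiple testing across all terms before the top 20 are reported.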
Construction of the protein–protein interaction (PPI) network
A PPI network for the intersecting targets was constructed using the STRING database with a minimum interaction score of 0.9, visualized with Cytoscape (version 3.9.1).
Identification of hub targets
The network analysis tool in Cytoscape software was used to analyze the topological parameters of the PPI network. Targets with degree and betweenness centrality values greater than twice the mean were considered potential core targets, resulting in the identification of eight such targets. To evaluate the prognostic value of the eight core targets, we constructed 128 ensemble models using 10 machine learning algorithms (Supplementary Table S1), trained them on the GSE20685 dataset, and validated them on two independent cohorts (GSE16446 and GSE48390), with the predictive performance assessed by the average AUC value. Ultimately, the three datasets were merged and normalized to eliminate batch effects, followed by a SHapley Additive exPlanations (SHAP) analysis to quantify the contribution of each target. Additionally, we further analyzed the differential expression of these core targets in BC and normal breast tissue, as well as their association with prognosis, using the TCGA-BRCA cohort.
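Ranking models by their average AUC across cohorts can be sketched as follows. This is a simplified illustration that assumes a binary outcome label per patient (the study's prognostic models may use time-to-event outcomes instead); the function names and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_auc(cohorts):
    """Average AUC over a dict mapping cohort name -> (scores, labels)."""
    values = [auc(scores, labels) for scores, labels in cohorts.values()]
    return sum(values) / len(values)


# Hypothetical risk scores for two toy cohorts (not GEO data).
toy_cohorts = {
    "cohort_a": ([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]),
    "cohort_b": ([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0]),
}
print(mean_auc(toy_cohorts))  # 0.875
```

Averaging over cohorts rewards models that generalize to the validation sets rather than ones that only fit the training cohort.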
Single-cell analysis
Four targets were identified as key targets closely associated with BC prognosis: HSP90AA1, HSP90AB1, CDK1, and SRC. We analyzed the expression levels of these four targets in different cell types within BC using the single-cell sequencing dataset GSE161529.
Molecular docking analysis
Molecular docking was performed to predict the binding affinities between the five hair dye constituents and the four core protein targets [11]. The three-dimensional crystal structures of the target proteins—HSP90AA1 (PDB ID: 1BYQ), HSP90AB1 (PDB ID: 1QZ2), CDK1 (PDB ID: 4Y72), and SRC (PDB ID: 1A07)—were retrieved from the Protein Data Bank (https://www.rcsb.org/). Protein structures were preprocessed by removing water molecules and adding hydrogen atoms using PyMOL. The SDF files of the chemical ligands were obtained from PubChem and subsequently subjected to energy minimization using Chem3D [12]. All docking simulations were carried out using AutoDockTools (version 1.5.7) with parameters detailed in Supplementary Table S2. To validate the reliability of our docking protocol, we performed a redocking procedure. The native ligand was re-docked into its original active site using the parameters described above. The root-mean-square deviation (RMSD) between the redocked pose and the original crystallographic pose was calculated for each protein. All RMSD values were below 2.0 Å, confirming the accuracy of our docking workflow (Supplementary Table S3). Following this validation, molecular docking of the five hair dye chemicals was conducted. The resulting protein–ligand complexes were visualized, and their binding energies were calculated using PyMOL.
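The redocking check compares the redocked pose with the crystallographic pose atom by atom. A minimal sketch of that RMSD calculation is given below, assuming both poses list the same atoms in the same order and share the receptor's coordinate frame, so no superposition is applied; the function name and coordinates are hypothetical.

```python
from math import sqrt


def pose_rmsd(reference, docked):
    """RMSD in Å between two poses given as lists of (x, y, z) tuples.

    Atoms must appear in the same order in both lists. The poses are
    compared in place, without any alignment step.
    """
    if not reference or len(reference) != len(docked):
        raise ValueError("poses must be non-empty and contain the same atoms")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(reference, docked)
        for a, b in zip(p, q)
    )
    return sqrt(squared / len(reference))


# Hypothetical three-atom ligand shifted by 1 Å along x.
crystal_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
redocked_pose = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 1.0, 0.0)]

rmsd = pose_rmsd(crystal_pose, redocked_pose)
print(rmsd, rmsd < 2.0)  # 1.0 True
```

Skipping the alignment is deliberate here: a redocked ligand that has the right shape but sits in the wrong place should be penalized, which a superposed RMSD would hide.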
Molecular dynamics simulation
Molecular dynamics simulations were carried out with GROMACS 2023 for 100 ns at 300 K and 1 bar. The CHARMM36 force field parameters were applied to the proteins, while ligand topologies were generated using GAFF2 [13]. Electrostatic interactions were modelled with particle mesh Ewald and Verlet algorithms, and a 1.0 nm cutoff was used for van der Waals and Coulomb interactions.
Results
Identifying potential targets for permanent hair dyes and BC
After removing duplicate targets, a total of 418 targets for the five chemical components of permanent hair dyes were identified from the SEA, STP, and STITCH databases. Additionally, 3,508 BC-related targets were retrieved from the GeneCards, TTD, and OMIM databases. Integration of these targets resulted in 203 intersecting targets, considered potential mediators of the carcinogenic effects of permanent hair dyes on BC (Fig. 1). A complete list of the targets for permanent hair dyes, BC, and their intersections is available in Supplementary Table S4.
Fig. 1. Common targets associated with permanent hair dyes and breast cancer
Constructing a network of permanent hair dyes and potential targets
To investigate the impact of permanent hair dyes on BC, a network was constructed linking the five carcinogenic chemicals to the 203 intersecting genes (Fig. 2). Among the five chemicals, DY3 exhibited the strongest association with BC, targeting 79 genes, followed by HB2 (68), REN (46), PYD (34), and PPD (29).
Fig. 2. Relationship between permanent hair dyes and their corresponding targets
Enrichment analyses
GO and KEGG analyses were conducted to explore the functions and pathways influenced by these chemicals. A total of 2789 GO terms were identified, including 2488 biological processes (BPs), 92 cellular components (CCs), and 209 molecular functions (MFs) (Supplementary Table S5). The top 20 GO terms are visualized in Fig. 3A–3C. Furthermore, the five chemicals affected 168 KEGG pathways, with several cancer-related pathways among the top 20, such as MAPK signaling, PI3K-Akt signaling, the cell cycle, and apoptosis (Fig. 3D). Eighteen of these pathways are directly linked to cancer, including those related to prostate, pancreatic, breast, and thyroid cancers (Supplementary Table S5).
Fig. 3. Enrichment analysis of 203 common targets. A–C Top 20 items in A BP, B CC, and C MF affected by permanent hair dyes in BC. D Top 20 KEGG pathways influenced by permanent hair dyes in BC
Building the PPI networks and identifying potential core targets
To elucidate the mechanisms by which these chemicals contribute to BC, a PPI network was generated using the STRING database for the 203 intersecting genes (Fig. 4A) and visualized in Cytoscape (Fig. 4B). Larger nodes and darker colors indicate higher degree values. Core targets were identified by analyzing the topological parameters of the PPI network, including degree and betweenness centrality, both indicative of node importance [14]. The average degree value was 6.80, and the average betweenness centrality was 0.12. Eight targets with both degree and betweenness centrality values greater than twice the mean were selected as potential core targets: HSP90AA1, HSP90AB1, ESR1, CDK1, STAT3, MAPK8, HDAC1, and SRC (Fig. 4C).
Fig. 4. PPI network and identification of potential core targets for 203 common targets. A PPI network for 203 common targets. B Processed PPI network visualized using Cytoscape software. C Potential core targets obtained by screening
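The screening rule (degree and betweenness centrality both greater than twice the network mean) can be expressed as a short filter. The sketch below assumes the two centrality measures have already been exported per node, for example from Cytoscape; the function name, node labels, and values are hypothetical.

```python
def select_hubs(degree, betweenness, factor=2.0):
    """Return nodes whose degree AND betweenness both exceed factor x mean.

    degree and betweenness are dicts mapping node name -> value. Nodes
    missing from the betweenness dict are treated as having 0.0.
    """
    mean_degree = sum(degree.values()) / len(degree)
    mean_betweenness = sum(betweenness.values()) / len(betweenness)
    return sorted(
        node
        for node in degree
        if degree[node] > factor * mean_degree
        and betweenness.get(node, 0.0) > factor * mean_betweenness
    )


# Hypothetical five-node network summary (not values from the study).
toy_degree = {"G1": 10, "G2": 1, "G3": 1, "G4": 1, "G5": 7}
toy_betweenness = {"G1": 0.6, "G2": 0.0, "G3": 0.0, "G4": 0.0, "G5": 0.4}

print(select_hubs(toy_degree, toy_betweenness))  # ['G1']
```

With the means reported above (6.80 for degree and 0.12 for betweenness centrality), the same rule corresponds to cutoffs of 13.6 and 0.24.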
Machine learning identifies 4 key targets closely associated with BC prognosis
To comprehensively assess the relationship between the eight potential core targets and prognosis, we constructed 128 machine learning models. As illustrated in Fig. 5A, models based on these targets effectively predicted patient outcomes. The ensemble model combining glmBoost and Random Forest (RF) demonstrated the best performance, achieving an AUC of 0.733. Next, we merged the three datasets and normalized the gene expression matrix to eliminate batch effects. Principal component analysis (PCA) indicated that the batch effects in the three datasets were effectively reduced (Fig. 5B–C). We quantified the contribution of these eight targets to the model using SHAP analysis. The four targets with the highest contributions were SRC, HSP90AB1, HSP90AA1 and CDK1 (Fig. 5D–E). Force-directed analysis further demonstrated that these four targets are primary negative regulators of the SHAP value, indicating a negative correlation with prognosis in patients with BC (Fig. 5F).
Fig. 5. Identification of core prognostic genes in breast cancer (BC) by integrating machine learning and SHAP analysis. A Performance of machine learning algorithms in predicting prognosis, quantified by mean AUC values. B–C PCA scatter plots of the dataset (B) before and (C) after normalization, illustrating the effect of normalization in mitigating batch effects. D, E SHAP analysis evaluating the contribution of the eight core features, represented as a (D) bar chart and an (E) beeswarm plot. F SHAP summary plot depicting feature-wise impact on model predictions
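SHAP attributions are based on Shapley values, which average a feature's marginal contribution over all subsets of the other features. The sketch below computes exact Shapley values for a toy value function; it illustrates the underlying definition only, is not the SHAP library's estimation procedure, and uses hypothetical feature names and payoffs.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley value of each feature for a set-valued payoff function.

    value_fn takes a frozenset of feature names and returns a number.
    The cost grows exponentially with the number of features, so this is
    only practical for small sets such as eight targets.
    """
    n = len(features)
    result = {}
    for feature in features:
        others = [f for f in features if f != feature]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {feature}) - value_fn(s))
        result[feature] = total
    return result


def toy_payoff(subset):
    """Hypothetical payoff: two additive effects plus one interaction."""
    score = 0.0
    if "g1" in subset:
        score += 2.0
    if "g2" in subset:
        score += 1.0
    if "g1" in subset and "g3" in subset:
        score += 1.0
    return score


phi = shapley_values(toy_payoff, ["g1", "g2", "g3"])
print(phi)  # g1 = 2.5, g2 = 1.0, g3 = 0.5 (up to floating-point rounding)
```

The values sum to the payoff of the full feature set minus that of the empty set, which is the property that lets SHAP bar charts be read as a decomposition of the model output.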
We further used the TCGA-BRCA cohort to explore the relationship between these targets and BC. Among the eight core targets, the expression levels of HSP90AA1, HSP90AB1, ESR1, CDK1, HDAC1 and SRC were significantly higher in BC tissue than in normal breast tissue (Fig. 6A–6F). In contrast, the expression level of MAPK8 was lower in BC tissue (Fig. 6G), and the expression of STAT3 did not differ significantly between BC tissue and normal tissue (Fig. 6H). In the prognostic analysis, high expression levels of HSP90AA1, HSP90AB1, CDK1 and SRC were negatively correlated with the overall survival of patients with BC (Fig. 6I–6L), while the expression levels of the remaining four core targets were not related to survival in BC (Fig. 6M–6P). This finding is consistent with the results of machine learning and SHAP analysis, which further confirms the correlation between these four key targets (HSP90AA1, HSP90AB1, CDK1 and SRC) and the poor prognosis of BC.
Fig. 6. Core gene expression, prognostic correlation, and cellular profiling in BC. A–H Expression levels of eight targets (A HSP90AA1, B HSP90AB1, C ESR1, D CDK1, E HDAC1, F SRC, G MAPK8, and H STAT3) in BC samples from the TCGA-BRCA cohort. I–P Prognostic associations of gene expression for I HSP90AA1, J HSP90AB1, K CDK1, L SRC, M STAT3, N MAPK8, O HDAC1, and P ESR1 in the TCGA-BRCA cohort. Q The cellular landscape of BC in the GSE161529 cohort. R The expression profiles of four core targets in BC. (**P < 0.01; ***P < 0.001; ns: no significance)
Expression profile of core targets in BC
The cellular landscape of the BC microenvironment is depicted in Fig. 6Q. Corresponding expression profiling of the four core targets (Fig. 6R) revealed that HSP90AA1 and HSP90AB1 were widely expressed across both epithelial and mesenchymal cancer cells. In contrast, CDK1 expression was predominantly localized to epithelial cells and T cells, while SRC was mainly detected in epithelial cells and tumor-associated macrophages.
Molecular docking analysis
Through bioinformatics analysis, we identified HSP90AA1, HSP90AB1, CDK1, and SRC as core targets, as they are highly expressed in BC and closely associated with poor prognosis. To investigate the relationship between these four core targets and the five chemical compounds, we performed molecular docking between the targets and the compounds and calculated their binding energies. Figure 7 displays the binding energies of these interactions, with values below -5.5 kcal/mol indicating strong binding affinity between the target and compound [15]. Strong binding was observed for CDK1-DY3, HSP90AA1-DY3, CDK1-HB2, HSP90AB1-DY3, SRC-DY3, HSP90AA1-HB2, CDK1-REN, and SRC-REN. Notably, DY3 showed strong binding affinity with all four core targets. Figure 8 displays visualized molecular docking images, revealing hydrogen bonds formed in all eight complexes (Table 2). Crucially, we observed that the hydrogen bond-forming sites between the compound and the protein are located within the protein's functional domains. This suggests that the compound can bind to the protein and exert its effects by influencing the protein's biological functions.
Fig. 7. The binding energy of molecular docking between five chemicals and four core targets
Fig. 8. Molecular docking results of chemicals binding to targets. A–H Molecular docking plots: A CDK1-DY3, B CDK1-HB2, C CDK1-REN, D HSP90AA1-DY3, E HSP90AA1-HB2, F HSP90AB1-DY3, G SRC-DY3, and H SRC-REN
Table 2. Number and position of hydrogen bonds in molecular docking
Receptor-ligand complex    Number of hydrogen bonds    Hydrogen-bonding residues
CDK1-DY3                   3                           ASP-128, GLN-132, ASP-146
CDK1-HB2                   4                           GLN-184, SER-227
CDK1-REN                   1                           ASP-146
HSP90AA1-DY3               1                           ASN-106
HSP90AA1-HB2               4                           ASN-51
HSP90AB1-DY3               1                           SER-224
SRC-DY3                    2                           PTR-101, ASN-201
SRC-REN                    4                           ARG-178, THR-182, THR-183
Molecular dynamics simulations
Molecular docking analysis indicated that DY3 has a strong binding ability with four core targets. In order to further explore its binding stability, we simulated the molecular dynamics of four complexes. The RMSD was used to evaluate the conformational stability of proteins and ligands. The smaller the deviation value, the higher the conformational stability of the protein–ligand complex. The structural changes of the protein–ligand complexes were evaluated by the radius of gyration (Rg). The smaller the Rg value, the more compact the structure. It was shown that the CDK1-DY3, HSP90AA1-DY3, and SRC-DY3 complexes quickly reached equilibrium during the simulation, with final RMSD values below 5 Å (Fig. 9A). Furthermore, the Rg values of these three complexes remained stable throughout the simulation, indicating that they were tightly packed and stably bound (Fig. 9B). In contrast, the RMSD and Rg values for HSP90AB1-DY3 fluctuated during the simulation. The solvent-accessible surface area (SASA) was used to evaluate protein folding and stability, and the SASA values for all four complexes remained stable during the simulations (Fig. 9C). Moreover, we employed the root mean square fluctuation (RMSF) metric to assess the flexibility of amino acid residues in proteins. The results revealed that the RMSF values for the complexes were predominantly below 5 Å, further corroborating the stability of the protein–ligand interactions (Fig. 9D). Additionally, hydrogen bonding plays a critical role in ligand–protein interactions. Figure 9E illustrates the number of hydrogen bonds between DY3 and the proteins during the simulations. Hydrogen bonds were consistently formed between DY3 and the four proteins, with at least two hydrogen bonds observed at most time points, suggesting stable interactions. Overall, the four complexes exhibited strong stability during molecular dynamics simulations, with the binding of DY3 to CDK1, HSP90AA1, and SRC being particularly stable.
Fig. 9. Molecular dynamics simulations of DY3 binding with four core targets. A–E Values for the four complexes: A RMSD, B Rg, C SASA, D RMSF and E Number of Hbonds
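The radius of gyration used above as a compactness measure is the mass-weighted root-mean-square distance of the atoms from their center of mass. The sketch below computes it for a single frame; the function name and the two-atom example are hypothetical, and a trajectory analysis would repeat this per frame. GROMACS reports lengths in nm, so its output would be multiplied by 10 before comparison with values quoted in Å.

```python
from math import sqrt


def radius_of_gyration(coords, masses):
    """Mass-weighted radius of gyration for one frame.

    coords: list of (x, y, z) tuples; masses: list of atomic masses.
    The result has the same length unit as the coordinates.
    """
    if not coords or len(coords) != len(masses):
        raise ValueError("coords and masses must be non-empty and equal in length")
    total_mass = sum(masses)
    center = tuple(
        sum(m * c[i] for m, c in zip(masses, coords)) / total_mass
        for i in range(3)
    )
    weighted_sq = sum(
        m * sum((c[i] - center[i]) ** 2 for i in range(3))
        for m, c in zip(masses, coords)
    )
    return sqrt(weighted_sq / total_mass)


# Hypothetical frame: two unit-mass atoms 2 Å apart.
rg = radius_of_gyration([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [1.0, 1.0])
print(rg)  # 1.0
```

A stable Rg over time indicates that the complex neither unfolds nor collapses, which is why it is read alongside RMSD rather than in isolation.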
Discussion
In modern society, the use of hair dyes has become increasingly common, with individuals starting to color their hair at younger ages. This trend increases the possibility of long-term exposure to certain chemicals in hair dyes that have been proven to pose health risks [1]. Hair dyes are associated with a variety of health problems, including allergic reactions, hair loss, and respiratory disorders [16–19]. In addition, research shows that the use of permanent hair dyes is associated with an increased risk of a variety of cancers, including bladder cancer, hematopoietic cancer, and BC [20–22]. Despite the existence of these associations, the potential mechanism is still unclear. By integrating network toxicology, molecular docking, molecular dynamics simulation and bioinformatics technology, this study has preliminarily revealed the mechanism by which permanent hair dyes may induce BC.
Through a comprehensive literature search and screening, five carcinogenic chemicals commonly found in permanent hair dyes were identified: PPD, REN, PYD, DY3, and HB2. These chemicals are classified as carcinogens by IARC [23]. Network toxicology analyses indicated that these chemicals may regulate the progression of BC through multiple signalling pathways, and their core targets include HSP90AA1, HSP90AB1, ESR1, CDK1, STAT3, MAPK8, HDAC1, and SRC. Further bioinformatics screening identified HSP90AA1, HSP90AB1, CDK1 and SRC as core targets because they are highly expressed in BC tissue and closely associated with poor prognosis. Molecular docking and molecular dynamics simulations further confirmed that DY3 exhibits the highest binding affinity with these four targets, making it the compound most strongly associated with BC risk.
These four core targets all play important biological roles in the human body. Specifically, HSP90AA1 and HSP90AB1, as central components of the heat shock protein family, function as essential molecular chaperones that facilitate the folding, stability, and maturation of a wide range of client proteins—many of which are implicated in oncogenic signaling [24]. By coordinating multiple regulatory pathways, they help maintain proteostasis and regulate gene expression under physiological and stress conditions [25]. Interference with HSP90 function may lead to the ubiquitination and degradation of its client proteins, thereby disrupting key survival and proliferation pathways in breast tissue [26]. CDK1, as a key regulator of the G2/M transition in the cell cycle, belongs to the cyclin-dependent kinase family [27]. The potential impact of permanent hair dye components on CDK1 may induce cell cycle arrest at the G2/M checkpoint. In normal breast epithelial tissue, persistent cell cycle arrest—particularly within stem or progenitor cell populations—may promote genomic instability, thereby increasing cancer risk [28]. SRC, the first identified proto-oncogene in mammals and a non-receptor tyrosine kinase, serves as a signaling hub governing proliferation, adhesion, and survival [29]. Elevated or sustained SRC activation is widely recognized as a driver of tumor initiation and progression [30]. The strong binding affinity observed between permanent hair dye components and SRC raises the possibility of modulated kinase activity, which may alter downstream signaling cascades, further influencing breast cell fate.
Although our study predicts high-affinity binding between certain permanent hair dye components and carcinogenic targets, actual toxicity in practical applications depends on multiple factors, including dermal absorption, systemic distribution, and metabolic detoxification processes. Nevertheless, given the frequency and chronicity of hair dye use—often spanning decades—even low-level exposure could lead to bioaccumulation or sustained pathway modulation, meriting careful evaluation.
Limitations
This study provides only preliminary insights into the potential mechanisms by which permanent hair dyes may induce BC, and several limitations remain. First, more epidemiological studies are needed to strengthen the link between exposure to these chemicals and BC incidence. Second, as this study is computational in nature, the predictive results obtained must be validated through experimental methods (such as in vitro binding assays and toxicokinetic verification) to confirm the toxicological effects of permanent hair dyes on humans.
Conclusion
In conclusion, this study combines multiple approaches to investigate the effects of permanent hair dyes on BC. Five carcinogenic chemicals were identified, with DY3 showing the strongest association with BC risk. Four core targets (HSP90AA1, HSP90AB1, CDK1, and SRC) were found to be closely associated with BC. Molecular docking indicated strong binding of DY3, HB2, and REN to these targets, with DY3 showing the most potent interaction.
Supplementary Information
Supplementary Material 1. Supplementary Material 2. Supplementary Material 3. Supplementary Material 4. Supplementary Material 5.
