From haze to horizon: epigenetic research and artificial intelligence in child and adolescent psychiatry
Yulia Golub, Antje Wulff, Torsten Plösch

Carl von Ossietzky Universität Oldenburg
As in decades past, methods for diagnosing child and adolescent psychiatric (CAP) disorders still depend largely on patient histories collected by physicians and on self-reported questionnaires from patients, so that CAP assessments remain largely “expert-based diagnostic” and “expert-based treatment” decisions.
This persists despite significant advances in our understanding of CAP risk and protective factors and of environment-genome interactions, and despite the development of machine learning and artificial intelligence (AI) to support decision making.
How can we leverage this knowledge to enhance our understanding of CAP disorders, thereby advancing both clinical practice and scientific progress?
Numerous environmental risk and protective factors for CAP disorders have been described, such as family interactions, peer relationships, school environment, significant life events, parental history, socioeconomic status, early life adversities, and many others [9]. However, exactly how different CAP conditions can emerge from common, non-specific environmental influences remains poorly understood. The prevailing belief is that these environmental factors interact with other critical elements, such as a child’s genetic predispositions and developmental stage (i.e., the timing of exposure in relation to the child’s level of maturation) [1], to cause a CAP disorder.
Epigenetic processes, which are heritable changes in gene expression that do not involve alterations to the underlying DNA sequence, are potential mechanisms through which the genome can ‘capture’ the effects of environmental exposures and propagate their influence. Of these, DNA methylation is currently the most widely investigated and best understood. It involves the addition of methyl groups to DNA, which typically suppresses gene expression without altering the underlying genetic sequence (as reviewed by Greenberg & Bourc’his [5]).
Research has demonstrated that DNA methylation (1) is responsive, starting in utero, to environmental factors such as dietary, chemical, and psychosocial exposures [8]; (2) is temporally dynamic, playing a critical role in (neuro)development [10]; and (3) is linked, when abnormal, to a wide range of health outcomes, including psychiatric disorders [2].
Indeed, over the past two decades, epigenetic research in CAP has collected extensive data that could help resolve the question of how environmental influences can result in the development of CAP disorders. One of many examples is the set of DNA methylation changes that follow childhood trauma, which are known to predict emotional and behavioral problems after trauma exposure [15]. Another example is the methylation signatures that mediate environmental impact in the context of polygenic risk leading to bipolar disorder in adolescents [7]. This knowledge of environmental impact and epigenetic mechanisms has the potential to redefine CAP by enhancing primary and secondary prevention, improving diagnosis, and identifying novel treatment targets. And yet these insights have to date had little practical value: they do not effectively contribute to diagnostic processes, identify therapeutic targets, or support the development of personalized medicine by tailoring treatment to individual biological profiles.
So why do this scientific knowledge and the data collected not advance CAP clinical and research work as effectively as expected?
We identify three fundamental challenges: (1) the limited connection between our clinical labels and the underlying biological processes; (2) the current approach to conducting epigenetic studies; and, last but not least, (3) the way we generate and analyze Big Data in CAP.
Clinical labels
Can you imagine enrolling patients in a clinical study of type 2 diabetes treatment based on how they feel after a meal? Yet this is largely how we perform CAP research. There is only a limited relationship between our clinical labels, such as “depression” or “ADHD”, and the underlying biological processes. It is highly unlikely that biologically meaningful connections will arise from studies based on ICD or DSM labels. Without biologically determined, distinct phenotypes, how can we expect to find consistent relationships between environmental exposures, epigenetic modifications, and behavior? In our opinion, this fundamental issue has been well addressed by the Research Domain Criteria (RDoC) initiative of the National Institute of Mental Health. RDoC organizes research on mental illness by looking at dimensions of functioning rather than being tied to categorical diagnoses. It is a dynamic structure that currently focuses on six major domains of human functioning: negative valence systems, positive valence systems, cognitive systems, social processes, arousal and regulatory systems, and sensorimotor systems. Each domain contains several constructs that comprise different aspects of function and span from “normal to abnormal”, with the understanding that each point on this continuum is affected by environmental and neurodevelopmental contexts [11].
We therefore strongly suggest that more CAP studies apply RDoC in order to connect dimensions of functioning with environmental factors and epigenetic data.
Epigenetic study design
Epigenetic studies in CAP face several methodological challenges, with the most severe being the DNA methylation analysis per se, the establishment of suitable cohorts, and the selection of appropriate sample sources.
DNA methylation can be assessed at different levels, from targeted, gene-specific approaches (e.g., pyrosequencing), via “genome-scale” approaches (microarrays, reduced representation bisulfite sequencing (RRBS)), to full genome-wide next-generation sequencing methods. In cohort studies, the current standard is the Illumina 450k or 850k array: it is relatively cheap, has a standardized setup, and is supported by a wide range of software for quality control and data analysis. However, even the large arrays with (roughly) 850k CpG positions represent only a tiny proportion of a typical human epigenome (> 32 million CpGs [4]). There is a good chance that we miss important DNA methylation changes because they belong to the roughly 97% of CpG positions not covered by these arrays [12].
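The coverage figure above follows from simple arithmetic. The sketch below only restates it, taking the approximate probe counts (450k, 850k) and the CpG count (32 million) from the text as given; the function and constant names are illustrative and not part of any array software.

```python
# Approximate figures taken from the text; actual probe counts differ slightly.
TOTAL_CPGS = 32_000_000  # CpG sites in a human genome (the text gives "> 32 million")
ARRAY_PROBES = {"450k": 450_000, "850k": 850_000}


def uncovered_fraction(n_probes: int, total_cpgs: int = TOTAL_CPGS) -> float:
    """Return the share of CpG sites that an array with n_probes does not measure."""
    return 1.0 - n_probes / total_cpgs


for name, n_probes in ARRAY_PROBES.items():
    print(f"{name}: {uncovered_fraction(n_probes):.1%} of CpG sites not covered")
```

With these inputs the 850k array leaves about 97% of CpG positions unmeasured, and the 450k array leaves slightly more. Because 32 million is a lower bound in the text, the true uncovered share can only be larger.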
Regarding study setup, there are two classic scenarios. In the first, researchers define a case-control study, which is typically well characterized, with clear clinical definitions. However, the sample size is often small, so that findings lose significance once the array data are corrected for multiple testing (i.e., 450k or 850k tests). Therefore, the second scenario is often applied: epigenome-wide association studies involving thousands or hundreds of thousands of participants. This design provides the statistical power needed to survive correction for multiple testing but introduces bias, because samples pooled from multiple study centers are heterogeneous and poorly characterized. Obtaining comprehensive data on psychiatric conditions, let alone on dimensions of functioning, is challenging in large multicenter cohorts.
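To make the multiple-testing penalty concrete, the following sketch computes a per-test significance threshold and an approximate sample size. It rests on several assumptions that are not in the text: Bonferroni correction (one common, conservative choice), a two-sided comparison of two groups at a single CpG, a standardized effect size of 0.5, and 80% power under a normal approximation.

```python
from statistics import NormalDist


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def n_per_group(effect_size: float, alpha: float, power: float = 0.8) -> float:
    """Approximate participants per group for a two-sided, two-sample comparison.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    where d is the standardized mean difference (Cohen's d).
    """
    z = NormalDist()
    z_alpha = -z.inv_cdf(alpha / 2)  # upper critical value, computed in the lower tail
    z_power = z.inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2 / effect_size**2


# Effect size 0.5 is a hypothetical value chosen for illustration only.
for n_tests in (1, 450_000, 850_000):
    alpha = bonferroni_threshold(n_tests)
    n = n_per_group(effect_size=0.5, alpha=alpha)
    print(f"{n_tests:>7} tests: threshold {alpha:.2e}, about {n:.0f} per group")
```

Under these assumptions, correcting for 850k tests raises the required group size several-fold compared with a single test, which is why small case-control samples so often lose significance. Less conservative procedures, such as false discovery rate control, soften the penalty but do not remove it.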
For obvious reasons, studies are usually performed on easy-to-obtain material, for example blood or saliva. However, DNA methylation is in part cell-type specific, as it contributes to specific cellular phenotypes [12]. This implies that data obtained from blood or buccal samples might not represent the situation in our target tissue, somewhere in the brain. Ideally, one needs to test the results in the target tissue as a proof of principle (e.g., Riese et al. [14]). As an alternative, it has been proposed to focus on “correlated regions of systemic interindividual variation” (CoRSIVs), which do not depend on tissue type [6].
Last but not least, it is important to consider developmental timing and age effects, as these factors can significantly influence the onset and progression of psychiatric symptoms. For instance, epigenetic changes should be examined before symptom onset, which can provide valuable insights into early biological markers of mental health issues. Population-based cohorts with longitudinal assessments of neurobiological changes are essential for this. By following the same individuals from birth, researchers can map the relationship between neurobiological changes, such as epigenetic modifications, and psychological symptoms as they unfold over time, allowing for a more dynamic understanding of how these changes contribute to mental health. Moreover, focusing on broad polyepigenetic signatures can enhance our understanding of complex epigenetic interactions.
Big data in CAP
CAP research may generate large data sets that encompass environmental, neurobiological, and (epi-) genetic data, collected from a wide variety of sources and formats over various time scales [1, 13]. This massive volume of high-variety data, often referred to as Big Data, serves as a key resource for understanding complex patterns and multifactorial relationships. However, such data can no longer be effectively analyzed using simple methods, such as regression equations [3].
What holds promise in this context are artificial intelligence (AI) systems built upon “data-driven models”, which are machine learning algorithms able to learn and draw inferences from patterns in data (e.g., Wulff & Marschollek [16]). However, AI algorithms are only as good as the data they are trained on.
So, what does an AI system need to integrate CAP Big Data and make it of use for further research and clinical decisions?
First, these data sets have to be digitally captured and managed precisely. In CAP, diagnosis draws not only on measurable patient data but also on contextual factors such as the family, the societal environment, and life events, amongst others. Furthermore, “vague data”, such as speech, feelings, or mindsets, are very important but difficult to measure. Ideally, AI systems would incorporate all of this information. As Big Data technologies advance, from long-term storage to real-time data processing, there are more opportunities to reuse data once they have been collected. In addition, the increasing adoption of wearable technologies, such as smartwatches, on the one hand, and of patient-reported outcome measures and mobile ecological momentary assessments on the other, facilitates the collection of rich data sets in everyday life and paves the way for using this information to support remote diagnostic capabilities in CAP.
Second, before routine data are used as a training basis for AI algorithms, careful data engineering needs to take place. In particular, terminology and ontology enrichment is required because similar terms are used in varying ways in CAP. Data need to be carefully managed, semantically enriched, and standardized. As standards in health data management become more widely adopted, data will become easier to harmonize than they are now. Examples include interoperability standards for clinical data and communication, such as Health Level 7 Fast Healthcare Interoperability Resources (HL7 FHIR), the Observational Medical Outcomes Partnership (OMOP) common data model, and open-standard electronic health records, as well as advanced terminologies such as the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT).
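As a toy illustration of terminology harmonization, the sketch below maps differently worded local diagnosis labels onto one shared concept and sets aside anything it cannot map. Every label, field name, and concept identifier in it is hypothetical; the identifiers are placeholders and not real SNOMED CT, ICD, or OMOP codes. A real pipeline would rely on a terminology service rather than a hand-written dictionary.

```python
# All labels and concept identifiers below are made up for illustration;
# they are not real SNOMED CT, ICD, or OMOP codes.
LOCAL_TO_CONCEPT = {
    "adhd": "CONCEPT:attention-deficit-hyperactivity",
    "adhs": "CONCEPT:attention-deficit-hyperactivity",  # German abbreviation
    "attention deficit hyperactivity disorder": "CONCEPT:attention-deficit-hyperactivity",
    "depression": "CONCEPT:depressive-episode",
    "depressive episode": "CONCEPT:depressive-episode",
}


def normalize(term: str) -> str:
    """Lower-case a free-text label and collapse separators and extra whitespace."""
    return " ".join(term.lower().replace("-", " ").replace("/", " ").split())


def harmonize(records):
    """Split records into mapped ones (with a 'concept' field) and unmapped ones."""
    mapped, unmapped = [], []
    for record in records:
        concept = LOCAL_TO_CONCEPT.get(normalize(record["diagnosis"]))
        if concept is None:
            unmapped.append(record)  # keep for manual review rather than guessing
        else:
            mapped.append({**record, "concept": concept})
    return mapped, unmapped


records = [
    {"site": "A", "diagnosis": "ADHD"},
    {"site": "B", "diagnosis": "Attention-deficit  hyperactivity disorder"},
    {"site": "C", "diagnosis": "ADHS"},
    {"site": "C", "diagnosis": "low mood"},
]
mapped, unmapped = harmonize(records)
print(len({r["concept"] for r in mapped}), "concept(s);", len(unmapped), "unmapped")
```

Routing unmapped labels to manual review, instead of forcing a nearest match, keeps ambiguous terms from silently entering the training data under the wrong concept.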
Third, we will face the challenge of dealing with multi-level, multi-dimensional, and potentially unbalanced data accompanied by heterogeneity, uncertainty, and missingness. Recent advances in deep learning have made it possible to handle such data sets. With the rise of generative AI algorithms, such as large language models, unstructured data have become accessible, processable, and analyzable. Foundation models are also promising beyond pure language models, in particular for addressing equifinality and multifinality, since they can process and integrate data from multiple modalities, are highly scalable, and provide extensive contextual learning features. Improvements in variational autoencoders, a class of generative models that learn probabilistic mappings between input data and a structured latent space, offer promising techniques for encoding and disentangling different pathways and combinations of risk factors and outcomes. For unbalanced data, e.g., with respect to gender, age, or the occurrence of symptoms and diagnoses, it is important to raise awareness of possible biases in the routine data in order to prevent discriminatory decisions. Technologies can be employed to identify and address these issues in the original data sets before further processing.
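One simple way to surface imbalance before training is to tabulate how each subgroup is represented. The sketch below does this for a single attribute and derives inverse-frequency weights as one possible response. The field name, age bands, and 20% cutoff are arbitrary choices for illustration, and reweighting is just one option among several; it is not a method proposed in the text and does not correct biases in how the data were measured.

```python
from collections import Counter


def group_shares(records, key):
    """Return each group's share of the data set for the given attribute."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def flag_underrepresented(shares, min_share=0.2):
    """List groups whose share falls below min_share (an arbitrary example cutoff)."""
    return sorted(group for group, share in shares.items() if share < min_share)


def inverse_frequency_weights(shares):
    """Weight each group by 1 / (number of groups * share).

    With these weights, the average weight over all records equals 1.
    """
    k = len(shares)
    return {group: 1.0 / (k * share) for group, share in shares.items()}


# Hypothetical toy cohort: ten records across three age bands.
records = (
    [{"age_group": "6-11"}] * 6
    + [{"age_group": "12-14"}] * 3
    + [{"age_group": "15-17"}] * 1
)
shares = group_shares(records, "age_group")
weights = inverse_frequency_weights(shares)
print("underrepresented:", flag_underrepresented(shares))
print({group: round(w, 2) for group, w in weights.items()})
```

The same audit can be repeated for gender or for the occurrence of specific symptoms and diagnoses, and for combinations of attributes, where imbalance is often more pronounced than in any single attribute.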
Conclusions
To integrate our recently acquired knowledge of risk and protective environmental factors, along with their impact at the epigenetic level, into our understanding of CAP pathomechanisms and clinical decision-making, fundamental clinical and research issues must be addressed.
First, we strongly recommend that more CAP studies apply RDoC. This will help overcome the obstacles of non-biologically based clinical labels and connect dimensions of functioning with environmental factors and epigenetic data.
Second, it is essential to conduct better-powered, harmonized, multi-cohort studies, which are needed to adequately capture the development of CAP psychopathologies as well as the outcome of environmental exposures and the time-varying nature of DNA methylation. Such studies will provide more robust and generalizable findings.
Third, AI is capable of working with Big Data generated in CAP to uncover complex patterns and multifactorial relationships. However, for AI to be applied, the data must be in a form that AI can work with: they need to be digitally recorded, managed, and standardized. The importance of routine data and of so-called ‘vague data’, such as speech, emotions, and mindsets, should not be underestimated, which makes a strong case for their digital capture. Therefore, more studies should incorporate wearable technologies, record speech, and use mobile ecological momentary assessments.
By implementing these steps, we can one day move from expert-based to research- and evidence-based decision making in CAP.
