Deep-embedded clustering by relevant scales and genome-wide association study in autism

Fumihiko Ueno; Ippei Takahashi; Hisashi Ohseto; Tomomi Onuma; Akira Narita; Taku Obara; Mami Ishikuro; Keiko Murakami; Aoi Noda; Fumiko Matsuzaki; Hirohito Metoki; Gen Tamiya; Shigeo Kure; Shinichi Kuriyama; Mohith Manjunath; Mohith Manjunath; Mohith Manjunath

PMC · DOI:10.1371/journal.pone.0322698·May 29, 2025

Deep-embedded clustering by relevant scales and genome-wide association study in autism

Fumihiko Ueno, Ippei Takahashi, Hisashi Ohseto, Tomomi Onuma, Akira Narita, Taku Obara, Mami Ishikuro, Keiko Murakami, Aoi Noda, Fumiko Matsuzaki, Hirohito Metoki, Gen Tamiya, Shigeo Kure, Shinichi Kuriyama, Mohith Manjunath, Mohith Manjunath, Mohith Manjunath

PDF

Open Access

TL;DR

This study explores the genetic basis of autism by clustering patients and performing genome-wide association studies, finding multiple significant loci linked to autism-related genes.

Contribution

The novel use of cluster-based GWAS in autism research identifies new genetic loci that may be relevant to ASD.

Findings

01

Cluster-based GWAS identified 27 chromosomal loci associated with autism at p < 5.0 × 10⁻⁸.

02

Several identified loci are near genes previously linked to autism spectrum disorder.

03

Results were not fully replicable across different populations, suggesting potential technical variability.

Abstract

Autism spectrum disorder (ASD) presents with heterogeneous phenotypic and genetic characteristics. Despite investigation into the molecular mechanisms underlying ASD, its etiology remains elusive. In our previous investigation within the Simons Simplex Collection (SSC), we noted increased signals through a genome-wide association study (GWAS) by clustering patients with ASD and reducing the sample size. This study seeks to validate our previous study in a different population, the Simons Foundation Powering Autism Research for Knowledge (SPARK) population, while probing further into the genetic architecture of ASD. We examined data from 2,079 white male subjects and 875 unaffected SPARK siblings. Our methodology encompassed cluster analyses, followed by traditional GWAS and cluster-based GWAS (cGWAS). No significant associations were observed in the conventional GWAS when comparing all…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Homo sapiens(human · species)

Diseases3

autism spectrum disorder ASD Autism

Figures3

Click any figure to enlarge with its caption.

Funding1

—Ministry of Education, Culture, Sports, Science and Technology (MEXT) KAKENHI

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAutism Spectrum Disorder Research · Genomic variations and chromosomal abnormalities · Genetics and Neurodevelopmental Disorders

Full text

Introduction

Autism spectrum disorder (ASD) is a neurodevelopmental disorder primarily characterized by communication difficulties and repetitive behaviors [1]. Despite efforts to understand the molecular mechanisms underlying ASD, its etiology remains unclear [2]. Evidence suggests that genetic factors strongly contribute to the risk of ASD development [3]. For instance, identical twins exhibit a considerably higher ASD concordance rate of 92% compared with 10% in dizygotic twins [4]. Additionally, the risk ratio for ASD recurrence between siblings is reported to be 22 [5].

Previous genome-wide association studies (GWASs) have identified numerous genetic variants associated with ASD [6,7]. The link between these genetic variants and a single disease can be understood through a polygenic model, wherein the effect of each variant is small but collectively contributes to disease development [8,9]. In the GWAS of a disease, a larger sample size generally aids in identifying more signals, whereas reducing the sample size decreases the number of identified signals. If a GWAS is conducted with a specific sample size and no significant signals are found, it becomes increasingly challenging to identify any signals if the sample size is further reduced by dividing the patients. However, according to a simulation study, the power of a GWAS can be enhanced by dividing patients into more homogeneous populations, regardless of the sample size [10]. In such cases, it may be possible to identify certain signals by clustering patients with similar phenotypes and investigating genetic factors.

In our previous study, we detected more signals by clustering patients with ASD and decreasing the sample size [11]. This finding sharply contrasts the findings of several GWASs and requires careful interpretation and considerable follow-up examination. Thus, we sought to validate our findings using a different dataset in the present study. We aimed to explore the genetic structure of ASD by categorizing patients into clusters based on phenotypic variables and conducting a GWAS (specifically, cluster-based GWAS [cGWAS]), as conducted in our previous study. Moreover, we increased the sample size [12] and introduced a deep-embedded clustering (DEC) algorithm [13,14].

Materials and methods

Participants

This study adhered to the guidelines of the Declaration of Helsinki [15] and all other relevant guidelines. The Institutional Review Board of Tohoku University Graduate School of Medicine (2020-1-826) approved our protocol, and written informed consent was obtained from participants using the Simons Foundation Autism Research Initiative (SFARI) for the SPARK study, which began recruitment on 21/04/2016, and is ongoing [12]. The data were initially obtained on 03/12/2019, for a different study and were accessed for research purposes at 01/10/2021. In SPARK, phenotypic data and biospecimens were collected remotely, enabling participants to fulfill the study requirements online at a convenient time. Individuals in the United States with a professional ASD diagnosis, their parents, and unaffected siblings were eligible to participate in SPARK. Phenotype information and ASD diagnoses in SPARK are self- or parent-reported, and the Interactive Autism Network suggests that parent-reported diagnoses of ASD are highly valid [16].

Datasets

We used phenotypic variables, background history, and genotypic data from the SPARK database, which was publicly released in October 2017 and is directly available from SFARI [12]. From the SPARK WES1 (27 K) dataset, we used data from 2,685 affected white male probands for whom data from all three tests, i.e., Developmental Coordination Disorder Questionnaire (DCDQ) [17], RBS-R [18], and Social Communication Questionnaire (SCQ) [19], were available, and 891 unaffected male proband siblings for whom genotype data were generated by the Illumina Infinium Global Screening Array (GSA) v1.0. ASD is consistently more prevalent in males [2]. Additionally, sex-linked etiology and susceptibility have been reported in autism [20]. Therefore, we focused only on males to exclude sex-related heterogeneity. Of the 2,685 affected probands, we excluded 606 who were biologically related to their unaffected male siblings to eliminate bias due to familial relatedness. Consequently, 2,079 probands and 891 unaffected male siblings were eligible for further analysis. To exclude participants whose ancestries significantly varied, principal component analyses were performed on genotype data using EIGENSOFT version 7.2.1 [21]. Based on these analyses, we excluded 33 individuals whose data points exceeded six standard deviations for principal components 1 or 2. Therefore, 2,062 probands and 875 unaffected male siblings were eligible for clustering and GWAS.

Clustering

We conducted cluster analyses using the phenotypic variables of DCDQ (15 items), SCQ (40 items), and RBS-R (43 items) scoring; age at initial registration in months; self-reported ethnicity; dominant hand; and history of medication, biomedical intervention (e.g., diet, alternative medicine, and supplements), and intensive behavioral intervention (e.g., Applied Behavior Analysis, Verbal Behavior, Pivotal Response Treatment). Missing data were imputed using the mean value of each variable, and all categorical data were transformed into dummy variables.

We applied DEC [13,14], which uses a deep-learning algorithm to conduct cluster analysis using phenotypic variables. The DEC algorithm requires the specification of several parameters, including the number of clusters (k), iterations, epochs, and network dimensions. For this study, we predetermined k to be 40, assuming that ASD consists of hundreds of subgroups [6] and considering the statistical power derived from sample-size calculations [22]. Other hyperparameters for DEC included a batch size of 256,300 pre-training epochs, 400 maximum iterations, 30 update intervals, and 0.001 tolerance for the stopping criterion. These analyses were performed using the scikit-learn toolkit in Python 2.7.

Clustering serves as a technique for exploratory data analysis where the validity of clustering outcomes can be assessed using external knowledge, such as the purpose of segmentation [23]. Several methods have been proposed to predefine the number of clusters (k), including visual examination, likelihood, and error-based approaches. However, it is noteworthy that these methods may not always yield mutually consistent results [24]. Although metrics exist for assessing the quality of clusters [25], the number of clusters should align with the research purposes. Therefore, the inflation factor (λ) of quantile-quantile plots using the logarithm of the p-value to base 10 (−log_10_p) for each cluster was calculated and used as validity indicators.

Genotype data and quality control

We used the SPARK WES1 (27 K) genotypic dataset, in which the probands and unaffected male siblings were previously genotyped (https://gpf.sfari.org/hg38/datasets/SFARI_SPARK_WES_1_CONSORTIUM/dataset-description). We used the dataset genotyped by Illumina GSA v1.0 containing 642,824 probes. We excluded SNPs with a minor allele frequency < 0.01, call rate < 0.95, and Hardy–Weinberg equilibrium test p < 0.000001.

We independently imputed the SPARK WES1 (27 K) genotypic data from the phenotype data using the Michigan imputation server. The human genome reference build of the genotypic data was converted from hg38 to hg19 using LiftOver, a tool provided by the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgLiftOver). On the Michigan imputation server, we selected “HRC r1.1 2016 (GRCh37/hg19)” for Reference Panel, “0.3” for Rsq Filter, “Eagle v2.4” for Phasing, and “Other/Mixed” for population options, and performed quality control and imputation.

After genotype imputation, the SPARK WES1 (27 K) genotypic dataset contained 33,717,335 SNPs on the autosomes.

Statistical analysis

As a preliminary study, we conducted a conventional GWAS comparing all patients with all controls in the entire SPARK WES1 (27 K) genotypic dataset, with 2,062 male probands and 875 unaffected male siblings. The control group did not include the male siblings of the affected participants. Unaffected male siblings were selected from other proband families, including siblings of probands who did not respond to all survey forms and siblings whose female siblings were probands. In the second step, we conducted a cGWAS in each subgroup of cases, which were divided using the DEC algorithm [13,14] and controls. A logistic regression model was used to calculate the additive allele dosage effect.

The GWAS was performed using the PLINK software package [26]. The reported SNPs were annotated using ANNOVAR [27]. Manhattan plots were generated using R software (version 4.1.0; R Foundation for Statistical Computing, Vienna, Austria) [28]. For GWAS, no covariates were adjusted for in this study. Owing to the focus of the present study on males, the sex of children was not adjusted for, whereas age was not included as a covariate in the GWAS because it was used as a variable for clustering. Additionally, variables related to genetic architecture were not corrected for in the GWAS because populations with considerable genetic heterogeneity were excluded based on the SNP array data analysis via principal component analysis. The GWAS results obtained were clumped using linkage disequilibrium, and the locus with the lowest p-value was selected.

Code availability

The computer code used to generate the results is available from the authors upon request. All computer code access inquiries should be sent to Shinichi Kuriyama ([email protected]).

Results

Cluster-based GWAS

In a preliminary study, we conducted a conventional GWAS comparing all patients and controls using the SPARK WES1 (27 K) genotypic dataset and found no significant associations.

The DEC algorithm requires researchers to specify k clusters (k). The average inflation factor λ for a cGWAS with k = 40 was 0.9. Empirically, a threshold for λ considered safe to minimize the risk of false positives is a value less than 1.05 [29]. Therefore, we considered λ < 1.05 as an indicator of successful clustering. We considered cGWAS using cluster analysis with k = 40 as the most appropriate approach for the present dataset. Fixing the hyperparameter k of DEC to 40 eventually led to dividing the dataset into 39 clusters.

The characteristics of each cluster are presented as a heatmap in Table 1. For example, Cluster 5 had relatively high DCDQ [17], Repetitive Behaviors Scale-Revised (RBS-R) [18], average SCQ scores [19], no underlying disease, and a high rate of interventional treatment.

Table 1: Characteristics of the clusters.

Gene interpretation

We observed 27 chromosomal loci that satisfied the threshold of p < 5.0 × 10 − 8. The results of each GWAS analysis with a sample size of more than nine cases following clustering are shown in Table 2. Several loci were identified either within or near the genes linked to the Human Gene module of the SFARI Gene scoring system [6], including RFX3 (score 1, Rare Single Gene Mutation, Syndromic) in Cluster 15; HCN1 (score S, Rare Single Gene Mutation, Genetic Association) in Cluster 18; CSMD1 (score 3, Rare Single Gene Mutation, Genetic Association) in Cluster 23; HIVEP3 (score 2, Rare Single Gene Mutation, Genetic Association) in Cluster 24; and CNTNAP2 (score 2S, Rare Single Gene Mutation, Syndromic, Genetic Association) in Cluster 30. The SFARI Gene scoring system ranges from “Category 1,” which indicates “High Confidence,” to “Category 3,” which denotes “Suggestive Evidence.” Genes associated with ASD-related syndromic disorders are classified under a distinct category, labeled “#S” (e.g., 2S and 3S). Meanwhile, rare single gene variants, disruptions/mutations, and submicroscopic deletions/duplications associated with ASD fall under the category of “Rare Single Gene Mutation.”

Table 2: Association table of the cluster-based genome-wide association study.

Alongside genes from the Human Gene module of the SFARI Gene, our findings also included several other crucial genes previously reported to be associated with ASD and related disorders, listed as follows (Table 3): PLA2G4A in Cluster 28, STMN4 in Cluster 19, PMP22 in Cluster 24, and ADAM12 in Cluster 21, previously associated with ASD [37,47,55–57,61,67,69,76,84]; NCOR2 in Cluster 9, RFX3 in Cluster16, and CEP112 in Cluster 39, previously associated with attention deficit hyperactivity disorder [37,89,90]; UMAD1 in Cluster 17, HCN1 in Cluster 19, PACERR, and KCNAB1in Cluster 28, previously associated with epilepsy [45,46,75,80,81]; KCNAB1 in Cluster 28, previously associated with mental retardation [80]; ADAM12 in Cluster 21, previously associated with Down’s syndrome [57]; PLA2G4A in Cluster 28, CSMD1 in Cluster 24, and ADAM12 in Cluster 21, previously associated with schizophrenia [57,63,64,77]; NCOR2 in Cluster 9 and TAFA5 in Cluster 23, previously associated with depressive disorder [39,59]; HCN1 and MRPS30 in Cluster 19, SMAD3 in Cluster 23, CSMD1 in Cluster 24, and HIVEP3 and NUBPL in Cluster 25, previously associated with Parkinson’s disease [49,51,60,65,70,71,74]; GMPS in Cluster 28 and THBS2 and WDR27 in Cluster 8, CSMD1 in Cluster 24, ADAM12 in Cluster 21, SCARB1 in Cluster 9, AATF in Cluster 31, TAFA5 in Cluster 23, and HCN1 in Cluster 19, previously associated with Alzheimer’s disease [33,35,40,49,57,62,79,83]; RPA3 in Cluster 17, previously associated with Machado–Joseph disease [42]; NUFIP2 in Cluster 39, previously associated with microcephaly [88]; NCOR2 in Cluster 9, previously associated with spinal muscular atrophy [38]; and PMP22 in Cluster 24 previously associated with neuropathy [68].

Table 3: Annotation of the genome-wide significant genes in the present cluster-based genome-wide association study primarily in relation to autism spectrum disorder.

In addition, our findings incorporated some significant genes linked to ASD symptoms identified in previous studies (Table 3), including WDR27 in Cluster 8, previously associated with sleep disturbance [34], and HCN1 in Cluster 19, previously associated with post-traumatic stress disorder [48]. Furthermore, we observed signals in important genes associated with ASD pathways (Table 3), such as RGS3 in Cluster 24, encoding a regulator of G protein signaling [66]; NUBPL in Cluster 25, previously associated with mitochondrial disease [73]; and OR13F1 in Cluster 8, encoding an olfactory receptor [36].

We further observed signals in various genes known to be mutated in cancer (Table 3), including ETV1 in Cluster 7 and HIVEP3 in Cluster 25, previously associated with prostate cancer [31,72]; RFX3-AS1 in Cluster 16, MRPS30-DT and MRPS30 in Cluster 19, and MIR3201 in Cluster 23, previously associated with breast cancer [41,50,52,58]; GMPS in Cluster 28 and ZNF93 in Cluster 37, previously associated with ovarian cancer [78,86]; MIR548H4 in Cluster 19, AATF in Cluster 35, and APCDD1L-DT in Cluster 35, previously associated with lung cancer [53,82,85]; ARL4A in Cluster 7 and RPA3 in Cluster 17, previously associated with glioma [30,43]; ETV1 in Cluster 7, previously related to cranial germinomas [32]; MIR548H4 in Cluster 19, previously associated with head and neck squamous cell carcinoma [54]; RPA3 in Cluster 17, previously associated with gastric cancer [44]; and ZNF682 in Cluster 37, previously associated with Barrett’s esophagus [87].

Replication study

We previously conducted a cGWAS using a dataset from the SSC [11]. We considered the agreement between the SSC and SPARK results to be evidence of successful replication. After SSC data analysis, we found statistically significant single-nucleotide polymorphisms (SNPs) in the cGWAS for the following genes: CDH5, CNTN5, CNTNAP5, DNAH17, DPP10, DSCAM, FOXK1, GABBR2, GRIN2A5, ITPR1, NTM, SDK1, SNCA, SRRM4, and ZNF678. The proteins encoded by CNTNAP5 and ZNF678 exhibited CNTNAP and ZNF domains, respectively. In this study, proteins encoded by CNTNAP2 in Cluster 30 and ZNF93 and ZNF682 in Cluster36 also contained the CNTNAP and ZNF domains, respectively.

Discussion

It is implausible that decreasing the sample size would result in the identification of more significant signals in a GWAS. Therefore, we conducted the present follow-up study. The lack of reproducibility suggests that the cGWAS results may be false positives owing to some technical factors. However, the fact that several signals emerge following clustering and that many of the suggested gene regions are associated with ASD and related diseases may provide supporting evidence for cGWAS validity.

The first point is that the SSC study results were not entirely replicated. In the present and previous studies, only signals in genes encoding proteins with CNTNAP and ZNF domains were consistently observed. Although the variables used in the clustering were different, the limited replication strongly suggests that the signals emerging from the cGWAS approach may be false positives due to technical reasons rather than those inherent in the ASD subgroup.

This study has some limitations. First, although the SPARK cohort is one of the largest genetic cohorts focused on ASD, the limited statistical power of cGWAS to consider variants adequately due to the small sample size cannot be overlooked. It remains uncertain whether clusters with small sample sizes truly represent distinct groups. Moreover, interpreting odds ratios in clusters with insufficient cases was not feasible. Future clarification on the validity of these clusters may be possible with the availability of larger cohorts dealing with ASD and encompassing richer phenotypic information.

Second, the optimal selection of variables, algorithms, cluster numbers, and hyperparameters employed in this study remains unclear. Phenotypic variables chosen included DCDQ, SCQ, and RBS-R scores; age at initial registration in months; ethnicity; dominant hand; and history of medication, biomedical intervention, and intensive behavioral intervention. ASD manifests numerous other symptoms and characteristics [2,91,92]. Therefore, it is essential to carefully examine and narrow down a wide range of variables, considering their relationships in future studies. We selected the DEC algorithm [13,14] for its ability to simultaneously learn feature representations and cluster assignments using deep neural networks, representing one of the most contemporary techniques. Although DEC proves useful, the emergence of alternative algorithms is plausible in the future. A sensitivity analysis was conducted for the set number of clusters and hyperparameters. Although the optimality of the cluster number and hyperparameters in this study remains uncertain, our sensitivity analysis suggests a reasonable degree of validity in the methodology employed. Regarding the number of clusters, we observed that fewer clusters resulted in fewer detected variants, whereas increasing the number of clusters led to the detection of more variants. This is consistent with the hypothesis that cluster-specific variants can be identified by dividing patients with ASD into more homogeneous clusters based on various phenotypes. Changes in hyperparameters, such as batch size, did not significantly alter the results.

As indicated above, the results of the present study suggest that the signals emerging from the cGWAS approach may not be those originally possessed by ASD subgroups. However, some of our results indicate that, to a certain extent, cGWAS could potentially enhance the understanding of ASD pathogenesis. Compared with our previous study [11], two factors were replicated in the present study, which might indicate the validity of the cGWAS. First, several signals emerged due to clustering. This is somewhat consistent with a simulation study, suggesting that more signals are obtained by dividing into more homogeneous clusters [10], although it is unclear whether they were divided into more homogeneous clusters. Second, several gene regions have been suggested to be associated with ASD and related diseases. As suggested by previous studies, along with genes directly associated with ASD, we observed several other genes associated with ASD-related diseases or symptoms [33–36,37,39,40,42,46,48,49,51,57,59,60,62–66,70,71,73–75,77,79–81,83,88–90]. The ASD phenotype overlaps with other conditions, such as attention deficit hyperactivity disorder, epilepsy, mental retardation, Down’s syndrome, schizophrenia, depressive symptoms, Parkinson’s disease, Alzheimer’s disease, Machado–Joseph’s diseases, and post-traumatic stress disorders. Therefore, genes associated with these diseases were also identified in the current study. For example, some Parkinson’s disease-related gene signals may be interpreted as follows: the presence of a certain gene mutation may be observed as an ASD-like symptom in childhood and diagnosed as ASD; however, with aging and cumulative exposure to environmental factors, symptoms may change slightly to a Parkinson’s disease-like phenotype and be diagnosed as Parkinson’s disease. Alternatively, ASD may not be diagnosed during childhood, but Parkinson’s disease is diagnosed in old age. Sleep disturbances and microcephaly are frequently observed in patients with ASD. Dysregulation of G protein signaling, or mitochondrial dysfunction, has also been reported as an etiology of ASD [66,73]. Almost all statistically significant genes in the present study revealed using cGWAS were associated with ASD, its symptoms, and/or its pathways. These findings imply that clustering may be effective for identifying subgroups that share similar underlying disease causes.

Several statistically significant SNPs identified in the present study are reportedly associated with cancer. However, recent research has shown that there is a significant overlap between ASD and cancer risk genes [30–32,41,43,44,50,52–54,58,78,82,85–87]. For instance, ETV1 is associated with prostate cancer; RFX3-AS1, MRPS30, MRPS30-DT, and MIR3201 are associated with breast cancer; MIR548H4, AATF, and APCDD1L-DT are associated with lung cancer. These types of cancer are strongly associated with ASD [93,94]. Regarding the genes involved in ASD and cancer, the list of cancer-associated genes identified in some clusters was almost identical to that of ASD-associated genes. Therefore, it is problematic to explain these findings solely in terms of false positives.

In addition to the two aforementioned replications, the fact that some agreement exists between the characteristics of the clusters and those inferred from gene expression may support the validity of cGWAS. The unique characteristics of each cluster obtained in this study have been illustrated in a heatmap (Table 1). For instance, Cluster 8 was associated with relatively lower DCDQ scores than those of other clusters but higher SCQ and RBS-R scores, with genes such as THBS2, WDR27, and ORF13F1 being associated with Alzheimer’s disease. The characteristics of this cluster, wherein repetitive behavior and social communication deficits were more prevalent and motor skills were relatively preserved, are consistent with some features of Alzheimer’s disease inferred from the functions of these genes. Meanwhile, Cluster 25 exhibited relatively lower RBS-R scores than those of the other clusters but higher DCDQ and SCQ scores, with HIVEP3 and NUBPL being associated with Parkinson’s disease. The characteristics of this cluster were consistent with those of the disease features. In Cluster 7, no major characteristics were observed for DCDQ, SCQ, or RBS-R scores, whereas ARL4A and ETV were associated with cancer. The weakness of ASD characteristics in this cluster may be attributed to their association with cancer. Similar relationships were observed for other clusters. These results suggested an association between cluster characteristics and gene function; thus, the clusters obtained were functionally valid. However, it is crucial to note that the characteristics of clusters may not necessarily be recognizable by humans. Because artificial intelligence extracts features inherent in a combination of many variables, the clusters formed, although being more homogeneous, may not always be easily comprehensible to humans. In other words, artificial intelligence may uncover clusters that humans have not been able to discover. In the future, it will be necessary to define the clusters discovered by artificial intelligence.

Therefore, it is difficult to ascertain whether the present study identified a definitive subgroup, and it would be premature to draw conclusions regarding whether cGWAS has effectively elucidated the pathogenesis of ASD. However, this study revealed that similar to the SSC study, several signals emerge due to clustering, and many of the gene regions suggested herein are associated with ASD and related diseases. Therefore, completely ruling out cGWAS may be unwarranted, and further research is required. We plan to apply cGWAS to other datasets of ASD and other diseases. It may not be entirely futile for more researchers to conduct genetic searches based on cGWAS or ASD subgrouping. In doing so, ensuring reproducibility across cohorts, and using interpretable models less prone to overfitting is paramount, especially in small cohorts, for applying robust modeling strategies that account for heterogeneity. Moreover, for GWASs that already result in a large number of signals, it may be possible to determine whether the signals can be separated by dividing them into clusters. ASD subgroup identification using a proper classification may be important. This may partially lead to the development of precision medicine for ASD and other multifactorial diseases.

Bibliography94

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1American Psychiatric Association. Diagnostic and statistical manual of mental disorders. 5th ed. DSM-5. Washington, District of Columbia: American Psychiatric Publishing; 2013. doi: 10.1176/appi.books.9780890425596 · doi ↗
2Lord C, Elsabbagh M, Baird G, Veenstra-Vanderweele J. Autism spectrum disorder. Lancet. 2018;392:508–520. 10.1016/S 0140-6736(18)31129-2 30078460 PMC 7398158 · doi ↗ · pubmed ↗
3Geschwind DH, State MW. Gene hunting in autism spectrum disorder: on the path to precision medicine. Lancet Neurol. 2015;14:1109–1120. 10.1016/S 1474-4422(15)00044-7 25891009 PMC 4694565 · doi ↗ · pubmed ↗
4Bailey A, Le Couteur A, Gottesman I, Bolton P, Simonoff E, Yuzda E, et al. Autism as a strongly genetic disorder: evidence from a British twin study. Psychol Med. 1995;25:63–77. doi: 10.1017/s 0033291700028099 7792363 · doi ↗ · pubmed ↗
5Lauritsen MB, Pedersen CB, Mortensen PB. Effects of familial risk factors and place of birth on the risk of autism: a nationwide register-based study. J Child Psychol Psychiatry. 2005;46: 963–971. doi: 10.1111/j.1469-7610.2004.00391.x 16108999 · doi ↗ · pubmed ↗
6Gene Scoring Module. SFARI gene. Available from: https://gene.sfari.org/database/gene-scoring/
7Grove J, Ripke S, Als TD, Mattheisen M, Walters RK, Won H, et al. Identification of common genetic risk variants for autism spectrum disorder. Nat Genet. 2019;51:431–444. doi: 10.1038/s 41588-019-0344-8 30804558 PMC 6454898 · doi ↗ · pubmed ↗
8Visscher PM, Goddard ME, From RA. From R.A. Fisher’s 1918 Paper to GWAS a Century Later. Genetics. 2019;211:1125–1130. doi: 10.1534/genetics.118.301594 30967441 PMC 6456325 · doi ↗ · pubmed ↗