Variability of transposable elements in six genetic isolates from North-Eastern Italy and their relationship with alcohol consumption, tobacco use and BMI
Giorgia Modenini, Giacomo Mercuri, Paolo Abondio, Giuseppe Giovanni Nardone, Aurora Santin, Paola Tesolin, Beatrice Spedicati, Alessandro Pecori, Giulia Pianigiani, Maria Pina Concas, Giorgia Girotto, Paolo Gasparini, Alessio Boattini, Massimo Mezzavilla

TL;DR
This study explores how transposable elements in the genome vary among six isolated Italian villages and their links to behaviors like alcohol and tobacco use, as well as BMI.
Contribution
The study identifies polymorphic transposable elements in genetic isolates and links them to behavioral traits and BMI, suggesting their role in phenotypic variation and disease susceptibility.
Findings
12,709 polymorphic transposable elements were identified in 586 individuals from six isolates.
Several transposable elements in constrained genes were significantly associated with tobacco use, alcohol consumption, or BMI variations.
The isolates cluster together genetically and show drift-induced ancestral components despite being related to other European populations.
Abstract
Half of the human genome is derived from Transposable Elements (TEs), among which Alu, LINE-1 and SVA are particularly represented. Germline transposition of TEs generates polymorphisms between individuals and may be used to study association with phenotypes and inter-individual differences. Italy presents an increased number of isolated villages compared to other European groups, and these isolates provide a desirable study subject to help understanding the genetic variability of the Italian peninsula. Therefore, we focused on the relationship between polymorphic TEs, behavioral traits (tobacco use and alcohol consumption), and Body Mass Index (BMI) variations, which could lead to an increased risk of developing addiction-related or metabolic diseases. We identified 12,709 polymorphic TEs in 586 individuals from six isolates: classical population genetics analyses showed that while…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3- —Università degli Studi di Padova
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsChromosomal and Genetic Variations · Genetic Mapping and Diversity in Plants and Animals · Genomic variations and chromosomal abnormalities
Introduction
The Italian peninsula, due to its complex population structure, could play an important role in the understanding of the genetic diversity of current populations, being the natural crossroad for human migrations across the Mediterranean since prehistoric periods. These migration patterns left a tangible mark on present-day Italians, revealing a heterogeneous network of genomic landscapes across the peninsula, with North Italian groups being more closely related to Western/Eastern European populations and a progressively increasing genetic connection with Northern African and Middle Eastern populations as we move southwards [1]. On top of this clinal variation across the peninsula, the natural variety of environments [2] provoked a series of local adaptive events that determined, among other factors, a differential disease susceptibility of Italian subpopulations [1]. A refined understanding of these local events would improve our knowledge of human diversity as a whole, and on a more practical level allow us to provide more ad hoc medical care and measures to particularly susceptible subpopulations.
The underlying genetic variability of Italy remains under-sampled and underrepresented, with available human genome reference datasets such as the 1KGP, HGDP, and SGDP only sampling three populations for the whole peninsula: Tuscans (TSI, 113 individuals), Bergamo (14) and Sardinians (28), a notion that only worsens when considering that Italy presents an increased number of historically isolated villages and subpopulations when compared to other European groups [3, 4], most of which remain uncharacterized.
These groups provide a desirable study subject to understand the Italian genetic variability: population isolates are characterized by small effective population sizes (Ne), which result in a decreased variability and stronger genetic drift effects, potentially increasing the frequency of variants that are rare or absent elsewhere and aiding at the discovery of novel rare variant signals underpinning complex traits such as medical risks and susceptibilities [5]. Population isolates tend not only to be genetically homogenous but are also characterized by an elevated diversity when compared to neighboring populations and their source population [3], because of geographical and/or cultural barriers that are necessary for the formation of the isolate in the first place. For these reasons, isolates can be useful tools for genome-wide association studies [6].
However, most of the available research on these populations is based almost exclusively on SNP data, while little work was done using other types of genetic markers. For instance, information about the variability of Transposable Elements (TE), despite them being a primary component of the human genome, has become accessible only in recent years, thanks to the availability of whole-genome sequencing data and in particular to the development of new tools for their detection and genotyping [7–9]. When TEs transpose in the germline, they can create novel inheritable insertions, thereby generating human-specific polymorphisms [7]. One of the most useful features of polymorphic TEs is that the ancestral state of these markers is known to be the absence of the insertion [10, 11]. Interestingly, such markers have never been used to study the genetic underpinnings of human isolated communities; therefore, this study is the first of a kind.
In the last decades, we have come to know much more about the impact of these elements on the genome and gene networks, and it has been shown that TE insertions can generate diversity in a variety of ways. For example, transposable elements have been linked to providing polyadenylation signals inducing the termination of transcripts [12], modifying splicing patterns, and providing new splicing sites [13], epigenetically affecting nearby genes [14, 15], acting as novel promoters, enhancers, and transcription factor building sites [16, 17], and often carrying their enhancers and promoters [18]. With their innate ability to act as disruptors and deregulators of gene expression, TE insertions have been associated with a variety of human diseases: for example, several cancer types [19, 20], hemophilia A and B [21, 22], some inheritable genetic diseases such as Dent’s disease or Duchenne muscular dystrophy [23], metabolic diseases [24], substance abuse, and central nervous systems diseases [25].
In particular, much interest has been given in recent years to the impact of transposable elements on the central nervous system [25–27]. Genome-wide approaches allowed researchers to study the role of transposable elements in stress-related learning mechanisms in rats [28], which have been used as a model for PTSD in humans [29]. Likewise, transposable elements have also been associated with alcoholism in humans using the same genome-wide approach [25].
In this study, we aim to reconstruct the TE variability of six isolates from Friuli-Venezia Giulia (North-Eastern Italy) thanks to the availability of whole sequencing data from 589 individuals [3]. Firstly, after determining the position and the genotypes of polymorphic TEs in these populations, we use them to evaluate the isolates’ structure, in the context of European and worldwide reference populations. Then, leveraging on the advantages offered by genetic and geographic isolates, we focused on exploring the potential association between non-reference polymorphic TEs, Body Mass Index (BMI) variations and behavioral traits of health and social relevance such as tobacco use and alcohol consumption. In fact, these traits could lead to an increased risk of developing addiction-related or metabolic diseases, such as tobacco use disorder, alcohol use disorder, and obesity [30–32].
Methods
Genomic and phenotypic data selection
The dataset used in this study was generated in 2008 [3–5] from the sampling of 611 individuals from six geographically and historically isolated villages in the Friuli-Venezia Giulia region of North-Eastern Italy, namely Sauris, Illegio, Resia, Erto, Clauzetto and San Martino del Carso (Fig. 1). As described in Cocca et al., 2020 [4], samples were “genotyped using the Human370CNV according to the manufacturer’s protocol (Illumina Inc., San Diego, CA, USA). The published data sets used in the analysis had been genotyped with different versions of Illumina beadchips”. Since a few of the individuals present in the dataset were duplicates (specifically, 22 individuals from Resia), and three missed village information, they were removed, leading the total number of analyzed individuals to 586. During the sampling, subjects were asked to fill out an anamnesis form to acquire more data on their general health and lifestyle habits. Phenotypic data on more than 70 traits was collected, also including food preferences, olfactory perception, gustatory perception and anthropometric measures. Since the form was administered at individual discretion, missing rates vary wildly between phenotypes and individuals. We chose to focus solely on phenotypes exhibiting a missing rate of less than 10% in our dataset, thus the traits included in our analysis were sex, age, alcohol consumption, smoking, as well as height and weight, from which we calculated the corresponding BMI (weight/height^2^). Phenotypes linked to specific diseases or health conditions, such as the occurrence of diabetes, displayed a missing rate exceeding 40%, and as such we chose to not include them in our analyses.
Fig. 1. Location of the six isolates in Friuli-Venezia Giulia, north-east of Italy (SMC = San Martino del Carso)
Transposable elements’ variant calling and filtering
Genomes were scanned in search of non-reference polymorphic TEs (Alus, LINE1s, and SVAs), using the Mobile Element Locator Tool (MELT) v2.2.2 [8]. The WGS data was aligned with bwa [33] to the human reference Human GRCh38, and the aligned reads were used as input for MELT. For the calling process, we used the MELT mobile element reference sequences and the collection of insertion sites discovered in Phase III of the 1000 Genomes Project (1KGP) as analysis priors. After the identification of these TEs, a self-customized Python script was applied to the resulting vcf files to calculate both allele and genotype frequencies of each TE for all the isolated villages. Allele frequencies were then analyzed for significant differences between villages with Fisher’s exact test, using a significant threshold of nominal p-value < 0.01 (“differentiated” TEs).
MELT provides gene names in RefSeq format: therefore, RefSeq accession numbers were converted to their respective Official Gene Symbol using the Database for Annotation, Visualization and Integrated Discovery (DAVID) (https://david.ncifcrf.gov/) [34], taking into consideration the specific gene region TEs were inserted in (Intron, Exon, Promoter, Terminator, 5’ UTR and 3’ UTR).
To compare TE diversity of the isolates with other human populations, we built a new dataset consisting of polymorphic TEs identified with MELT [8] that were present both in the six isolates and in the populations of the 1000 Genomes Project. This newly merged dataset contained a total of 2,814 genetic loci for 3,090 individuals from 32 populations. The populations were divided into 6 groups based on geographic macro areas, consistent with the super populations of 1KGP [35], i.e. Africa (AFR), America (AMR), East Asia (EAS), Europe (EUR), South Asia (SAS), plus the isolates from Friuli-Venezia Giulia (FVG).
In order to perform downstream analyses, TEs were coded as single nucleotide variants. In particular, since the original MELT output codes the absence of a TE with a single base, we substituted the presence of the insertion with another nucleotide base that was non-complementary to the previous one, according to the following pattern: A/T → C; C/G → A. Information about the true nature of each insertion was kept in the original vcf file. Variants were then filtered with PLINK v1.9 [36] as follows: (1) Removal of insertions located on sexual chromosomes or mitochondrial genome insertions, to retain only autosomal variability and removal of duplicates, using the --exclude option. (2) Exclusion of individuals and variants with >1% missing data with the commands --geno 0.01 (for variants) and --mind 0.01 (for individuals). (3) Removal of variants that did not respect the Hardy-Weinberg Equilibrium (HWE) with the option --hwe, setting a significant threshold of 0.01 using a Bonferroni Correction for multiple testing (threshold = 0.01/number of variants). This was done to ensure the removal of potential genotyping errors, a common procedure in large genomic datasets [37]. (4) Removal of variants with a minor allele frequency < 0.01 (--maf 0.01). (5) Removal of closely related individuals with an Identity by Descent (IBD) estimate higher than 0.25, using the --genome option to calculate the pairwise IBD estimates between every couple of individuals and --remove to exclude one of the two related individuals. Therefore, the final filtered dataset was made of 1,703 variants shared among 3,087 individuals.
Exploration of population structure
The generated dataset was then used to perform a series of analyses on TE insertions from the six isolates when compared to 1KGP groups. Both a Principal Component Analysis (PCA) and Admixture analysis were applied: PCA was performed after the conversion from the PLINK format (bed,* bim*,* fam*) with the convertf and smartpca tools of the EIGENSTRAT v6.0.1 package [38]. Admixture was implemented with the ADMIXTURE tool [39], testing between 2 and 23 potential ancestry components (K) and performing 50 iterations of each run to minimize the estimation error and maximize the log-likelihood of each ancestry estimate.
We then compared FVG isolates with other European populations, subsampling the original 1KGP dataset as follows: Utah residents with North-Western European ancestry (CEU), Finnish in Finland (FIN), British in England and Scotland (GBR), Iberian populations in Spain (IBS), Tuscans in Italy (TSI). PCA and Admixture analyses were implemented using the above approach, the only difference being that we tested a number K of putative ancestry components between 2 and 12. Finally, we used TreeMix [40] to construct a phylogenetic tree based on mobile elements’ variability and using representative populations from Europe, as well as the Yoruba as an outgroup.
Evaluation of genes under genetic constraint using GEMMA
As introduced in the “background” section, individuals were asked to fill out an anamnesis form, including information on their health status and lifestyle habits. With the aim of exploring the relationships between polymorphic TEs and specific phenotypes or traits from our isolates, we focused on the following: tobacco use, alcohol consumption, and body mass index (the latter was calculated as weight/height^2^). Being aware that our dataset lacks the power to perform a full-scale genome-wide association study, we focused our attention on a set of constrained genes. In particular, we collected measures of genetic constraints such as pLI (probability of loss of function intolerance) and missense Z score [41], RVIS (Genic Intolerance) [42] and SSC score (Singletons Score) [43] for prioritization. We considered as constrained those genes with pLI >0.9 or Missense Z score >95th percentile of the genomic distribution or RVIS < 5th percentile of the genomic distribution or SSC score < −2, finally resulting in a total of 4067 genes analyzed.
Then, we performed a series of tests using the software GEMMA [44, 45] by applying for all the considered phenotypes a univariate linear mixed model (uvLMM) for tests between a marker, a chosen phenotype, and any chosen covariates, while also correcting for the potential presence of population stratification (indeed a typical feature of isolates), and estimating genetic correlation among phenotypes [45]. GEMMA was applied to the full FVG dataset (12,709 TEs and 586 individuals) and three separate uvLMM analyses were performed, using sex and age as covariates: (1) BMI; (2) a binary alcohol drinker/non-drinker variable (set as “1” for drinker individuals and “0” for non-drinkers); (3) a binary smoker/non-smoker variable (using “1” for smokers and “0” for non-smokers). A fourth test on the smoker individuals was performed to evaluate the possible association between polymorphic TEs and the number of cigarettes smoked per day/number of years smoking. Then we selected only the transposable elements that fulfil the following criteria: they should map inside a gene, and that gene should have at least two indicators of genetic constraints; the resulting number of Gene/Alu used for the association analyses was 105.
Then TEs were tested using Wald’s test with a significant threshold of p-value < = 0.05/105 (0.000454).
Haplotype-based analyses
To better assess the importance of the identified TEs, we also performed a haplotype reconstruction/association test procedure on the significant variants from the alcohol and smoking tests detected with GEMMA using Beagle [46], choosing these traits as Beagle only performs association tests on binary variables. First, we selected regions of interest (10 kb upstream and downstream the significant TE, for a total of 20 kb) with VCFtools [47] and phased those regions with the software Beagle v5.1. The obtained vcf files were converted into the Beagle format with vcf2beagle (https://faculty.washington.edu/browning/beagle_utilities/) and the case status “smoking” or “alcohol” was included in the second row of the bgl files. Lastly, the association test on the reconstructed haplotypes was performed with Beagle v3.3.2 and the significant results were checked with the cluster2haps utility. In order to investigate a possible function for the identified TEs, we then cross-checked the significant results with the lists of polymorphic TEs acting as expression/alternative splicing quantitative trait loci produced by Cao and colleagues [48].
Results
TE variation distribution
After the analysis of polymorphic non-reference TEs with MELT v.2.2.2 [8], a total of 9,525 Alus, 2,283 LINE1s, and 901 SVAs were retrieved.
Then, allele frequencies were scanned for significant differences among the isolates: this way, a total of 3,987 TEs (31.37%) were identified as “differentiated”, of which 3,195 Alus (33.54%), 636 LINE1s (27.86%), and 156 SVAs (17.31%). When considering all comparison European populations, the corresponding rates of “differentiated” TEs are 53.45% (Alus), 58.63% (LINE1s) and 51.24% (SVAs).
Of these insertions, we also considered their location (Table 1).
Table 1. Significantly different polymorphic TEs between the six villages, divided by insertion location relative to gene region (with percentages) and TE superclassAluLINE1SVAINTRONIC1,281 (40,1%)242 (38%)65 (41,7%)PROMOTER138 (4,3%)23 (3,6%)10 (6,4%)TERMINATOR106 (3,3%)28 (4,4%)11 (7%)EXON38 (1,2%)6 (1%)6 (3,8%)3’-UTR39 (1,2%)6 (1%)3 (1,9%)5’-UTR18 (0,6%)2 (0,3%)0INTERGENIC1,575 (49,3%)329 (51,7%)61 (39,1%)TOTAL3,195636156
As expected, most polymorphic TE insertions are located in intronic and intergenic regions and only a negligible fraction are located in exonic regions (Supplementary Table S1). However, it is interesting to note that SVAs, which can be up to 3 kb long [49], are overall less frequent in intergenic sequences while they appear more often located in “functional” regions (regulators or exons) when compared to Alus and LINE1s. This finding corroborates the notion that SVA insertions have the innate potential to regulate gene expression through their location insertion and their sequence characteristics [50, 51].
TE as markers for population structure
Both TE-based PCA and Admixture show that, while closely related to other European populations, our isolates tend to cluster amongst themselves and are dominated by drift-induced ancestry components (Supplementary Figure S1). In particular, the first PC discriminates between African and non-African populations, while the second PC highlights a West-to-East geographical pattern including individuals from Friuli-Venezia Giulia, Europeans, Americans, South Asians, and East Asians.
The PCA including only European and FVG populations divides the two groups along the first PC, while the second component highlights the variability between the isolates, separating Resia and some individuals from Clauzetto and Sauris from the rest (Fig. 2A). As expected considering their geographical proximity and historical relatedness, Tuscans (TSI) and Central Europeans (CEU) are the closest groups to the FVG isolates. This PCA is similar to the one resulting in Esko et al. [3] based on SNPs. Looking at the second and third PCs, it is interesting to note that PC2 separates Resia from Clauzetto, while the third component highlights the differentiation between Sauris and Illegio. Instead, Erto, San Martino and most individuals from Clauzetto cluster together with the other European populations (Fig. 2B), hence suggesting a lower degree of isolation for these groups. Finally, looking at the Admixture graph (Fig. 2C; Supplementary Figure S2), models start “tidying” up at K = 7, with Erto and Clauzetto sharing their dominating ancestry, and reach the most supported model at K = 9 (Fig. 2C; Supplementary Figure S2; Supplementary Table S2). However, the “tidiest” model is for K = 8, as results for K = 9 and onward present excessive noise (Supplementary Figure S2), adding further ancestry components to European populations and to the African outgroup, which are unnecessary or even confounding for FVG populations. Interestingly, at K = 8, all isolated villages are dominated by their own ancestry component; on the contrary these village-specific components are present only marginally in the other European populations. Accordingly, the phylogenetic tree obtained with TreeMix (Supplementary Figure S4), from one side recapitulates well known relationships among the considered populations, from the other it shows that all FVG isolates cluster together. When migration edges were taken into consideration, the tree with 5 edges (explaining 99.9% of variation) showed no signals of gene flow from European populations to the Friulan isolates (Supplementary Figure S5).
Fig. 2A and B) PCA plots of European populations from 1KGP and FVG isolates, first against second component (A) and second against third component (B). C) Admixture barplots for K = 7–9
Association between TEs in constrained genes and selected phenotypes
We focused on genes under evolutionary constraint, based on the hypothesis that transposable elements (TEs) are more likely to influence phenotypes when inserted in genes subject to purifying selection. Therefore, we analyzed only genes that showed evidence of constraint according to at least two of the following metrics: pLI [41], missense Z-score [41], RVIS [42], or SSC score [43]. We found that some polymorphic TEs (Fig. 3) are possibly associated with the conditions detailed in Materials and Methods, and some of them also act as eQTLs/sQTLs. More in detail:
- Variations in Body Mass Index: three insertions were deemed significant, namely two Alus and one SVA). Notably, the SVA on chr17:49150166 is located in the gene SPAG9 (Sperm Associated Antigen 9), labelled as constrained by pLI, missense Z score, SSC score and RVIS (Fig. 3; Table 2).
- Alcohol consumption: one Alu was found to be significant in a genic region, the Alu on chr12:14020945 in the gene GRIN2B (Glutamate Ionotropic Receptor NMDA Type Subunit 2B; Fig. 3). This TE was previously identified as “differentiated” among the isolates and is generally widespread in our six villages (Table 2).
- Tobacco use (smoking): only the Alu on chr12:129970510 in TMEM132D (Transmembrane Protein 132D), labelled as constrained, was found associated (Fig. 3). This Alu is mostly widespread in the six considered villages (Table 2) and was identified as “differentiated” between the isolates (Table 2).
- A further test with GEMMA was performed on the “smoking” condition by taking into account the number of cigarettes smoked per day and the number of years smoking.). Significant results include the Alu on chr12:123580101, located in the gene PITPNM2(Phosphatidylinositol Transfer Protein Membrane Associated 2); and the Alu on chr18:29519986, located in the gene *TRAPPC8 *(Trafficking Protein Particle Complex Subunit 8) (Figure 3). The full GEMMA output for each association analysis, including per trait effect size, variance, standard error, and associated p-value of each significant insertion, is reported in Supplementary Table S3.
Absolute genotype frequencies of these insertions are reported in Table 2. The Alus in the genes SPAG9,* PITPNM2*, and TRAPPC8, despite being significant, appear to be rare, therefore we did not report the genotype frequencies in Table 2 (the percentage of individuals who carry the insertion is 1.9% for SPAG9, 0.18% for PITPNM2, and 0.26% for TRAPPC8).
We finally reconstructed haplotypes around the above mentioned TEs and performed haplotype-based association tests as described in Methods. We obtained two significant results, both for the alcohol phenotype, namely the two intergenic Alus on chr6:1257163 and chr6:161283170. In both cases, the associated haplotype is characterized by the presence of the mobile element. The first haplotype included 19 SNPs and one Alu (p-value = 0.00164); the second is a haplotype with 7 SNPs and the TE (p-value = 0.000335).
Fig. 3. Results for association tests in constrained genes. Genes are divided by tested traits (BMI variation, alcohol consumption, smoking, and number of cigarettes smoked per day/years smoking). Balloon size is proportional to the average constraint score of genes inside of it
Table 2. Absolute genotype frequencies of the two alus located in the genes GRIN2B, and TMEM132DResiaErtoIllegioSaurisSan MartinoClauzetto0\00\11\10\00\11\10\00\11\10\00\11\10\00\11\10\00\11\1 GRIN2B 536030273062337173539115270832515 TMEM132D 7557112230114231445382466717383812
Discussion
The study of isolated communities is at the basis of population genetics research [52, 53]. In fact, isolates yield genomes that show high homogeneity and are subject to similar environmental and cultural pressures, such as lifestyle habits, diet, sanitary conditions, and disease vectors. These populations also can be of help to study the phenotypic effects of variants that were otherwise only marginally present in larger populations [53]. In this picture, Italian isolates are particularly important, mainly because of the peninsula’s central role in human migrations since prehistoric times and of the high number of genetically distinct isolated communities that have been established throughout history [54]. Polymorphic TEs, which have previously been used as both variability and susceptibility markers only in “general” populations [7, 8, 25], are here applied for the first time to human isolates. Using the Mobile Element Locator Tool [8] more than 12,000 polymorphic TEs were identified in the six villages of Friuli-Venezia Giulia. These TEs were used as genetic markers to obtain a first overview of their potential impact on diversity and disease susceptibility in isolated populations, in particular: (1) to study communities’ differentiation; (2) to explore the genetic variability of the isolates; (3) and to analyze their possible role as genetic variants underlying susceptibility to different behavioral traits or medical conditions (tobacco use, alcohol consumption, and BMI variations).
Firstly, after calculating allele and genotype frequencies of the identified TEs, we found that of 12,709 TEs, 3,987 (31.37%) have significantly different allele frequencies between the six isolates (Fisher’s exact test, p-value < 0.01), while the corresponding rate in European comparison populations is 53.78%. Considering the much lower geographic dimensions of FVG compared to Europe, these values suggest the presence of genomic structure among the isolates.
Then, TEs were used as markers for exploratory population analyses, such as PCA, Admixture and TreeMix, to look at the general diversity and ancestry of FVG isolates in the context of European genetic variability, as represented by the polymorphic TE content of European populations from 1KGP [8]. Our results show that FVG isolates tend to cluster amongst themselves (PC1 in Fig. 2A and C, phylogenetic tree in Supplementary Figure S4), compared to European populations; however some differentiation between the isolates is evident, particularly for Resia and some individuals from Clauzetto (PC2 in Fig. 2B and C), as well as Sauris and Illegio (PC3 in Fig. 2B and C). Instead, Erto, San Martino and most individuals from Clauzetto overlap with the other European populations. These results agree with previous SNP-based studies, according to which, Clauzetto is the least isolated village among the six FVG isolates [3]; at the same time Clauzetto, Erto and San Martino have the lowest inbreeding coefficients among the villages [4]. The observed patterns of genetic variability and ancestry components could be explained by population structuring and genetic drift, a suggestion made also by previous works on the same dataset [3–5]. The observation of a strong correlation between SNP-driven results and TE-driven results in terms of population structure further highlights that the variability of polymorphic TE is mainly the result of demographic events.
To sum up, population structure analyses confirmed that on the whole, our populations show the typical marks of isolates also from the TEs point of view. As previously mentioned, due to their internal homogeneity both at genetic and social levels, they may be of help to perform genome-wide association studies. On the other hand, their relatively low census size implies a moderate number of available samples [and makes highly unlikely the availability of replication cohorts]. More importantly, the presence of population structure is well known to induce false positives in association studies. However, the impact of the observed structure is probably moderate or at least not higher than in association studies at a country level, as suggested by exploratory population analyses (PCA, Admixture) and proportions of “differentiated” TEs. In addition, the usage of GEMMA should overcome distortions due to population structure, as confirmed by the fact that only a minority (3/22) of the associated variants show significant differentiation among isolates.
In this context, polymorphic TE insertions are particularly worthy of investigation, being potential risk variants for several medically relevant phenotypes, because of their innate ability to act as deregulators of gene networks [15]. Notably, the link between transposable elements and the health of the Central Nervous System is not new [26, 27], with the effects of TEs being associated with stress, neurodegeneration, ageing, and drug abuse [25]. As such, TE markers can allow us to perform a first exploration of the medical susceptibility of individuals from the studied villages, by testing for association between TEs and phenotypes linked to behavioral and anthropometric traits.
Since the characteristics of our isolates (small sample size and lack of a replication cohort) do not allow to directly perform a “classic” genome-wide association study, we adopted a prioritization approach using only genes under putative purifying selection, followed by an association-like analysis to perform an initial exploration of this potential connection. Accordingly, we used GEMMA [44, 45] in order to obtain a first overview of the polymorphic TEs in constrained genes that could underpin the variability of selected phenotypes, i.e. tobacco use, alcohol consumption, height and weight, from which we calculated body mass index (weight/height^2^). For tobacco use, two separate analyses were run, the first comparing smokers with non-smokers, the second only on smoker individuals, testing for the association between the number of cigarettes smoked per day and the number of years smoking. In addition, sex and age were introduced in the models as covariates. Several TEs were deemed significant, some of which are located in known genes, as shown in Fig. 3: an SVA (chr17:49150166) in the gene SPAG9 (BMI variations); the Alu on chr12:129970510 in TMEM132D (tobacco use/smoking) and the Alu on chr12:14020945 in the gene GRIN2B (alcohol consumption). Interestingly, the alcohol consumption phenotype was deemed significant also for the haplotype-based association test performed with Beagle [46], and the haplotype including the polymorphic TEs appears as significantly linked with the status “alcohol drinker”. As for the amount of cigarettes/number of years smoking, it resulted as of interest two Alu insertions: one on chr12:123580101, in the PITPNM2 gene, and one on chr18:29519986, in the TRAPPC8 gene. Additionally, the insertions in TMEM132D and GRIN2B were also identified as “differentiated” when looking at genotype and allele frequencies between the isolates. For instance, the gene GRIN2B encodes a member of the ionotropic glutamate receptor superfamily and plays a major role in brain development and synaptic plasticity, with mutations in this gene often associated with neurodevelopmental disorders [55]. In addition, variants of this gene have been associated with alcohol and tobacco consumption [56], general risk-taking behaviors [57], opioid dependence [58], and several neurological disorders such as schizophrenia [59] and Alzheimer’s disease [60]. Furthermore, TMEM132D, encoding for a transmembrane protein, has already been associated with many neurological disorders such as anxiety and panic disorders [61] and general behavioral disinhibition, including alcohol consumption and dependence, illicit drug use, and nicotine use [62]. There is no strong, widely replicated GWAS evidence that common variants in TRAPPC8 and PITPNM2 are a major, direct driver of cigarette quantity. Despite TRAPPC8 is expressed in brain tissues and is a TRAPP complex subunit involved in vesicle trafficking/autophagy, no clear evidence in the literature highlights its possible role in smoking behaviour, despite future focus on rare variants could shed some light on it. On the other hand PITPNM2 is involved in brain responses to drugs (changes in expression/DNA methylation and some exploratory signals), which makes it a plausible candidate in addiction-related biology [63]. A future step would be looking into possible pleiotropic effects of these genes with common actors in smoking behaviors.
Conclusions
Polymorphic transposable elements emerge as a compelling avenue for elucidating human genetic diversity. The innovative use of polymorphic TEs as markers for genetic variability within isolated communities represents a promising methodological advancement. This study demonstrates the utility of polymorphic TEs in effectively encapsulating genetic variability and historical contexts among isolates, substantiated by congruent outcomes with prior investigations relying on single nucleotide variants [3–5]. While progress has been made, the comprehensive impact of transposable elements on the human genome remains incompletely understood, as does the cascade of effects on diverse phenotypes. This investigation identifies numerous TE insertions correlated with specific phenotypes, such as substance use and metabolic disorders. It is imperative to underscore the exploratory nature of our analyses, necessitating further empirical validation to establish definitive causal links between these insertions and medical susceptibility. Nevertheless, the identified insertions stand as pivotal points of interest, providing a foundational platform for subsequent research. Consequently, prospective studies should prioritize the validation of identified variants and engage in selection analyses to discern potential instances of natural selection within these isolated populations. This forward-looking research agenda holds significant promise for advancing our understanding of the intricate interplay between transposable elements and human phenotypic traits.
Supplementary Information
Supplementary Material 1
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Watkins WS, Feusier JE, Thomas J, Goubert C, Mallick S, Jorde LB. The Simons Genome Diversity Project: A Global Analysis of Mobile Element Diversity. Schaack S, editor. Genome Biology and Evolution [Internet]. 2020 [cited 2023 Oct 24];12:779–94. Available from: https://academic.oup.com/gbe/article/12/6/779/582822110.1093/gbe/evaa 086PMC 729028832359137 · doi ↗ · pubmed ↗
- 2Le Foll B, Piper ME, Fowler CD, Tonstad S, Bierut L, Lu L et al. Tobacco and nicotine use. Nat Rev Dis Primers [Internet]. 2022 [cited 2025 Sept 14];8:19. Available from: https://www.nature.com/articles/s 41572-022-00346-w 10.1038/s 41572-022-00346-w 35332148 · doi ↗ · pubmed ↗
- 3Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D et al. PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. The American Journal of Human Genetics [Internet]. 2007 [cited 2023 Oct 24];81:559–75. Available from: https://linkinghub.elsevier.com/retrieve/pii/S 000292970761352410.1086/519795 PMC 195083817701901 · doi ↗ · pubmed ↗
- 4Pickrell JK, Pritchard JK. Inference of Population Splits and Mixtures from Genome-Wide Allele Frequency Data. Tang H, editor. P Lo S Genet [Internet]. 2012 [cited 2025 Sept 14];8:e 1002967. Available from: 10.1371/journal.pgen.100296710.1371/journal.pgen.1002967 PMC 349926023166502 · doi ↗ · pubmed ↗
- 5Platzer K, Lemke JR. GRIN 2B-Related Neurodevelopmental Disorder. In: Adam MP, Feldman J, Mirzaa GM, Pagon RA, Wallace SE, Bean LJ, editors. Gene Reviews® [Internet]. Seattle (WA): University of Washington, Seattle; 1993 [cited 2024 Jan 29]. Available from: http://www.ncbi.nlm.nih.gov/books/NBK 501979/29851452 · pubmed ↗
- 6Saunders GRB, Wang X, Chen F, Jang S-K, Liu M, Wang C et al. Genetic diversity fuels gene discovery for tobacco and alcohol use. Nature [Internet]. 2022 [cited 2024 Jan 29];612:720–4. Available from: https://www.nature.com/articles/s 41586-022-05477-410.1038/s 41586-022-05477-4PMC 977181836477530 · doi ↗ · pubmed ↗
