Population dataset for 23 Y-STR in the Merkit clan form Kazakh population
Bekzhan Faizov, Alizhan Bukayev, Zhaxylyk Sabitov, Maxat Zhabagin

TL;DR
This study analyzes Y-STR data from the Merkit clan in Kazakhstan to better understand their genetic structure and lineage.
Contribution
The study provides a new population dataset of 23 Y-STR markers for the Merkit clan, submitted to a forensic database.
Findings
64 complete haplotypes were generated and submitted to YHRD.
Four distinct haplogroup clusters were identified in the Merkit clan.
The dataset enhances understanding of Kazakh genetic structure and forensic genetics.
Abstract
This study presents a comprehensive analysis of 23 Y-STR data for the Merkit clan, a subgroup within the Kerey tribe of the Kazakh people. A total of 64 complete haplotypes were generated using the PowerPlex Y23 System. The data obtained using 23 Y-STR markers has been submitted to the Y Chromosome Haplotype Reference Database (YHRD) at yhrd.org, which will significantly enhance the forensic database for the Kazakh population in Kazakhstan. The research focuses on the distribution of haplotypes within the clan and their genealogical lines, which were visualized using a Median-joining network and Multidimensional scaling plot. The study identifies four distinct haplogroup clusters, revealing important insights into the genetic makeup and historical lineage of the Merkits. This dataset not only enriches our understanding of Kazakh genetic structure but also holds significant value for…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsForensic and Genetic Research · Forensic Anthropology and Bioarchaeology Studies · Animal Diversity and Health Studies
Specifications TableSubjectGeneticsSpecific subject areaForensic Genetics, DNA Profiling, Population GeneticsData formatRaw and AnalyzedType of dataSupplementary tables (5), table (1) and figures (3)Data collectionSaliva samples were collected using the Oragene DNA (OG-500) Self-Collection Kit (DNA Genotek, Canada). Samples were obtained from healthy Kazakh male volunteers, each confirmed not to be related to any other participant for at least three generations and knowledgeable about their clan affiliation. DNA isolation from the saliva was performed using the prepIT-L2P kit (DNA Genotek, Canada). The DNA concentration and quality were determined using a Qubit 2.0 Fluorometer and NanoDrop One Spectrophotometry, respectively (both from Thermo Fisher Scientific, USA). PCR amplification was performed using the PowerPlex Y23 System (Promega, USA) in a reaction mixture on a SimpliAmp Thermal Cycler (Thermo Fisher Scientific, USA). Electrophoresis for separating PCR products was conducted using WEN Internal Lane Standard 500 (Promega, USA) in Hi-Di Formamide on an 8 capillary Applied Biosystems 3500 genetic analyzer with POP-4 polymer, Cathode and Anode buffers (Thermo Fisher Scientific, USA). STR allele calls in electropherograms were determined using GeneMapper IDx v.1.6 software. Further data analysis was conducted using Microsoft Office Excel, Arlequin program, STRAF Software 2.1.5, and Haplomatch tool.Data source locationLaboratory of Human Genetics, National Center for Biotechnology, Astana, 010000, KazakhstanData accessibilityRepository name: YHRD - Y-chromosome haplotype research databaseData identification number: YA006027Direct URL to data: https://yhrd.org/details/contribution/6027Instructions for accessing these data: All data are available with the article as supplementary files.
Value of the Data
1
- •The data of 23 Y-STR profiles from the Merkit clan reveals their paternal signature within the Kazakh population, illuminating sub-population differentiation and enriching our understanding of Kazakh genetic structure.
- •Contributing Merkit clan Y-STR profiles to the Y-chromosomal Haplotype Reference Database enhances the accuracy and discriminatory power of DNA evidence in kinship analysis, criminal investigations, and missing persons’ cases for forensic laboratories worldwide.
- •This dataset is a valuable resource for anthropological, population genetic, and evolutionary research, providing insights into the unique genetic makeup of the Kazakh population and illuminating the genetic connections between Kazakhs and other global populations.
- •The dataset, encompassing 23 Y-STR profiles from the Merkit clan, significantly contributes to the advancement of genetic genealogy, providing interdisciplinary insights to elucidate the demographic and historical events.
Background
2
Y-chromosomal short tandem repeats (Y-STRs) are integral tools in forensic science, providing critical insights for criminal investigations and legal proceedings. Additionally, these genetic markers are extensively utilized in genealogical research, anthropology, and population genetics, aiding in the tracing of paternal lineages and understanding population structures and migrations. In Kazakh society, the affiliation of each male individual to a particular clan is a cultural practice that has been preserved. Each clan is characterized by its own distinct history, origin, and genealogy, which is traditionally traced back to a shared common ancestor. However, these ethnographic records often vary. Today, advancements in genetic research, particularly the analysis of the Y-chromosome, provide a scientific means to verify and augment these ethnographic data. Consequently, there is a pressing need to delineate the genetic diversity of Y-chromosomes among Kazakh clans. In this study, we have presented a dataset of haplotypes based on 23 Y-STR markers specifically for the Merkit clan. These datasets expand the Y-chromosome haplotype research database, providing valuable resources for both forensic genetics and genetic genealogy.
Data Description
3
In this brief article, we present 23 Y-STR data for the Merkit clan, a subgroup within the Kerey tribe of the Kazakh people. A total of 64 complete haplotypes were generated using the PowerPlex Y23 System (Promega, USA). The Merkits have been an integral part of the broader narrative of Mongolia and Central Asian history [1]. Their lineage, often intertwined with the vast and complex history of nomadic tribes in the region, offers a rich context for genetic studies. Despite their historical significance, there has been a notable gap in genetic research specifically focused on the Merkit clan. This study aims to bridge this gap by providing paternal genetic data, which is crucial for understanding not only the genetic makeup of the Merkits, but also their relationship and distinction within the wider Kazakh population. The lack of prior genetic studies on the Merkit clan underscores the importance of this research, as it contributes to a more comprehensive understanding of the genetic diversity and historical migrations of Central Asian nomadic tribes. Furthermore, this genetic data holds immense value for forensic genetics. The unique Y-STR profiles of the Merkit clan can significantly enhance the accuracy and reliability of DNA-based analyses in forensic science.
The distribution of 23 Y-STR haplotypes within the Merkit clan from Kazakhstan, involving 64 individuals, is detailed in Supplementary Table 1. Analysis of haplotype frequencies identified 33 distinct haplotypes, as shown in Supplementary Table 2. Among these, 7 haplotypes were shared by pairs of individuals, and one haplotype was shared by groups of three, four, and six individuals. The two most prevalent haplotypes were each shared among eight individuals. Forensic parameters that describe the Merkit clan were computed, including Discrimination Capacity (DC = 52%), Haplotype Match Probability (HPM = 0.06), and Haplotype Diversity (HD = 0.96). Allele frequencies and forensic parameter values for 23 individual Y-STR loci in the Merkit clan are presented in Fig. 1 and Supplementary Tables 3 and 4. The Y-STR profiling revealed 80 alleles at single-copy loci, with frequencies ranging from 0.02 to 0.95. The minimum number of allelic variants (n = 2) was observed at the DYS393, DYS437, and DYS456 loci. The DYS481 locus exhibited the highest number of allelic variants (n = 11). Gene diversity (GD) among the Merkits varied from 0.09 at DYS438 to 0.85 at DYS481. Contrary to the standard loci, the average gene diversity of the six rapidly mutating single-copy loci (GD = 0.51) was better compared to the fifteen standard single-copy loci (GD = 0.46). The lowest GD values in rapidly mutating loci were observed at DYS533 (GD = 0.30) and DYS570 (GD = 0.28). Among the standard loci, the highest GD values were recorded at DYS439 (GD = 0.75) and DYS635 (GD = 0.73), which are on par with those observed in rapidly mutating loci. The frequency distribution of specific haplotypes and forensic parameters for the DYS385a/b locus in the Merkits are detailed in Supplementary Table 4, with the locus showing 7 haplotype combinations from 7 distinct alleles. The gene diversity of the DYS385a/b locus is 0.68. One copy number variation was identified in the Merkit clan at the DYS19 locus (alleles 15, 16).Table 1. Indicators of similarity between the founder haplotypes of the Merkits' haplogroup clusters and those of the surrounding Kazakh tribes were calculated using the Haplomatch program.Table 1. HapotypeHaplogroup ClusterMatch TribeSample sizeexact matchesone-step neighborstwo-step neighborsthree-step neighborsHP1I2a1Zhalayir930006HP2N1a2Kerei1413100HP4O2a2Naiman1611721168HP7C2a1a3Kerei101561722Fig. 1Distribution of allelic frequencies per locus on 23 Y-STR loci in Merkit clan form Kazakh population using STRAF software.Fig 1
The distribution of haplotypic diversity within four genealogical lines of the Merkit clan (Almambet, Kulsary, Lepes, Shagyr) is visualized on a Median-joining network for 21 Y-STRs, as presented in Fig. 2. Four haplogroup clusters represent the Merkit haplotypes: C2a1a3-F1918, I2a1-P37.2, N1a2-L666, O2a2-P201. A search for founder haplotypes in the Kazakh haplotype database reveals matches within three mutational steps in other Kazakh tribes (Supplementary Table 2). This indicates genealogical-historical connections between them. In the genetic space, depicted in Fig. 3, the Merkits occupy a distinct position, clearly not forming a genetic subgroup within any tribe.Supplementary Table 5 presents pairwise genetic distances (RST) between Kazakh clans based on 17 Y-STRs. The closest genetic distance is from the Sunak clan (d= 0.1), while the most distinct genetic distances are from the Argyn (d=0.51), Kerey (d=0.45), and Konyrat tribes (RST=0.40).Fig. 2. Median-joining network for the haplotypes of 64 Merkit clan members from the Kazakh population was constructed from data on 21 Y-STRs (DYS385a/b loci were excluded). The colors represent the genealogical lineages of the Merkit clan. Circles represent haplotypes, with the area proportional to the sample size, and lines between them proportional to the number of mutational steps. Haplotype clusters are delineated with a dotted circle and annotated according to haplogroups.Fig 2. Fig. 3Multidimensional scaling plot based on pairwise genetic distance (RST) between Kazakh clans on 17 Y-STRs.Fig 3
Experimental Design, Materials and Methods
4
Sample Collection
4.1
To conduct the research, 64 saliva samples were collected from healthy Kazakh male volunteers. The collection of samples was conducted in accordance with the criteria of the population-genetic biobank [2]. Each volunteer provided informed consent by signing a consent form and answered questions regarding their ancestors' origins to participate in the study. The volunteers, each with knowledge of their clan and genealogical lineage, were confirmed not to be related to each other for at least three paternal generations. Saliva samples were collected using an Oragene DNA (OG-500) Self-Collection Kit (DNA Genotek, Canada) according to the manufacturer's protocol.
DNA Isolation, Amplification and STR Genotyping
4.2
DNA was isolated from saliva samples using the prepIT-L2P kit (DNA Genotek, Canada). The DNA concentration was subsequently determined using a Qubit 2.0 Fluorometer (Thermo Fisher Scientific, USA), and DNA quality was assessed via NanoDrop One Spectrophotometry (Thermo Fisher Scientific, USA). PCR (Polymerase Chain Reaction) amplification was performed in a reaction mixture utilizing the PowerPlex Y23 System (Promega, USA) on a SimpliAmp Thermal Cycler (Thermo Fisher Scientific, USA). The PCR products were separated by electrophoresis with WEN Internal Lane Standard 500 (Promega, USA) in Hi-Di Formamide (Thermo Fisher Scientific, USA) using an 8 capillary Applied Biosystems 3500 genetic analyzer with POP-4 polymer, and both Cathode and Anode buffers (Thermo Fisher Scientific, USA). Control DNA 007 (Thermo Fisher Scientific, USA) served as a positive control, while ddH2O was used as a negative control for each genotyping batch. The PowerPlexY23 System (Promega, USA) encompasses 17 conventional Y-STR markers (DYS19, DYS385 a/b, DYS389I/II, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS448, DYS456, DYS458, DYS635, Y-GATA-H4) along with 6 loci known for their high mutation rates (DYS481, DYS533, DYS549, DYS570, DYS576, DYS643). Concurrently with each sample batch, negative and positive controls were run. Samples exhibiting non-standard patterns, off-ladder, and microvariant alleles were repeated. Our laboratories have successfully completed the YHRD Quality Control Test (YC000343) for contributing haplotype data. Following the guidelines for population genetic data [3], the haplotypes were submitted to the YHRD (http://www.yhrd.org) under accession number YA006027. The population genetic data is included in Supplementary Table 1.
Data Analysis
4.3
STR allele calls were determined from electropherograms using GeneMapper IDx v.1.6 software. Haplotype frequencies were calculated using the Arlequin program ver 3.5 [4]. We calculated the number of distinct haplotypes, the frequency of unique haplotypes, discrimination capacity, haplotype match probability, and haplotype diversity through direct counting by Microsoft Office Excel. Haplotype diversities were computed using the formula HD = n*(1 − ∑p_i_˄2)/(n − 1), where n represents the sample size and pi the frequency of the i-th haplotype [5]. Haplotype match probability (HMP) was determined using the sum of squared observed haplotype frequencies. Discrimination capacity (DC) was calculated as the ratio of the total distinct haplotypes to the number of haplotypes. Forensic parameters, such as Random Match Probability (RM), Power of Discrimination (PD), Gene Diversity (GD), Polymorphism Information Content (PIC), Power of Exclusion (PE), Typical Paternity Index (TPI), and frequency for each locus, were calculated using the STRAF 2.1.5 [6]. The YHRD website's 'AMOVA and MDS' online tool (http://www.yhrd.org) facilitated the computation of pairwise genetic distances (RST) and multidimensional scaling (MDS). Affiliation of haplotypes to haplogroups was assessed using Nevgen Y-DNA haplogroup predictor (https://www.nevgen.org/) and by comparing haplotypes from published data. The analysis incorporated an extensive dataset, which included 3306 haplotypes for minimum 17 Y-STRs from existing studies on population samples of Kazakhs [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17]. In the search for similar haplotypes, the study employed the Haplomatch tool [18] to analyze Y-chromosome STR haplotypes. Median-joining networks were generated using NETWORK v5.0.1.0 and NETWORK Publisher v2.1.2.5 [19]. The DYS385a/b loci were excluded from network construction due to the inability to associate particular alleles with specific copies.
Limitations
Not applicable.
Ethics Statement
This research was approved by the Ethics Committee of the National Centre for Biotechnology (protocol No. 5 of 16 October 2020) and was performed following the standards of the Declaration of Helsinki 1964. All participants in this study provided their written informed consent prior to sample collection.
CRediT authorship contribution statement
Bekzhan Faizov: Investigation, Writing – original draft, Visualization. Alizhan Bukayev: Investigation, Validation, Project administration. Zhaxylyk Sabitov: Conceptualization, Resources, Data curation, Funding acquisition. Maxat Zhabagin: Methodology, Software, Formal analysis, Writing – review & editing, Supervision.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Serruys H.A note on the “Wild Mekrid Monum. Serica 221963434445
- 2Balanovska E.V.Zhabagin M.K.Agdzhoyan A.T.Chukhryaeva M.I.Markina N.V.Balaganskaya O.A.Skhalyakho R.A.Yusupov Y.M.Utevska O.M.Bogunov Y.V.Asilguzhin R.R.Dolinina D.O.Kagazezheva Z.A.Damba L.D.Zaporozhchenko V.V.Romanov A.G.Dibirova K.D.Kuznetsova M.A.Lavryashina M.B.Pocheshkhova E.A.Balanovsky O.P.Population biobanks: Organizational models and prospects of application in gene geography and personalized medicine Russ. J. Genet.521220161227124310.1134/s 1022795416120024 · doi ↗
- 3Gusmão L.Butler J.M.Linacre A.Parson W.Roewer L.Schneider P.M.Carracedo A.Revised guidelines for the publication of genetic population data Forensic Sci. Int. Genet.30201716016310.1016/j.fsigen.2017.06.00728673498 · doi ↗ · pubmed ↗
- 4Excoffier L.Lischer H.E.Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows Mol. Ecol. Resour.103201056456710.1111/j.1755-0998.2010.02847.x 21565059 · doi ↗ · pubmed ↗
- 5Nei M.Tajima F.Genetic drift and estimation of effective population size Genetics 983198162564010.1093/genetics/98.3.62517249104 PMC 1214463 · doi ↗ · pubmed ↗
- 6Gouy A.Zieger M.STRAF-A convenient online tool for STR data evaluation in forensic genetics Forensic Sci. Int. Genet.30201714815110.1016/j.fsigen.2017.07.00728743032 · doi ↗ · pubmed ↗
- 7Abilev S.Malyarchuk B.Derenko M.Wozniak M.Grzybowski T.Zakharov I.The Y-chromosome C 3* star-cluster attributed to Genghis Khan's descendants is present at high frequency in the Kerey clan from Kazakhstan Hum. Biol.8412012798910.3378/027.084.010622452430 · doi ↗ · pubmed ↗
- 8Balanovsky O.Zhabagin M.Agdzhoyan A.Chukhryaeva M.Zaporozhchenko V.Utevska O.Highnam G.Sabitov Z.Greenspan E.Dibirova K.Skhalyakho R.Kuznetsova M.Koshel S.Yusupov Y.Nymadawa P.Zhumadilov Z.Pocheshkhova E.Haber M.Zalloua P.A.Yepiskoposyan L.Dybo A.Tyler-Smith C.Balanovska E.Deep phylogenetic analysis of haplogroup G 1 provides estimates of SNP and STR mutation rates on the human Y-chromosome and reveals migrations of Iranic speakers P Lo S One 1042015 e 012296810.1371/journal.pone.0122968 PMC 438882725849548 · doi ↗ · pubmed ↗
