# A Protocol to Extract a Specific Genomic Region from a Public Whole-Genome Database and Modify Analytical Bin Length for Population Genetic Studies

**Authors:** Muhammad Shoaib Akhtar, Shoji Kawamura

PMC · DOI: 10.3390/mps7040057 · Methods and Protocols · 2024-07-27

## TL;DR

This paper introduces a protocol to extract specific genomic regions from whole-genome data and adjust bin sizes for population genetic studies.

## Contribution

The paper presents a method for extracting targeted genomic regions from a multi-sample VCF file so that population genetic statistics can be computed over functional units of variable length rather than over uniform bins.
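To make the contrast between uniform and variable bin lengths concrete, here is a minimal sketch. The function names `fixed_bins` and `count_per_bin` are hypothetical and are not taken from the paper; intervals are assumed to be half-open `[start, end)` in the same coordinate system as the variant positions.

```python
def fixed_bins(start, end, width):
    """Tile [start, end) with uniform bins of the given width (last bin may be shorter)."""
    return [(s, min(s + width, end)) for s in range(start, end, width)]


def count_per_bin(positions, bins):
    """Count how many variant positions fall inside each half-open bin [s, e)."""
    return [sum(s <= p < e for p in positions) for s, e in bins]


# Hypothetical variant positions, for illustration only.
positions = [1, 5, 6, 9]

# Uniform bins ignore functional boundaries.
uniform = fixed_bins(0, 10, 4)

# Custom bins can follow functional units (e.g. an exon and an intron).
custom = [(0, 2), (2, 10)]

uniform_counts = count_per_bin(positions, uniform)
custom_counts = count_per_bin(positions, custom)
```

The same positions produce different per-bin counts depending on how the region is subdivided, which is why the choice of bin boundaries matters for downstream statistics.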

## Key findings

- A method was developed to extract targeted genomic regions from multi-sample VCF files.
- Tajima’s D analysis was successfully applied to intact genes, pseudogenes, and non-coding regions using this approach.
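For readers unfamiliar with the statistic, the sketch below computes Tajima's D from per-site alternate-allele counts using the standard formula from Tajima (1989). It is an illustration, not the authors' implementation; the function name `tajimas_d` and its input format are assumptions made for this example.

```python
from math import sqrt


def tajimas_d(alt_counts, n):
    """Tajima's D from per-site alternate-allele counts among n sampled chromosomes.

    Returns None when D is undefined: no segregating sites, or n < 4
    (the variance term vanishes for n = 2 and n = 3).
    """
    # Keep only segregating sites: the alternate allele is neither absent nor fixed.
    seg = [k for k in alt_counts if 0 < k < n]
    s = len(seg)
    if n < 4 or s == 0:
        return None

    # Mean number of pairwise differences, summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in seg)

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)

    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return None

    # Difference between the two estimators of theta, scaled by its standard deviation.
    return (pi - s / a1) / sqrt(variance)
```

An excess of rare variants (for example, all singletons) pushes D below zero, while an excess of intermediate-frequency variants pushes it above zero.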

## Abstract

With the advent of “next-generation” sequencing and the continuous reduction in sequencing costs, an increasing amount of genomic data has emerged, such as whole-genome, whole-exome, and targeted sequencing data. These applications are popular not only in mega sequencing projects, such as the 1000 Genomes Project and UK Biobank, but also among individual researchers. Evolutionary genetic analyses, such as the dN/dS ratio and Tajima’s D, are increasingly in demand for whole-genome-level population data. These analyses are often carried out under a uniform custom bin size across the genome. However, they require subdivision of a genomic region into functional units, such as protein-coding regions, introns, and untranslated regions, and computing these genetic measures for large-scale data remains challenging. In a recent investigation, we successfully devised a method to address this issue. This method requires a multi-sample VCF file containing population data, a reference genome, target regions in a BED file, and a list of samples to be included in the analysis. Because the targeted regions are extracted into a new VCF file, targeted population genetic analysis can then be performed on them. We conducted Tajima’s D analysis using this approach on intact genes and pseudogenes, as well as non-coding regions.
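The extraction step described in the abstract can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: the function names (`parse_bed`, `in_regions`, `extract_records`) are hypothetical, the VCF is assumed to be uncompressed and tab-delimited, and the reference genome input is not used here. In practice a dedicated VCF toolkit would usually handle this step.

```python
from collections import defaultdict


def parse_bed(bed_lines):
    """Map chromosome -> list of (start, end) intervals; BED is 0-based, half-open."""
    regions = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    return regions


def in_regions(regions, chrom, pos):
    """VCF POS is 1-based, so POS p lies in BED [start, end) when start < p <= end."""
    return any(start < pos <= end for start, end in regions.get(chrom, ()))


def extract_records(vcf_lines, regions, samples):
    """Return VCF lines restricted to the target regions and the chosen samples.

    Note: INFO fields such as AC/AN are copied unchanged, so they still
    describe the full sample set, not the subset.
    """
    wanted = set(samples)
    keep = None  # column indices to retain, set once the #CHROM line is seen
    out = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            out.append(line)  # meta-information lines pass through
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # The first 9 columns are fixed; sample columns follow.
            keep = list(range(9)) + [
                i for i, name in enumerate(fields) if i >= 9 and name in wanted
            ]
            out.append("\t".join(fields[i] for i in keep))
            continue
        if keep is None:
            raise ValueError("data line found before the #CHROM header line")
        if in_regions(regions, fields[0], int(fields[1])):
            out.append("\t".join(fields[i] for i in keep))
    return out
```

The coordinate conversion in `in_regions` is the detail most easily gotten wrong: a BED interval `chr1 100 150` covers VCF positions 101 through 150.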

## Full-text entities

- **Diseases:** injury to people or property (MESH:C000719191)
- **Chemicals:** CPU (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]
- **Cell lines:** S2 — Drosophila melanogaster (Fruit fly), Spontaneously immortalized cell line (CVCL_Z232)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11357298/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11357298/full.md

## References

32 references — full list in the complete paper: https://tomesphere.com/paper/PMC11357298/full.md

---
Source: https://tomesphere.com/paper/PMC11357298