# Merging High-Throughput, Amplicon-Based Second and Third Generation Sequencing Data: An Integrative and Modular Data Analysis Framework for Haplotype Prediction and Output Evaluation

**Authors:** Sylvia Mink, Christian Attenberger, Yannik Busch, Johanna Kiefer, Wolfgang Peter, Janne Cadamuro, Tim A. Steiert, Andre Franke, Christoph Gassner

PMC · DOI: 10.3390/ijms26073443 · 2025-04-07

## TL;DR

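This paper introduces an open-access, modular framework that processes amplicon sequencing data from both second generation (Illumina) and third generation (ONT) platforms to automate haplotype prediction, a task that otherwise requires extensive bioinformatic expertise.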

## Contribution

The novel contribution is an integrative, modular framework that automates and streamlines haplotype prediction using both Illumina and ONT sequencing data.

## Key findings

- The framework was validated with synthetic data and tested on real-life, partly homologous FUT1/2/3 sequencing data from 400 blood donors.
- It combines the accuracy of second generation (Illumina) sequencing with the long reads of third generation (ONT) sequencing.
- Haplotypes are frequency-ranked and discrepancies are color-coded for easy evaluation.
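
As a rough illustration of what frequency ranking of haplotypes can look like, here is a minimal Python sketch. It is not the framework's code: the function name `rank_haplotypes` and the haplotype labels are hypothetical, and the input is assumed to be one predicted haplotype label per observed chromosome.

```python
from collections import Counter


def rank_haplotypes(haplotypes):
    """Return (haplotype, count, frequency) tuples, most frequent first.

    `haplotypes` is assumed to hold one label per observed chromosome.
    """
    counts = Counter(haplotypes)
    total = sum(counts.values())
    # Sort by descending count; break ties alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(hap, n, n / total) for hap, n in ranked]


# Made-up labels, for illustration only (not data from the paper).
observed = ["hapA", "hapB", "hapA", "hapC", "hapA", "hapB"]
ranking = rank_haplotypes(observed)
for hap, n, freq in ranking:
    print(f"{hap}\t{n}\t{freq:.2f}")
```

Ties are broken alphabetically here only to keep the output deterministic; the paper summary does not say how the framework orders equally frequent haplotypes.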

## Abstract

Despite providing highly accurate results, the short reads generated by second generation sequencing have major limitations in mapping complex genomic regions. Longer reads can resolve these issues and additionally phase distant variants. The third generation sequencing platform ONT currently achieves the longest sequencing reads but falls short in sequencing accuracy. Additionally, deriving phased haplotypes from amplicon-based NGS data remains a complex and time-consuming task that requires extensive bioinformatic expertise. We constructed an integrative, open-access modular data-analysis framework that allows for automated processing of high-throughput sequencing data from both second (Illumina) and third generation (ONT) sequencing platforms, combining the strengths of both technologies. Variant information is automatically evaluated and color-coded for discrepancies. Haplotypes are listed by frequency. All parts of the framework can be used independently. The framework’s performance was validated using synthetic and tested with real-life data by analyzing partly homologous FUT1/2/3 sequencing data from 400 blood donors.
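
The abstract does not spell out how discrepancies are detected or which colors are used. One plausible reading is a per-variant comparison of the Illumina and ONT call sets, which could be sketched as follows. Everything in this block is an assumption made for illustration: the function name `compare_calls`, the status labels, the color mapping, the integer positions, and the genotype strings.

```python
def compare_calls(illumina, ont):
    """Classify each variant position by agreement between two call sets.

    Both arguments map a variant position to a genotype string.
    The key and value formats are hypothetical.
    """
    status = {}
    for pos in sorted(set(illumina) | set(ont)):
        a, b = illumina.get(pos), ont.get(pos)
        if a is None:
            status[pos] = "ont_only"
        elif b is None:
            status[pos] = "illumina_only"
        elif a == b:
            status[pos] = "concordant"
        else:
            status[pos] = "discordant"
    return status


# Hypothetical color scheme; the framework's actual colors are not given here.
COLORS = {
    "concordant": "green",
    "discordant": "red",
    "illumina_only": "yellow",
    "ont_only": "yellow",
}

# Made-up positions and genotypes, for illustration only.
illumina_calls = {101: "A/G", 202: "C/C", 303: "T/T"}
ont_calls = {101: "A/G", 202: "C/T", 404: "G/A"}

flags = compare_calls(illumina_calls, ont_calls)
for pos, state in flags.items():
    print(pos, state, COLORS[state])
```

Separating calls seen on only one platform from true genotype conflicts is a design choice of this sketch, made so that missing coverage is not confused with disagreement.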

## Linked entities

- **Genes:** FUT1 (fucosyltransferase 1 (H blood group)) [NCBI Gene 2523], FUT2 (fucosyltransferase 2 (H blood group)) [NCBI Gene 2524], FUT3 (fucosyltransferase 3 (Lewis blood group)) [NCBI Gene 2525]

## Full-text entities

- **Genes:** FUT3 (fucosyltransferase 3 (Lewis blood group)) [NCBI Gene 2525] {aka CD174, FT3B, FucT-III, LE, Les}, FUT1 (fucosyltransferase 1 (H blood group)) [NCBI Gene 2523] {aka H, HH, HSC}, HLA-A (major histocompatibility complex, class I, A) [NCBI Gene 3105] {aka HLAA}, HLA-C (major histocompatibility complex, class I, C) [NCBI Gene 3107] {aka D6S204, HLA-JY3, HLAC, HLC-C, MHC, PSORS1}, SMN2 (survival of motor neuron 2, centromeric) [NCBI Gene 6607] {aka BCD541, C-BCD541, GEMIN1, SMNC, TDRD16B}, CDS1 (CDP-diacylglycerol synthase 1) [NCBI Gene 1040] {aka CDS 1}, CYP2D6 (cytochrome P450 family 2 subfamily D member 6 (gene/pseudogene)) [NCBI Gene 1565] {aka CPD6, CYP2D, CYP2D7AP, CYP2D7BP, CYP2D7P2, CYP2D8P2}, HLA-B (major histocompatibility complex, class I, B) [NCBI Gene 3106] {aka AS, B-4901, HLAB}, SMN1 (survival of motor neuron 1, telomeric) [NCBI Gene 6606] {aka BCD541, GEMIN1, SMA, SMA1, SMA2, SMA3}, CFC1 (cryptic, EGF-CFC family member 1) [NCBI Gene 55997] {aka CRYPTIC, DTGA2, HTX2}, FUT2 (fucosyltransferase 2 (H blood group)) [NCBI Gene 2524] {aka B12QTL1, SE, SEC2, Se2, sej}
- **Diseases:** congenital heart defects (MESH:D006330), intestinal diseases (MESH:D007410), spinal muscular atrophy (MESH:D009134), cancer (MESH:D009369), injury to (MESH:D014947)
- **Chemicals:** betaine (MESH:D001622), BQBZ (-), agarose (MESH:D012685), H2O (MESH:D014867)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11990026/full.md

---
Source: https://tomesphere.com/paper/PMC11990026