DL-GapFilling: a novel deep learning framework for improved plant genome gap filling

Yu Chen; Zihao Wang; Gang Wang; Guohua Wang

PMC · DOI:10.1093/bib/bbag007·January 26, 2026

DL-GapFilling: a novel deep learning framework for improved plant genome gap filling

Yu Chen, Zihao Wang, Gang Wang, Guohua Wang

PDF

Open Access

TL;DR

DL-GapFilling is a deep learning framework that improves plant and algal genome assembly by efficiently filling gaps in genomic sequences.

Contribution

DL-GapFilling introduces a novel deep learning model and algorithm for more accurate and efficient genome gap filling.

Findings

01

DL-GapFilling outperforms traditional tools by filling 15.6% to 23.5% more gaps across multiple plant and algal datasets.

02

The framework improves both efficiency and accuracy compared to existing deep learning-based methods.

03

A PredictionFilter mechanism enhances assembly quality by retaining high-confidence predictions.

Abstract

Genome assembly has been a cornerstone of bioinformatics for decades, with faster and more accurate assembly of unknown genomes remaining a critical challenge. However, genome diversity, structural variations, insufficient sequencing depth, and limitations of current algorithms often lead to numerous gaps during assembly, hindering the construction of high-quality reference genomes. While various assembly methods and software tools have been developed, most exhibit low efficiency in gap filling and fail to account for the intrinsic structural properties of genomic sequences. Here, we present DL-GapFilling, a deep learning-based framework for genome assembly and gap filling. DL-GapFilling leverages a novel Deep Filling Neural Network model to efficiently extract and contextualize flanking sequence information, and incorporates the BeamStar contraction-expand algorithm, which integrates a…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species5

Thalictrum thalictroides(rue-anemone · species)Utricularia gibba(humped bladderwort · species)Eutrema salsugineum(saltwater cress · species)Micromonas pusilla(species)Oryza longistaminata(longstamen rice · species)

Chemicals1

DLGapCloser

Diseases1

LSTM

Figures8

Click any figure to enlarge with its caption.

DFillingNet network architecture for gene sequence prediction; DFillingNet comprising four key components: the input layer, the CNN layer, the BL-Res layer, and the output layer.

The PredictionFilter is a core screening mechanism designed to evaluate and refine the initial sequences predicted by the DFillingNet model; DFillingNet generates putative gap-filled sequences based on the input flanking regions; however, these raw predictions may contain errors; the PredictionFilter then applies stringent criteria (e.g. alignment score, continuity, and consistency with the original scaffold) to select the most accurate predictions for subsequent hybrid assembly.

The gap-filling effectiveness of the four methods on five datasets is shown; (a) indicates the number of gaps that can be filled using the four gap-filling methods on five datasets; (b) indicates the proportion of gaps that can be filled using the four methods on each dataset; DL-GapFilling can fill the largest number of gaps on all datasets; from the results, DL-GapFilling performs the best on all datasets and is able to fill the most gaps.

Mismatches in the four assembly methods; DL-GapFilling significantly reduced mismatches in T. thalictroides, and in O. longistaminata all gap-filling tools that incorporated DL resulted in an increase in mismatches, with a slight decrease in DL-GapFilling.

N’s numbers represent the total number of unidentified bases (N) in a genome assembly, which typically arise due to sequencing gaps or unresolved regions; a higher N’s number suggests lower assembly quality, indicating potential sequencing errors or difficulties in assembling repetitive regions.

N’s per 100 kb refers to the number of undetermined bases (N) within every 100 000 base pairs (bp) of a genome assembly, serving as an indicator of assembly quality; a lower value suggests fewer unresolved regions and a more complete assembly, reflecting the effectiveness of the gap-filling method; DL-GapFilling outperforms other methods by achieving the lowest N’s per 100 kb across five datasets.

In the O. longistaminata dataset, Fig. 8a shows the number of Nx, NAx, NGx, and NGAx for the three deep learning-based assembly methods; Figure 8b shows the cumulative length, cumulative length (aligned contigs), FRCurve (misassemblies), and genome fraction for the three deep learning-based assembly methods; Figure 8c shows the GC content distribution on the O. longistaminata dataset; Figure 8d shows the GC content profiles on the M. pusilla, T. thalictroides, E. salsugineum, and U. gibba datasets, respectively; the dotted line represents the actual GC content of the reference genome, and the solid line represents the result after assembly by the existing method; Figure 8e visualizes the distribution of GC content in the contigs of these four plant or algal datasets.

Tables1

Table 1. Ablation experiments and performance analysis of BSCEA algorithm and DFillingNet architecture on Micromonas pusilla dataset

Methods	Network structure	Prediction algorithm	Gap filling num	Gap filling rate(%)
Sealer	–	–	73	51.77%
GapPredict	LTSM	Beam search	77	54.61%
DLGapCloser	DGCNet(CNN+BiLTSM)	Wave-Beam Search	83	58.87%
GapPredict(BSCEA)	LTSM	BSCEA(16)	87	61.70%
Res-2-BiLTSM(BSCEA)	2-BiLTSM+Res	BSCEA(16)	88	62.41%
DL-GapFilling(16)	DFillingNet(CNN+2-BiLTSM+Res)	BSCEA(16)	91	64.54%
DL-GapFilling(32)	DFillingNet(CNN+2-BiLTSM+Res)	BSCEA(32)	94	66.67%
DL-GapFilling(64)	DFillingNet(CNN+2-BiLTSM+Res)	BSCEA(64)	95	67.38%

Equations13

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} {y}_{c_{out},t}=\sum_{c_{in}=1}^{C_{in}}\sum_{k=1}^K{w}_{c_{out},{c}_{in},k}\cdotp{x}_{c_{in},t\cdotp S+k-P}+{b}_{c_{Out}} \end{equation*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} {y}_i=\gamma \bullet \frac{x_i-{\mu}_B}{\sqrt{\sigma_B^2+\epsilon }}+\beta \end{equation*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} {h}_{BiLSTM1,t}=\left[{\overrightarrow{h}}_{BiLSTM1,t};{\overleftarrow{h}}_{BiLSTM1,t}\right] \end{equation*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} {\overrightarrow{h}}_{\!\!BiLSTM1,t}={LSTM}_{fwd}\left({h}_{conv}\left[t\right],{\overrightarrow{h}}_{BiLSTM1,t-1}\right) \end{equation*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} {\overleftarrow{h}}_{\!\!BiLSTM1,t}={LSTM}_{bwd}\left({h}_{conv}\left[t\right],{\overleftarrow{h}}_{BiLSTM1,t-1}\right) \end{equation*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} {h}_{BiLSTM2,t}=\left[{\overrightarrow{h}}_{BiLSTM2,t};{\overleftarrow{h}}_{BiLSTM2,t}\right] \end{equation*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} {\overrightarrow{h}}_{\!\!BiLSTM2,t}={LSTM}_{fwd}\left({h}_{conv}\left[t\right],{\overrightarrow{h}}_{BiLSTM2,t-1}\right) \end{equation*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} {\overleftarrow{h}}_{\!\!BiLSTM2,t}={LSTM}_{bwd}\left({h}_{conv}\left[t\right],{\overleftarrow{h}}_{BiLSTM2,t-1}\right) \end{equation*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} y\left[t\right]={h}_{BiLSTM2,t}+{h}_{conv1D,t} \end{equation*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} newG= currentG+{\sum}_{i=1}^n\big| \mathit{\ln}\left({probability}_i\right)\big| \end{equation*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} Heuristic(S)=\alpha \cdotp H(S)+\beta \cdotp normRemLen+\gamma \cdotp CGRatioImpact \end{equation*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} \varDelta C{G}_{weighted}=\left|\frac{C_{next}+{G}_{next}}{\ {L}_{next}}-\frac{C_{current}+{G}_{current}}{L_{current}}\right| \end{equation*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*} Heuristic(S)=&-\alpha \cdotp \frac{\sum_{i=1}^np\left({x}_i\right){\mathit{\log}}_bp(x) dx}{maxEntropy}\notag\\&+\beta \cdotp \frac{\mathit{\ln}\left( totalLength- SeedLength\right)}{\mathit{\ln}(totalLength)}\notag\\&+\gamma \cdotp \left|\frac{C_{next}+{G}_{next}}{\ {L}_{next}}-\frac{C_{current}+{G}_{current}}{L_{current}}\right| \end{align*}\end{document}

Funding2

—National Key Research and Development Program of China10.13039/501100012166
—National Natural Science Foundation of China10.13039/501100001809

Keywords

genomesassembliesgapsDLGapFilling

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGenomics and Phylogenetic Studies · Genome Rearrangement Algorithms · Genomics and Chromatin Dynamics

Full text

Introduction

The popularization of second-generation sequencing technology has made genome sequencing more efficient, and has promoted the determination of reference genomes for more species. However, the insufficient sequencing depth, repetitive sequence structure, and technical limitations of second-generation sequencing often result in unidentified gaps in genome assembly. Filling gaps usually requires additional sequencing data or computational prediction methods. The main methods currently include multi-software assembly, the use of reference genomes from related species, polymerase chain reaction amplification of end gaps, and improved assembly methods based on De Bruijn graph [1].

In the field of genome assembly, many tools have been developed to meet the needs of different sequencing technologies. For second-generation sequencing (short-read length) data, tools such as Velvet [2], ABySS2 [3], SPAdes [4], IDBA-UD [5], ALLPATHS [6], ALLPATHS-LG [7], SOAPdenovo2 [8], SKESA [9], MEGAHIT [10], and metaMDBG [11] focus on processing high-throughput short-read-length data, which are able to effectively deal with the problems of gaps and duplicate regions that may be encountered by the short-read-lengths in the assembly process. These tools utilize the high coverage of short-read length data to provide high-quality assembly results. In addition, to further enhance the quality of assembly, specialized gap filling tools such as HaploMerger2 [12], Sealer [13], GapFiller [14], TIGRA [15], Pilon [16], and FGAP [17] have been designed to fill the gaps during the assembly process and optimize genome contiguity by leveraging the paired-end information and local coherence of short-read lengths. However, the reliance of second-generation sequencing technologies on assembling short fragments to construct genome sequences imposes significant limitations, particularly in resolving repetitive sequences and structural variants within complex genomes. This underscores an urgent need for advancements in sequencing technologies capable of producing longer read lengths.

With the emergence of third-generation sequencing technologies (PacBio SMRT and Oxford Nanopore), long-read length assembly tools such as Canu [18], HiCanu [19], Flye [20], RFfiller [21], Wtdbg2 [22], Shasta [23], NextDenovo [24], and Verkko2 [25] were developed for this application, and the gap tools for its assembly are TGS-GapCloser [26], Racon [27], and LR GapCloser [28]. These tools are good at handling long-read length data and are able to span complex repetitive regions, which significantly improves the integrity and continuity of the genome. In terms of improving base-level accuracy, tools such as NextPolish [29], NextPolish2 [30], PEPPER-Margin-DeepVariant [31], DeepConsensus [32], Apollo [33], and DeepPolisher [34] can perform multiple rounds of polishing on the assembled sequence, thereby optimizing the sequence quality near gaps. In order to better combine the advantages of short- and long-read length data, hybrid assembly tools such as HybridSPAdes [35], OPERA-LG [36], MaSuRCA [37], Unicycler [38], HASLR [39], SAMBA [40], and DENTIST [41] were developed. These tools provide a more reliable basis for genome assembly by integrating different types of sequencing data and maximizing the high accuracy of short-read lengths and the coverage advantage of long-read lengths. In recent years, deep learning has made great progress in predicting unknown regions and repairing sequence deletions, which offers new possibilities for gap filling. For example, Multivariate Variational Mode Decomposition (MVMD) [42] extracts features from multi-channel electroencephalography (EEG) and combines it with long short-term memory (LSTM) for high-precision classification, demonstrating the potential of multivariate signal decomposition and deep learning in complex pattern recognition. In the post-processing stage of genome assembly, deep learning tools have also been increasingly applied: NextPolish [29], NextPolish2 [30], DeepConsensus [32], Apollo [33], and DeepPolisher [34] are used for error correction; DeepTrio [43] enables haplotype phasing; and DeepMicrobes [44] can identify and remove contaminated sequences. At the same time, the integrated hybrid assembly and post-processing tool Verkko2 [25] can achieve end-to-end optimization for assembly, gap filling, and phasing, improving the continuity and accuracy of complex eukaryotic genomes. Gappredict [45] uses LSTM based on the sequence information flanking the gap to predict and fill missing sequences, improving assembly quality, while DLGapCloser [46] integrates homologous genome mapping to optimize gap sequence prediction.

All of them can improve the assembly quality by expanding the assembly sequences to fill the gap. However, there are problems such as the prediction network structure involving only a single layer, which is too simple, and the sequence prediction algorithm only considers the probability value of each base. We propose a novel method, DL-GapFilling, which introduces three key innovations: (i) Deep Filling Neural Network (DFillingNet) for efficient extraction of flanking sequence information; (ii) the BeamStar Shrink-Expand Algorithm (BSCEA) for integrating cost functions and optimizing sequence predictions, enhancing gap-filling generalization and efficiency; and (iii) the PredictionFilter mechanism, which retains high-confidence predictions and mitigates the effect of low-quality sequences on assembly. Evaluation on multiple datasets demonstrates DL-GapFilling’s superior performance and robustness.

Materials and methods

Gaps in gene assemblies are often caused by sequencing errors, repetitive sequences, insertions, deletions and under-coverage, and this problem is further exacerbated by the limitations of the assembly algorithm. To address this problem, deep learning is employed to predict gap sequences by analyzing the flanking regions surrounding the gaps. The DL-GapFilling method uses the DFillingNet model, which consists of an input layer, a Convolutional Neural Network (CNN) layer, a BL-Res layer, and an output layer. The specific flowchart is shown in Fig. 1. The input layer encodes the DNA sequence in one-hot format to convert it into a high-dimensional vector. The CNN layer extracts local features, while the BL-Res layer employs stacked bidirectional long short-term memory (BiLSTM) units with residual connections to model long-range sequential dependencies. A one-dimensional CNN implements a residual connection between the BiLSTM layers to alleviate the gradient propagation problem. In the sequence expansion stage, the BeamStar contraction-expansion algorithm (BSCEA) is employed, which is an effective search algorithm for optimizing the expansion of the predicted sequence. In addition, a screening mechanism, PredictionFilter (Fig. 3), is proposed to effectively remove low-quality sequences generated during prediction, thereby improving the accuracy of gap filling.

DFillingNet network architecture for gene sequence prediction; DFillingNet comprising four key components: the input layer, the CNN layer, the BL-Res layer, and the output layer.

We use one-hot encoding to convert each base into a unique binary vector. The base information contained in the flanks on both sides of the gap is represented by one-hot encoding, in which each base (A, T, C, G) is converted into a four-dimensional vector.

Model architecture

DFillingNet consists of four components: the input layer, CNN layer, BL-Res layer, and output layer (Fig. 1). DNA sequences are encoded in one-hot format at the input layer, converting them into high-dimensional vectors. The CNN layer extracts local features and facilitates hierarchical learning, while max pooling, batch normalization (BN), and dropout enhance efficiency and prevent overfitting. The BL-Res layer captures bidirectional dependencies using stacked BiLSTM units and residual connections to alleviate the vanishing gradient problem. The output layer iteratively expands nodes to predict the sequence and fill gaps.

The CNN layer contains operations such as convolutional operations, local pooling, BN, dropout, etc. We use a convolution kernel size of 3 and an activation function of ReLU. In the input sequence $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \mathrm{X}=\left{{X}_1,{X}_2\dots, {X}T\kern0em \right},{X}t\in{R}^{d_x\times T} $\end{document}$ , the size of the convolution kernel is k, the number of convolution input channels is $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {d}h $\end{document}$ , $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ L $\end{document}$ is the length of the input sequence, and the output sequence is $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ Y\mathbf{\in}{R}^{C{out}\times{L}{out}} $\end{document}$ . For the $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {C}{out} $\end{document}$ th output channel at the $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ t $\end{document}$ th position, the convolution operation is given by:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} {y}_{c_{out},t}=\sum_{c_{in}=1}^{C_{in}}\sum_{k=1}^K{w}_{c_{out},{c}_{in},k}\cdotp{x}_{c_{in},t\cdotp S+k-P}+{b}_{c_{Out}} \end{equation*}\end{document}

where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {w}{c{out},{c}{in},k} $\end{document}$ is the weight of the convolution kernel, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {x}{c_{in},t\cdotp S+k-P} $\end{document}$ is the value in the input sequence involved in the convolution, and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {b}{c{out}} $\end{document}$ is the bias. The convolution operation extracts local features of the input data through local feature extraction, parameter sharing, and translation invariance, while reducing the number of parameters, thus improving computational efficiency.

BN mitigates the gradient problem in deep networks by normalizing the activation values to a zero mean and unit variance. The convergence is accelerated by reducing the range of variation of activation values in different layers. Its formula is as follows:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} {y}_i=\gamma \bullet \frac{x_i-{\mu}_B}{\sqrt{\sigma_B^2+\epsilon }}+\beta \end{equation*}\end{document}

where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {x}_i $\end{document}$ is the input eigenvalue, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {\mu}_B $\end{document}$ is the mean of the batch data, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {\mathrm{\sigma}}_B^2 $\end{document}$ is the variance of the small batch data, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \gamma $\end{document}$ is the scaling parameter for learning, and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \beta $\end{document}$ is the learnable offset parameter.

The BL-Res layer consists of two BiLSTM layers and a maximum pooling layer, where each BiLSTM layer consists of forward and inverse LSTMs for capturing bi-directional contextual information of the sequence data. The output of the CNN layer is used as an input to the first BiLSTM layer, which is processed by the forward LSTMs and the inverse LSTMs to obtain a bi-directional feature representation. The formula is as follows:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} {h}_{BiLSTM1,t}=\left[{\overrightarrow{h}}_{BiLSTM1,t};{\overleftarrow{h}}_{BiLSTM1,t}\right] \end{equation*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} {\overrightarrow{h}}_{\!\!BiLSTM1,t}={LSTM}_{fwd}\left({h}_{conv}\left[t\right],{\overrightarrow{h}}_{BiLSTM1,t-1}\right) \end{equation*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} {\overleftarrow{h}}_{\!\!BiLSTM1,t}={LSTM}_{bwd}\left({h}_{conv}\left[t\right],{\overleftarrow{h}}_{BiLSTM1,t-1}\right) \end{equation*}\end{document}

The second BiLSTM layer takes the output of the first layer as input and repeats the bidirectional processing to extract deeper bidirectional features. The corresponding formulas are given as follows:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} {h}_{BiLSTM2,t}=\left[{\overrightarrow{h}}_{BiLSTM2,t};{\overleftarrow{h}}_{BiLSTM2,t}\right] \end{equation*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} {\overrightarrow{h}}_{\!\!BiLSTM2,t}={LSTM}_{fwd}\left({h}_{conv}\left[t\right],{\overrightarrow{h}}_{BiLSTM2,t-1}\right) \end{equation*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} {\overleftarrow{h}}_{\!\!BiLSTM2,t}={LSTM}_{bwd}\left({h}_{conv}\left[t\right],{\overleftarrow{h}}_{BiLSTM2,t-1}\right) \end{equation*}\end{document}

In order to solve the problem of gradient vanishing that may be caused by the deep network, this article introduces the residual connection [47], which adds the output of the second BiLSTM layer with the output of the one-dimensional convolutional layer. The output is obtained via residual addition, where the output from the second BiLSTM layer is combined with the output from the 1D convolutional layer. Here, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {h}{conv}\left[t\right] $\end{document}$ denotes the feature vector at the ttt-th time step obtained from the convolutional layer. $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {LSTM}{fwd}\left(\cdotp \right) $\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {LSTM}{\mathrm{bwd}}\left(\cdotp \right) $\end{document}$ represent the forward and backward LSTM units, respectively. $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {\overrightarrow{h}}{!!BiLSTM1,t} $\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {\overleftarrow{h}}{!!BiLSTM1,t} $\end{document}$ denote the forward and backward hidden states of the first BiLSTM layer, and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {\overrightarrow{h}}{!!BiLTSM2,t} $\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {\overleftarrow{h}}_{!!BiLTSM2,t} $\end{document}$ denote those of the second BiLSTM layer.

Through the two BiLSTM layers, the model is able to obtain deep bi-directional feature representations for each time step, which are then used by the subsequent max-pooling layer to produce a fixed-length high-level feature vector. Finally, the output of the residual join is represented as:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} y\left[t\right]={h}_{BiLSTM2,t}+{h}_{conv1D,t} \end{equation*}\end{document}

To further reduce computational complexity while preserving key features, the output after residual concatenation is processed through a maximum pooling layer using a sliding window.

The encoded gene sequence information will be output in the Output layer after being processed by the CNN layer and the BL-Res layer. This layer is mainly for the expansion of the expansion node. The seed sequence will keep growing until it expands into a predetermined length of predicted sequence. Here we construct a new method for predicting sequences (BSCEA) algorithm.

BeamStar contraction–expansion algorithm

Beam search algorithms [48] are widely used for neural network decoding. Although their memory requirements are lower than those of breadth-first search, they may miss the global optimum. The A^*^ algorithm [49] combines path cost and heuristic evaluation, but it consumes a large amount of memory. Best-first beam search [50] improves the memory requirements by optimizing the scoring-based decoding strategy. Wave-beam search [46] adds the “contraction-expansion” pruning strategy to beam search, but it only considers probability values and may fail to find the global optimum. To address this problem, the BSCEA was developed, which incorporates additional influencing factors into the path evaluation process, optimizes path selection, improves algorithm performance, and enhances the quality of the results.

Currently, mainstream sequence prediction methods such as Hidden Markov Model (HMM) [51], Deep Learning (LSTM), or other machine learning methods [52] perform well when dealing with a large number of biological sequences, but the structural features of the sequences are usually neglected. To solve this problem, the BSCEA algorithm proposed in this article combines the A^*^ and BS algorithms in gene sequence prediction for the first time. BSCEA improves the search efficiency by dynamically adjusting the beam width: at the initial stage, the beam width is gradually relaxed to expand the search range; when it approaches the preset beam length, it starts to shrink to control the number of nodes and balance accuracy and resource consumption. At the end of the search, the algorithm selects the lowest cost among the candidate sequences as the final prediction result. Compared with the traditional BS and beam search, BSCEA retains more potentially superior solutions in the expansion phase to avoid early pruning, and accurately screens out the global optimal nodes in the contraction phase. As shown in Fig. 4, BSCEA integrates the multidimensional features of sequences to significantly optimize the performance of gene sequence prediction.

The cost function f(n) of the BSCEA algorithm comprises the actual cost g(n) and the heuristic cost h(n) [53]. The actual cost represents the cumulative uncertainty from the starting node to the current node n and is calculated as:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} newG= currentG+{\sum}_{i=1}^n\big| \mathit{\ln}\left({probability}_i\right)\big| \end{equation*}\end{document}

where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ currentG $\end{document}$ is the accumulated cost at the current node and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {probability}_i $\end{document}$ is the predicted probability of the iii-th base. The summation term quantifies the total prediction uncertainty along the path. The heuristic cost estimates the remaining cost to the target node and is evaluated as:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} Heuristic(S)=\alpha \cdotp H(S)+\beta \cdotp normRemLen+\gamma \cdotp CGRatioImpact \end{equation*}\end{document}

Here, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ H(S) $\end{document}$ is the sequence entropy, which quantifies sequence complexity [54]; normRemLen is the normalized logarithm of the remaining sequence length; and CGRatioImpact measures the guanine–cytosine (GC) content difference between the current sequence and the predicted sequence, reflecting biological plausibility [55]. The weighting coefficients $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \alpha $\end{document}$ , $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \beta $\end{document}$ , and γ balance these contributions. The GC content difference is defined as:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{equation*} \varDelta C{G}_{weighted}=\left|\frac{C_{next}+{G}_{next}}{\ {L}_{next}}-\frac{C_{current}+{G}_{current}}{L_{current}}\right| \end{equation*}\end{document}

where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {C}{current} $\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {G}{current} $\end{document}$ denote the counts of cytosine and guanine in the current sequence, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {C}{next} $\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {G}{next} $\end{document}$ denote the corresponding counts in the next seed sequence, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {L}{current} $\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {L}{next} $\end{document}$ are the lengths of the current and next sequences, respectively [55].

Integrating these components, the BSCEA heuristic function is formulated as:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*} Heuristic(S)=&-\alpha \cdotp \frac{\sum_{i=1}^np\left({x}_i\right){\mathit{\log}}_bp(x) dx}{maxEntropy}\notag\\&+\beta \cdotp \frac{\mathit{\ln}\left( totalLength- SeedLength\right)}{\mathit{\ln}(totalLength)}\notag\\&+\gamma \cdotp \left|\frac{C_{next}+{G}_{next}}{\ {L}_{next}}-\frac{C_{current}+{G}_{current}}{L_{current}}\right| \end{align*}\end{document}

where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ p\left({x}_i\right) $\end{document}$ is the predicted probability of the i-th base, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ maxEntropy $\end{document}$ is the maximum possible entropy for a sequence of length nnn, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ totalLength $\end{document}$ denotes the total predicted sequence length, and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ SeedLength $\end{document}$ is the length of the current seed sequence. This formulation integrates sequence complexity, remaining length, and GC content disparity to guide the algorithm in selecting an optimal and biologically plausible expansion path.

Overview of the DL-GapFilling method

Figure 2 schematically illustrates the DL-GapFilling workflow, which comprises three principal stages: (a) Data Preparation and Homology Mapping, (b) Gap Classification and Sequence Prediction, and (c) Hybrid Assembly and Error Correction.

DL-GapFilling workflow overview.

Stage 1: Data Preparation and Homology Mapping. Short-read sequencing data are preassembled into a scaffold using conventional tools (Fig. 2a). This scaffold is subsequently processed to extract gap information using utilities such as awk and sed. To identify homologous regions, the gap sequences are sorted and expanded using alignment tools including SAMtools [56] and BEDtools [57]. Homologous gene mapping is then performed leveraging the HomoloGene database and tools such as OrthoMCL [58] and OrthoFinder [59]. Concurrently, read lengths are categorized, and mapping indices are constructed using BioBloomMICategorizer [60] and BioBloomMaker [60], respectively (Fig. 2b). This stage culminates in the precise identification of target gap sequences and their flanking regions.

Stage 2: Gap Classification and Sequence Prediction. The identified gaps are categorically divided into two groups: Fixed gaps, which are amenable to resolution by traditional gap-filling tools, and Unfixed gaps, which are not and thus require advanced computational strategies (Fig. 2c). For the Unfixed gaps, the flanking sequence data are fed into the pretrained deep learning model DFillingNet (Fig. 1). The BSCEA algorithm (Table 1) is employed to generate initial, unfiltered sequence predictions to bridge these unresolved gaps.

Stage 3: Hybrid Assembly and Error Correction. The predicted sequences are integrated with the original short-read data in a hybrid assembly process to enhance genomic contiguity and completeness (Fig. 2e). To ensure high fidelity, a dedicated PredictionFilter mechanism (Fig. 2f) performs error correction by comparing the hybrid assembly output against the original scaffold. This critical step validates and refines the predictions, retaining only the most accurate sequences to ensure reliable genome reconstruction.

PredictionFilter mechanism for predicting sequences

The Exonerate tool was used to analyze the alignment files, with alignment rates sorted from 100% to 0%. Higher alignment rates indicate better sequence quality, whereas lower rates reflect poor matches. GapPredict and DLGapCloser directly append predicted sequences to the assembly without evaluating their quality, which may introduce errors. To address this issue, the PredictionFilter mechanism was developed. This mechanism compares the effectiveness of the predicted sequence in filling gaps within homologous gene regions relative to the original assembly. If the prediction improves gap filling, the sequence is classified as available; otherwise, it is discarded. This process ensures that only high-quality predictions are retained, thereby improving the reliability of the assembly (Fig. 3).

The PredictionFilter is a core screening mechanism designed to evaluate and refine the initial sequences predicted by the DFillingNet model; DFillingNet generates putative gap-filled sequences based on the input flanking regions; however, these raw predictions may contain errors; the PredictionFilter then applies stringent criteria (e.g. alignment score, continuity, and consistency with the original scaffold) to select the most accurate predictions for subsequent hybrid assembly.

After the PredictionFilter mechanism, the predicted sequences are divided into Available sequences and Disposable sequences. Subsequently, the original predicted sequences are combined with the next-generation sequencing data (NGS data) to form Assembly 1; meanwhile, the filtered Available sequences are combined with the NGS data to form Assembly 2. These two combinations enter the assembly step separately, and the assembly results (Assembly results 1 and Assembly results 2) are used to evaluate the effectiveness of the filtering mechanism. The PredictionFilter mechanism enhances assembly quality without altering existing deep learning models. By performing secondary optimization on predicted sequences, it maximizes the potential of current models, providing higher-quality inputs for genome assembly.

Results

Gap filling situation

In this experiment, five plant or algal genomes—Micromonas pusilla, Utricularia gibba, Eutrema salsugineum, Thalictrum thalictroides, and Oryza longistaminata—were selected for analysis. This study used the second-generation Illumina short-read DNA paired-end sequencing data format provided by National Center for Biotechnology Information. DL-GapFilling was evaluated on five different plant or algal datasets using three methods, Sealer [13], GapPredict [45], and DLGapCloser [46]—as baseline methods. The following sections compare the gap-filling performance of these tools across multiple dimensions.

In Fig. 4a, the left panel illustrates the number of gap fills, a key metric for evaluating gap-filling quality. DL-GapFilling significantly enhanced gap filling across all five datasets. Specifically, the number of gap fills improved from 73, 83, and 60 to 95, 93, and 69 for M. pusilla, U. gibba, and T. thalictroides, respectively. For O. longistaminata, the number increased from 48 to 76. In Fig. 4b, DL-GapFilling increased the fill rate by 15.6%, 6.1%, and 5.5% for M. pusilla, U. gibba, and T. thalictroides, respectively. For E. salsugineum, the fill rate improved by 16.7%, and for O. longistaminata, the rate increased from 40.3% to 63.9%, a 23.6% improvement over Sealer and 16% over GapPredict, and 11% over DLGapCloser.

The gap-filling effectiveness of the four methods on five datasets is shown; (a) indicates the number of gaps that can be filled using the four gap-filling methods on five datasets; (b) indicates the proportion of gaps that can be filled using the four methods on each dataset; DL-GapFilling can fill the largest number of gaps on all datasets; from the results, DL-GapFilling performs the best on all datasets and is able to fill the most gaps.

Mismatches in the four assembly methods; DL-GapFilling significantly reduced mismatches in T. thalictroides, and in O. longistaminata all gap-filling tools that incorporated DL resulted in an increase in mismatches, with a slight decrease in DL-GapFilling.

Ablation experiments

To evaluate the effectiveness of the DFillingNet model combined with the BSCEA algorithm, ablation experiments were conducted on the M. pusilla dataset. The experimental results show that the fill rate of GapPredict is improved to 61.70% after the introduction of BSCEA, which is 7.09% higher than that of traditional Beam Search, demonstrating the advantage of the dynamic contraction–expansion mechanism of BSCEA in optimizing sequence prediction. It especially performs better with an increase in the beam width. The fill rate of DL-GapFilling reaches 64.54%, 66.67%, and 67.38% at beam widths of 16, 32, and 64, respectively (Table 1).

Additionally, the DFillingNet structure significantly improves performance compared to the base LSTM. The fill rate of Res-2-BiLSTM (BSCEA) with the introduction of residual connectivity is improved by 0.71%, showing that residual connectivity helps to mitigate the vanishing gradient problem and enhance feature transfer. Further combining the CNN module with two-layer BiLSTM to construct DFillingNet, the fill rate is improved to 64.54%, which is 2.13% higher than that of Res-2-BiLSTM, verifying the CNN’s ability to extract local features and the advantages of the combination of BiLSTM and residual connectivity in sequence pattern learning. The overall experiment fully proves the superiority of the BSCEA algorithm and DFillingNet structure.

N’s numbers represent the total number of unidentified bases (N) in a genome assembly, which typically arise due to sequencing gaps or unresolved regions; a higher N’s number suggests lower assembly quality, indicating potential sequencing errors or difficulties in assembling repetitive regions.

N’s per 100 kb refers to the number of undetermined bases (N) within every 100 000 base pairs (bp) of a genome assembly, serving as an indicator of assembly quality; a lower value suggests fewer unresolved regions and a more complete assembly, reflecting the effectiveness of the gap-filling method; DL-GapFilling outperforms other methods by achieving the lowest N’s per 100 kb across five datasets.

In the O. longistaminata dataset, Fig. 8a shows the number of Nx, NAx, NGx, and NGAx for the three deep learning-based assembly methods; Figure 8b shows the cumulative length, cumulative length (aligned contigs), FRCurve (misassemblies), and genome fraction for the three deep learning-based assembly methods; Figure 8c shows the GC content distribution on the O. longistaminata dataset; Figure 8d shows the GC content profiles on the M. pusilla, T. thalictroides, E. salsugineum, and U. gibba datasets, respectively; the dotted line represents the actual GC content of the reference genome, and the solid line represents the result after assembly by the existing method; Figure 8e visualizes the distribution of GC content in the contigs of these four plant or algal datasets.

Missing and mismatches in assembly

While no current method can fully resolve all genome assembly gaps, algorithmic optimization continues to enhance assembly accuracy. Scaffolds often contain mismatches, unmapped bases, or numerous N bases in gap regions. Common errors—such as N bases, mismatches, and indels—are evaluated per 100 kb to assess assembly quality. Figure 5 compares mismatch rates across four methods. DL-GapFilling showed improvements for U. gibba and T. thalictroides, and performed comparably on E. salsugineum. For M. pusilla and O. longistaminata, it slightly increased mismatches but still outperformed other deep learning methods. Indel levels remained consistent across all methods, indicating that DL-GapFilling enhances mismatch correction without introducing new structural errors. With respect to the number of N bases, DL-GapFilling performs slightly worse than Sealer on the O. longistaminata dataset but outperforms both GapPredict and DLGapCloser. In contrast, DL-GapFilling surpasses Sealer, GapPredict, and DLGapCloser on the other four datasets, making it the optimal method (Fig. 6). The suboptimal performance on O. longistaminata is likely due to its highly repetitive genomic regions, which hinder sequence differentiation and affect deep learning-based methods. Figure 7 shows the average number of mismatches per 100 000 aligned bases (N’s per 100 kb). DL-GapFilling consistently outperforms the other methods across all datasets, achieving results comparable to Sealer on O. longistaminata.

Multi-scale advantages of DL-GapFilling

The assembly quality of several deep learning–based methods was comprehensively evaluated using the QUAST [61] assessment tool, which uses multiple metrics targeting different aspects of genome assembly performance. These metrics include Nx, NAx, NGx, NGAx, GC content, cumulative length (aligned overlapping groups), Feature Response Curve (FRCurve), and genome fraction. Figure 8a illustrates the distribution of Nx, NAx, NGx, and NGAx for three deep learning-based assembly methods, applied to the O. longistaminata dataset. Specifically, Nx represents the percentage of assembled bases in the longest contigs, while NAx (with “A” denoting alignment) integrates both the Nx values and the number of misassemblies identified through Plantagora. NGx (Genome Nx) represents the contig length at which x% of the reference genome is covered by contigs of that length or longer, and NGAx further segments contigs into aligned blocks to assess NGx statistics in these blocks. Among the evaluated methods, DL-GapFilling stands out, consistently outperforming both GapPredict and DLGapCloser across all metrics, indicating its superior gap-filling capabilities.

In Fig. 8b, we present additional metrics, including cumulative length, cumulative length (aligned contigs), FRCurve (misassemblies), and genome fraction, calculated for the three deep learning-based assembly methods on the O. longistaminata dataset. Cumulative length reflects the total length of all contigs or scaffolds in the assembly, while cumulative length (aligned contigs) specifically accounts for the contigs that can be aligned to the reference sequence. Notably, DL-GapFilling generates longer contigs, thus enhancing the overall assembly. The FRCurve offers a detailed evaluation of the number of misassemblies across different assembly results. In this comparison, DL-GapFilling demonstrates the lowest misassembly rate, confirming its ability to produce more accurate assemblies. Genome fraction measures the proportion of overlap between the assembly and the reference genome, providing insight into the coverage and completeness of the assembly. DL-GapFilling not only exhibits a higher overlap but also aligns more closely with the actual gene structure, further validating its effectiveness in maintaining genomic integrity.

Figure 8c presents the GC content distribution of the assembly results on the O. longistaminata dataset, offering an assessment of whether the assemblies preserve the genomic GC characteristics [62]. The horizontal axis represents the GC content percentage, and the vertical axis denotes the overlap group. The first graph provides a general overview of the GC content across all datasets, while the subsequent graphs focus on the GC distribution in the scaffolds generated by each assembly method. It is evident that DL-GapFilling generates results with a GC content distribution that is more consistent with the expected genomic characteristics.

Finally, Fig. 8d visualizes the GC content profiles for datasets from four additional plant or algal species: M. pusilla, T. thalictroides, E. salsugineum, and U. gibba. Here, the dotted line represents the actual GC content of the reference genome, and the solid line shows the results after assembly by existing methods. The comparison reveals that current assembly techniques fail to accurately reproduce the GC distribution, further supporting the rationale for incorporating GC content evaluation in the BSCEA. Figure 8e illustrates the distribution of GC content in the overlapping groups across these four plant or algal datasets, providing additional insights into the variation in GC content among different plant or algal genomes [63].

After the comprehensive analysis, it is evident that DL-GapFilling consistently outperforms existing assembly models across several critical evaluation metrics, including Gapclosed Number, Gapclosed (percent), mismatches per 100 kb, N’s numbers, N’s per 100 kb, and GC content. Specifically, DL-GapFilling surpasses both traditional assembly tools and those integrated with deep learning, demonstrating its clear superiority, particularly on plant or algal datasets. The method excels not only in improving the number of gaps closed but also in achieving a higher gap-filling percentage, which is crucial for enhancing the completeness of the assembled genomes. Additionally, DL-GapFilling shows a significant reduction in mismatches, indicating its ability to produce more accurate sequences with fewer errors. Furthermore, the GC content profile generated by DL-GapFilling aligns more closely with the expected values, showcasing its effectiveness in maintaining genomic integrity during assembly. This comprehensive performance across multiple quality metrics highlights DL-GapFilling as an optimal approach for genome assembly, particularly in plant or algal genomes, where traditional and current deep learning-based methods fall short.

Discussion

Our DL-GapFilling framework introduces a novel paradigm for complex gap resolution by strategically integrating deep learning. Instead of a computationally prohibitive end-to-end deep learning approach, we target deep learning specifically to gaps that resist conventional algorithms and complement it with a predictive filtering mechanism. This efficient hybrid strategy significantly improves assembly quality and pioneers a practical path for genome assembly powered by Artificial Intelligence (AI).

The remarkable success of DL-GapFilling lies in its innovative approach and key technical advancements. First, the hybrid network architecture combining residual networks and BiLSTM networks harnesses the advantages of contextual modeling, which is particularly effective for capturing the complex relationships present in genomic sequences. Second, the introduction of the BSCEA has proven to be instrumental in optimizing the sequence expansion process, significantly improving prediction efficiency while simultaneously reducing errors. The BSCEA algorithm’s ability to adaptively adjust gap expansion helps achieve more accurate gap filling and contributes to a more complete genome reconstruction. Third, the PredictionFilter mechanism further enhances the model’s performance by refining predicted sequences, ensuring that only high-quality predictions are incorporated into the final assembly, without the need for model retraining. These innovative approaches—combined with a carefully designed architecture—are what set DL-GapFilling apart from existing methods.

Despite the promising results, DL-GapFilling has several limitations that warrant further investigation. First, its performance has been validated primarily on plant or algal genomes, such as O. longistaminata, leaving its generalizability to animal and microbial genomes untested. Second, while the method addresses gaps in second-generation sequencing data, third-generation platforms introduce distinct challenges, which may demand specialized gap-filling strategies. Future work will focus on scaling DL-GapFilling to ultra-large and non-model genomes, as well as developing tailored approaches for third-generation data to improve assembly continuity and reference genome quality.

Key Points

We construct a DFillingNet model and propose a BSCEA algorithm based on A^*^ and BS, which for the first time considers the characteristics of the sequences themselves to perform a beamwidth dynamic adjustment strategy to improve the prediction accuracy.
For the first time, we introduce a filtering mechanism for the predicted sequences generated by deep learning and successfully improve the gap filling quality through the filtered predicted sequences.
We have constructed a new dataset and evaluation standard covering a wide range of organisms, which provides a reliable reference for gap filling in a wide range of genomes.

Bibliography63

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Lu P, Jin J, Li Z et al. P Gcloser: fast parallel gap-closing tool using long-reads or contigs to fill gaps in genomes. Evol Bioinform Online 2020;16:117693432091385.
2Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 2008;18:821–9. 10.1101/gr.074492.10718349386 PMC 2336801 · doi ↗ · pubmed ↗
3Jackman SD, Vandervalk BP, Mohamadi H et al. A By SS 2.0: resource-efficient assembly of large genomes using a bloom filter. Genome Res 2017;27:768–77. 10.1101/gr.214346.11628232478 PMC 5411771 · doi ↗ · pubmed ↗
4Prjibelski A, Antipov D, Meleshko D et al. Using SP Ades d novo assembler. Curr Protoc Bioinformatics 2020;70:e 102. 10.1002/cpbi.10232559359 · doi ↗ · pubmed ↗
5Peng Y, Leung HCM, Yiu SM et al. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 2012;28:1420–8. 10.1093/bioinformatics/bts 17422495754 · doi ↗ · pubmed ↗
6Butler J, Mac Callum I, Kleber M et al. ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res 2008;18:810–20. 10.1101/gr.733790818340039 PMC 2336810 · doi ↗ · pubmed ↗
7Gnerre S, Mac Callum I, Przybylski D et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci U S A 2011;108:1513–8. 10.1073/pnas.101735110821187386 PMC 3029755 · doi ↗ · pubmed ↗
8Luo R, Liu B, Xie Y et al. SOA Pdenovo 2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 2012;1:18. 10.1186/2047-217X-1-1823587118 PMC 3626529 · doi ↗ · pubmed ↗