Time‐Efficient and Informatic‐Skill‐Light Gap‐Filling for Telomere‐to‐Telomere Genome Assembly

Dong Xu; Xianjia Zhao; Lianguang Shang; Shaolong Tian; Yanchun Li; Huaming Wen; Qiang Xu; Dongxi Li; Weihua Pan

PMC · DOI: 10.1002/advs.202518319 · February 5, 2026

Open Access

TL;DR

GapSuite is a user-friendly software toolbox that helps biologists with limited computational skills to efficiently fill genome assembly gaps using simple mouse clicks.

Contribution

GapSuite introduces two tools, Gap-Aid and Gap-Graph, enabling efficient and accessible telomere-to-telomere genome assembly with minimal bioinformatics expertise.

Findings

1. GapSuite allows users to perform gap‐filling on personal computers with minimal computational skills.

2. The tools were validated using Arabidopsis thaliana, rice, human, and simulated genomes.

3. GapSuite was used to construct the first T2T genome of rice 9311 and fill gaps in a poplar genome.

Abstract

Despite remarkable advances in sequencing technologies and automated genome assembly algorithms, manual gap‐filling remains indispensable for achieving telomere‐to‐telomere (T2T) genome assemblies, a process that can take weeks or even months. Additionally, these tasks require advanced bioinformatics expertise, thereby excluding many biologists from direct participation in T2T genome projects. This severely restricts the ability to construct T2T genomes for larger populations and a wider range of species. To overcome these challenges, we developed GapSuite, an integrated auxiliary software toolbox that includes two complementary tools, Gap‐Aid and Gap‐Graph, which facilitate gap‐filling through sequence‐extension‐based and assembly‐graph‐based strategies, respectively. The two tools empower users with limited computational expertise to efficiently perform gap closure on personal…

Funding

- National Key Research and Development Program of China (10.13039/501100012166)
- National Natural Science Foundation of China (10.13039/501100001809)
- Agricultural Science and Technology Innovation Program (10.13039/501100012421)
- Youth Innovation Program of the Chinese Academy of Agricultural Sciences
- Innovation Program of Chinese Academy of Agricultural Sciences
- Project of State Key Laboratory of Tropical Crop Breeding
- Science and Technology Project of the Ministry of Agriculture and Rural Affairs, P.R. China
- Basic Research Programs of Shanxi Province

Keywords

auxiliary software; gap filling; telomere‐to‐telomere genome assembly; time efficiency

Taxonomy

Topics: Chromosomal and Genetic Variations · Telomeres, Telomerase, and Senescence · Genomics and Chromatin Dynamics

Full text

1 Introduction

In recent years, advances in long‐read sequencing technologies have made it feasible to assemble telomere‐to‐telomere (T2T) genomes [1]. This progress has not only enhanced structural and functional studies of complex genomic regions but also reduced errors such as false‐positive variant detection [2] and data contamination [3] that often arise from incomplete reference genomes used in bioinformatic analyses [4]. However, because most automatic genome assembly pipelines still fail to produce complete chromosomal sequences and automatic gap‐filling tools are ineffective on complex genomic regions [5, 6, 7, 8], almost all of these projects involve a time‐consuming process of manual gap‐filling to recover missing sequences from complex genomic regions enriched with tandem and interspersed repeats. In typical T2T workflows, automatic assembly can be completed within a few days, whereas manual gap‐filling may require several weeks or even months. Moreover, manual gap‐filling demands advanced bioinformatics proficiency, including programming skills, command‐line operation in Linux, and extensive experience in genome assembly. These technical requirements restrict the involvement of many biologists who lack specialized computational expertise, thereby limiting broader participation in T2T projects. Overall, the heavy workload and technical challenges of manual gap‐filling constitute a major barrier to achieving T2T assemblies at the population scale (e.g., hundreds of individuals [9]) and across a broader range of species. This limitation largely explains why, despite the successful construction of numerous T2T‐level assemblies for individual genomes, T2T‐level pangenomes have yet to be realized [10, 11]. Looking ahead, it remains unrealistic to expect a large number of genome assembly experts to dedicate extensive time and effort to the manual curation required for T2T pangenome projects, except for a few model organisms (e.g., human, mouse, and yeast).
The lack of T2T pangenomes has severely hindered population‐scale studies of sequence composition, structural organization, variation, and evolution in complex genomic regions. More importantly, non‐T2T pangenomes that miss these complex regions may lead to errors in analyses such as genome‐wide association study (GWAS), resulting in the loss of key functional genes and regulatory elements—whether in complex or non‐complex regions—that are associated with important phenotypes. This limitation poses a significant barrier to progress in both biomedical research and agricultural breeding.

Therefore, easy‐to‐use auxiliary software is needed to improve the time efficiency and lower the technical barrier of manual gap‐filling. To our knowledge, no dedicated auxiliary software exists for this purpose, though some general visualization tools can help reduce the difficulty of manual gap‐filling. For example, Bandage can be used to visualize the assembly graph, helping users identify candidate sequences in gap regions [12]. Similarly, RAviz, developed by our team, provides a visualization platform for inspecting the alignment between long reads and repetitive genomic regions, thereby assisting users in selecting correctly aligned reads for subsequent gap extension [13]. Nevertheless, using these tools for manual gap‐filling remains a time‐consuming process that requires bioinformatics skills. For example, after selecting a correctly aligned read to extend the gap‐adjacent region based on visualizations from RAviz or other tools (e.g. IGV [14]), the reads must be realigned to the newly extended sequence on the server and then transferred back to the local machine for visualization in the next iteration. This workflow is not practical for large gaps that require tens of iterations. Moreover, identifying precise gap regions in the assembly graph displayed by Bandage can be challenging without cross‐referencing additional datasets, which must first be processed and analyzed separately. This multi‐step workflow not only increases the technical complexity but also significantly limits accessibility for users without advanced bioinformatics training.

To address this limitation, we developed GapSuite, an auxiliary software toolbox consisting of two tools, Gap‐Aid and Gap‐Graph, which assist users through the entire manual gap‐filling process using sequence‐extension‐based and assembly‐graph‐based strategies, respectively. With these tools, users without prior bioinformatics expertise can efficiently complete manual gap‐filling in T2T genome projects on personal computers with only a few mouse clicks, ultimately generating a complete, gap‐free assembly. Gap‐Aid visualizes the alignment of long reads to genomic regions flanking gaps and simultaneously indicates alignment reliability, enabling users to perform the gap‐filling process by progressively selecting correctly aligned reads for sequence extension. Gap‐Graph, on the other hand, visualizes the assembly graph, in which chromosomal scaffolds generated by automated pipelines are represented as interconnected paths containing unresolved gaps. Through this interface, users can easily identify and select nodes (sequences) corresponding to gap regions for targeted filling. The two tools incorporate a series of technical innovations to achieve key functions and enhance both time and space efficiency. To the best of our knowledge, no comparable software tool is currently available. The performance of the individual techniques, as well as the overall performance of the tools, was validated on real Arabidopsis thaliana, rice, and human genomes as well as on simulated diploid and polyploid genomes. A complete T2T genome of rice 9311, the model variety of indica rice, was assembled by filling the gaps using these tools, marking, to the best of our knowledge, the first true T2T genome of rice 9311. Compared to the recently published gapless version, T2T‐9311 improves the genome size, BUSCO [15] completeness, and QV score from 393 Mb, 98.3%, and 31.55 to 401.74 Mb, 99.6%, and 50.5, respectively. In addition, Gap‐Aid was used to fill some of the remaining gaps in a recently published gapless poplar genome.

2 Results

2.1 Gap‐Aid Software

Gap‐Aid comprises two primary components: a server‐based data preprocessing module and a visualization‐based path construction module. In the preprocessing module, all time‐consuming genome‐scale mapping tasks are executed automatically on servers through customized scripts (Figure 1A). During this process, Gap‐Aid simultaneously calculates alignment confidence scores for each alignment and filters out reads mapped to non‐gap regions (Figure 1B). To enhance the efficiency of downstream visualization, index files are also generated at this stage. In the visualization and path‐construction module—which can be executed on any operating system—Gap‐Aid first displays the gap region along with its flanking sequences (referred to as “shore regions”). Reads aligned to shores are visualized, marking the starting point for path construction (Figure 1C). By examining detailed alignment views together with alignment confidence scores, users can iteratively select reliable reads from one shore to progressively extend sequences across the gap (Figure 1D). After each iteration, the visualization is automatically updated to display new flanking shores, allowing users to continue the gap‐filling process until both sides are connected (Figure 1F). During each iteration of the alignment selection process, whenever the user clicks on an alignment region, Gap‐Aid highlights the selected alignment to make identification and selection easier (Figure 1E). To assist users with limited knowledge of genome assembly and sequence alignment, Gap‐Aid visualizes a set of criteria and a unified score for each read to assess the reliability of its alignments to the shores, and generates a recommended set of reads for the users. Despite high reliability of the recommendations, the users are still allowed to select any reads for extension. 
Whenever users find that sequence extension cannot continue because false‐positive reads were selected in earlier steps, they can perform step‐by‐step ‘undo’ operations to return to the first incorrect selection and choose again. To make filling large gaps more convenient, Gap‐Aid allows users to stop midway, save the filling status, and resume later by reloading the saved status. Once the constructed path fully spans the gap, the software automatically issues a notification. At this point, users can trigger Gap‐Aid to complete trimming and stitching of the assembled sequence, thereby finalizing gap closure (Figure 1G). In addition, Gap‐Aid provides an automatic mode that predicts multiple candidate gap‐filling solutions, each composed of overlapping reads. While most solutions may contain a few erroneous reads, it remains possible that one or more represent a fully correct assembly path, offering an optional shortcut for rapid gap closure.
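
The undo, save, and reload behaviour described above can be pictured as a stack of selected reads. The following is a minimal sketch and not Gap‐Aid's actual implementation; the class name `ExtensionPath`, its methods, and the JSON layout are all hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ExtensionPath:
    """Hypothetical record of the reads chosen so far for one gap."""
    gap_id: str
    reads: list = field(default_factory=list)  # read IDs in selection order

    def select(self, read_id):
        # Each user selection extends the path by one read.
        self.reads.append(read_id)

    def undo(self, steps=1):
        # Step-by-step undo: drop the most recent selections.
        if steps > 0:
            del self.reads[-steps:]

    def dumps(self):
        # Serialize the filling status so that work can be resumed later.
        return json.dumps({"gap_id": self.gap_id, "reads": self.reads})

    @classmethod
    def loads(cls, text):
        # Rebuild the path from a previously saved status.
        data = json.loads(text)
        return cls(gap_id=data["gap_id"], reads=list(data["reads"]))
```

Keeping the selections in order means that undoing back to the first wrong choice is a simple truncation, and the saved status only needs the read identifiers rather than any sequence data.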

Workflow of Gap‐Aid. (A) Server‐side preprocessing. Genome‐scale mapping tasks are automated using customized scripts. (B) Alignment scoring and filtering. Confidence scores are calculated, non‐gap reads are removed, and index files are generated for fast visualization. (C) Gap visualization. The gap and its flanking “shore” sequences are displayed, with shore‐aligned reads serving as starting anchors. (D) Sequence extension across gaps. Users iteratively extend sequences by selecting reliable alignments guided by confidence scores. (E) Interactive alignment selection. Selected alignments are highlighted in real time to facilitate user inspection. (F) Progressive gap filling. Reads and alignments are dynamically updated to redefine shores until both sides are connected; undo, save, and reload functions enable flexible editing. (G) Gap closure and sequence stitching. Upon completion of the extension path, Gap‐Aid trims and merges sequences to finalize gap closure. An optional automatic mode suggests candidate paths from overlapping reads.

To implement the above functions, we developed a set of novel strategies to address several key technical challenges. First, due to the sequential dependencies among gap‐filling steps, it is non‐trivial to complete all time‐consuming alignments in a single batch on the server side. For example, sequence extension requires new alignments between reads and the extended shore regions generated in the previous iteration. Because the sequences of these new shores are unknown beforehand, such alignments cannot be precomputed and must instead be performed iteratively, which is highly time‐consuming. To address this issue, Gap‐Aid adopts a strategy that substitutes the alignments of reads to newly extended shores with pairwise alignments between reads, thereby transforming the repetitive alignment tasks into a one‐time alignment process. Second, to assess the quality and reliability of alignments between candidate reads and the shore regions, and to recommend the most reliable reads to users, we developed a k‐mer–based conflict evaluation method. This approach quantifies the degree of conflict among fragmented alignments and derives five individual scores to represent alignment quality from different aspects. A unified reliability score is then calculated as a weighted average of these five metrics, where the weights are optimized using a linear regression model trained on A. thaliana data. Third, in ‘automatic’ mode, to automatically generate candidate read sequences for gap filling, Gap‐Aid constructs an overlap graph (with vertices representing reads and shores, and edges between overlapping reads) and employs a breadth‐first search (BFS) algorithm to find paths between the two vertices representing shores, with each path corresponding to a candidate read sequence. To address the issue of large gaps with numerous vertices and edges, where BFS cannot find all paths within a limited time, we modified the traditional BFS algorithm by incorporating heuristic strategies. 
Edges corresponding to more reliable overlaps are given higher visitation probabilities, allowing more reliable paths to be identified earlier during the execution of the BFS algorithm. Fourth, because gaps frequently occur within repetitive genomic regions, accurate visualization and interpretation of repeat alignments are essential. In such regions, alignments often appear fragmented, and identical sequences may align to multiple locations due to high sequence similarity. To remove these spurious alignments, we implemented an algorithm that identifies an optimal subset of non‐conflicting alignments and excludes the others, thereby improving visualization accuracy and alignment confidence.
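
To illustrate the idea of surfacing reliable paths early in the overlap graph, the sketch below swaps the probabilistic BFS described above for a simpler deterministic best‐first search. It is an illustration under assumptions, not the authors' algorithm: the function name `candidate_paths`, the dictionary layout, and the reliability values in (0, 1] are hypothetical.

```python
import heapq


def candidate_paths(overlaps, source, target, max_paths=3):
    """Return up to max_paths simple paths from source to target.

    overlaps maps a vertex (read or shore) to a list of
    (neighbor, reliability) pairs with reliability in (0, 1].
    Partial paths are expanded in order of the product of their edge
    reliabilities, so more reliable paths are reported first.
    """
    # Heap entries: (negated path reliability, tie-breaker, path so far).
    heap = [(-1.0, 0, [source])]
    counter = 1
    found = []
    while heap and len(found) < max_paths:
        neg_score, _, path = heapq.heappop(heap)
        last = path[-1]
        if last == target:
            found.append((path, -neg_score))
            continue
        for neighbor, reliability in overlaps.get(last, []):
            if neighbor in path:  # keep paths simple: no read is used twice
                continue
            heapq.heappush(
                heap, (neg_score * reliability, counter, path + [neighbor])
            )
            counter += 1
    return found
```

Capping the number of reported paths (`max_paths`) stands in for the time limit mentioned in the text: on large gaps the search stops after the most reliable candidates rather than enumerating every path.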

2.2 Gap‐Graph Software

Gap‐Graph operates on an assembly graph provided by the user in GFA format, where nodes represent sequences (e.g., unitigs and contigs) and edges indicate the connections between these sequences, along with the chromosome‐level assembly generated by the automated pipeline and the alignments between the sequences in the graph and the chromosomes (Figure 2A). The assembly graph is visualized in a layout similar to existing tools such as Bandage, with nodes annotated by basic information including sequence length and orientation. Users can adjust the view through zooming and panning to examine specific regions of interest. Each chromosome is represented as a path in the graph, with gaps highlighted in different colors, allowing users to easily identify sequences (nodes) potentially associated with each gap and select a path of nodes to fill. Once the selection is confirmed, the software automatically fills the gap in the FASTA file using the sequences along the path. In many cases, multiple feasible paths of nodes may exist for gap filling based on the structure of the assembly graph, but only one is correct. To help users identify the most reliable path, Gap‐Graph allows the incorporation of additional supporting data such as Oxford Nanopore Technologies ultra‐long (ONT UL) reads and Hi‐C contact information. These datasets are co‐visualized with the assembly graph and chromosome paths to provide additional structural evidence. For each edge, the number of supporting links (e.g., long reads or Hi‐C read pairs) connecting the two related nodes (sequences) is used as the edge weight, with the edge color darkening in proportion to the weight. This allows users to select the most reliable (darkest) path for gap filling. 
For diploid and polyploid genomes, Gap‐Graph offers a feature to visualize the signal intensity (e.g., ONT UL reads or Hi‐C data) from a user‐selected node to other related nodes, facilitating haplotype‐resolved gap filling and enabling node phasing (Figure 2B).
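
The edge weighting described above (count the supporting links between two nodes, then darken the edge in proportion) can be sketched as follows. This is an assumption‐laden illustration: the function names, the input format, and the linear mapping to a 0 to 255 grey level are hypothetical and are not taken from Gap‐Graph.

```python
from collections import Counter


def edge_weights(links):
    """Count supporting links per unordered node pair.

    links is an iterable of (node_a, node_b) pairs, one per long read or
    Hi-C read pair that connects the two nodes.
    """
    weights = Counter()
    for a, b in links:
        if a != b:  # ignore links that stay within a single node
            weights[frozenset((a, b))] += 1
    return weights


def edge_shade(weight, max_weight):
    """Map a weight to a grey level from 255 (lightest) to 0 (darkest)."""
    if max_weight <= 0:
        return 255
    fraction = min(weight, max_weight) / max_weight
    return round(255 * (1 - fraction))
```

With such a mapping, the best‐supported edge in the current view is drawn darkest, and an edge with no supporting links stays at the lightest shade.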

Workflow of Gap‐Graph. (A) Standard workflow for gap filling. (B) Workflow for haplotype‐resolved (phased) gap filling.

The greatest challenge in implementing these functions lies in accurately aligning the chromosome‐level assembly to the assembly graph. Although there are graph‐alignment tools such as GraphAligner [16] and Variation graph toolkit [17], these are primarily designed for aligning short comparative sequences (e.g., reads) rather than whole chromosomes, and they can only align a limited proportion of chromosomal sequences to the assembly graph accurately. To overcome this limitation, Gap‐Graph employs an optimization algorithm to determine the most probable path for each chromosome, considering both the graph structure and the alignment quality of the sequences in the graph relative to the chromosome. The algorithm first uses a greedy strategy to obtain an initial sequence of nodes from the start to the end of the chromosome. Then, a heuristic strategy is applied to randomly adjust the nodes (sequences) until no further improvements can be made to the objective function. The objective function integrates multiple criteria to guide optimization: it maximizes path contiguity within the graph, maximizes the cumulative alignment score of the selected nodes, and minimizes redundant overlaps among sequence alignments on the chromosome. Through this combined strategy, Gap‐Graph achieves robust and efficient alignment between large‐scale chromosome assemblies and complex assembly graphs, providing a solid foundation for accurate gap filling.
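
The two‐stage optimization described above (a greedy initial path followed by random adjustments that are kept only when the objective improves) might look like the sketch below. Everything here is hypothetical: the objective weights `w_link` and `w_overlap`, the `(start, end, score)` alignment tuples, and the fixed number of random tries (used in place of running until no improvement is possible) are illustrative choices, not Gap‐Graph's actual formulation.

```python
import random


def objective(path, aln, edges, w_link=100.0, w_overlap=1.0):
    """Score a node set: alignment score + graph contiguity - overlap."""
    ordered = sorted(path, key=lambda n: aln[n][0])
    total = sum(aln[n][2] for n in ordered)
    for a, b in zip(ordered, ordered[1:]):
        if (a, b) in edges or (b, a) in edges:
            total += w_link  # reward consecutive nodes joined in the graph
        overlap = aln[a][1] - aln[b][0]
        if overlap > 0:
            total -= w_overlap * overlap  # penalize redundant coverage
    return total


def greedy_start(aln):
    """Pick non-overlapping nodes in order of decreasing alignment score."""
    chosen = []
    for node in sorted(aln, key=lambda n: -aln[n][2]):
        start, end, _ = aln[node]
        if all(end <= aln[c][0] or start >= aln[c][1] for c in chosen):
            chosen.append(node)
    return chosen


def refine(aln, edges, tries=200, seed=0):
    """Randomly toggle nodes, keeping only changes that raise the objective."""
    rng = random.Random(seed)
    nodes = sorted(aln)
    path = greedy_start(aln)
    best = objective(path, aln, edges)
    for _ in range(tries):
        node = rng.choice(nodes)
        if node in path:
            trial = [n for n in path if n != node]
        else:
            trial = path + [node]
        score = objective(trial, aln, edges)
        if score > best:
            path, best = trial, score
    return sorted(path, key=lambda n: aln[n][0]), best
```

Because only improving moves are accepted, the refined score can never fall below the greedy starting score; the random toggles let the search escape a greedy choice (such as one long, high‐scoring node) that blocks a better‐connected chain of shorter nodes.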

2.3 Validation of Key Technologies in Gap‐Aid and Gap‐Graph

To evaluate the effectiveness of the core algorithms implemented in Gap‐Aid, we conducted a controlled validation experiment using the Arabidopsis thaliana telomere‐to‐telomere (T2T) genome [18]. A 100‐kilobase region on chromosome 1 (positions 15.0–15.1 Mb), which contains centromeric satellite repeats, was deliberately removed to create an artificial gap with a well‐defined reference‐based ground truth. The artificially gapped genome was then subjected to manual gap filling using Gap‐Aid. As the first step, we assessed the strategy designed to filter out reads originating from non‐gap regions. A total of 1 169 243 High‐Fidelity sequencing (HiFi) reads were aligned to the genome (with the gap), and reads with at least one high‐quality alignment (MAPQ > 10 and alignment length > 60% of read length) to a non‐gap region were filtered out, leaving only 144 120 reads (12.33%). These 144 120 reads include 99.55% (1110 out of 1115) of the ‘ground truth’ gap‐region reads (see ‘Methods’ for ground truth generation), demonstrating that the filtering strategy effectively removes the majority of non‐gap‐region reads while retaining almost all of the gap‐region reads. We emphasize that the low precision (1110 / 144 120) and high sensitivity (1110 / 1115) are a deliberate design choice of the filtering strategy rather than a drawback: for the subsequent gap‐filling task, retaining all reads from gap regions is more important than filtering out non‐gap reads more strictly.
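
The filtering rule and the reported figures can be checked with a few lines of arithmetic. The sketch below is illustrative only: `is_non_gap_read` is a hypothetical helper that encodes the two thresholds quoted in the text and assumes that the alignments passed in are those to non‐gap regions; the counts are the ones reported above.

```python
def is_non_gap_read(alignments, read_length, min_mapq=10, min_fraction=0.6):
    """True if any alignment passes the thresholds quoted in the text.

    alignments is a list of (mapq, aligned_length) pairs for one read
    against the non-gap portion of the genome.
    """
    return any(
        mapq > min_mapq and aligned_length > min_fraction * read_length
        for mapq, aligned_length in alignments
    )


# Counts reported in the text.
total_reads = 1_169_243
retained = 144_120
true_gap_reads = 1_115
true_gap_retained = 1_110

retained_fraction = retained / total_reads        # about 0.1233
sensitivity = true_gap_retained / true_gap_reads  # about 0.9955
precision = true_gap_retained / retained          # about 0.0077
```

The precision below 1% makes the trade‐off concrete: fewer than one in a hundred retained reads truly comes from the gap, yet only five of the 1115 ground‐truth gap reads are lost.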

Second, the effectiveness of the k‐mer‐based alignment‐quality scores was evaluated. The results show that the correlations of the five independent scores (CS, PLA, PNA, TLNA, and MBNNA; see ‘Methods’ for definitions) in distinguishing correct from incorrect alignments were 0.074, 0.1, 0.1, 0.15, and 0.31, respectively, while the correlation for the unified score was 0.33. Third, the gap was manually filled by iteratively selecting 20 reads with high‐quality alignments visualized for extension (Figure S1). QUAST [19] evaluation and synteny analysis demonstrated that assemblies generated by Gap‐Aid were nearly indistinguishable from the reference benchmark, achieving a genome fraction of 99.999% and an indel rate of only 0.02 per 100 kb. These results highlight the high accuracy of the Gap‐Aid–assisted assemblies (Figure 3A). To demonstrate that existing automatic gap‐filling algorithms cannot resolve this gap, we applied TGS‐GapCloser [20], LR‐GapCloser [21], FGAP [22], and RFfiller [23] with the same HiFi reads. None of these tools successfully filled the gap, inserting either no sequence or only a minimal fragment (Table S1). Finally, ONT UL reads were used to fill the same artificial gaps in order to evaluate whether the performance of Gap‐Aid depends on the sequencing data type. The resulting assembly remained highly consistent with the reference (Figure S2), suggesting that the efficiency of Gap‐Aid is largely independent of the sequencing platform.
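
For readers unfamiliar with how a unified score can combine the five metrics evaluated in this subsection, the sketch below shows a plain weighted average. The weights and metric values are invented for illustration; the actual weights were fitted by the authors with a linear regression model and are not reproduced here.

```python
def unified_score(metrics, weights):
    """Weighted average of per-alignment metrics keyed by metric name."""
    total_weight = sum(weights[name] for name in metrics)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted = sum(weights[name] * value for name, value in metrics.items())
    return weighted / total_weight


# Hypothetical weights and metric values, for illustration only.
example_weights = {"CS": 0.1, "PLA": 0.1, "PNA": 0.1, "TLNA": 0.2, "MBNNA": 0.5}
read_a = {"CS": 0.9, "PLA": 0.8, "PNA": 0.9, "TLNA": 0.7, "MBNNA": 0.95}
read_b = {"CS": 0.9, "PLA": 0.8, "PNA": 0.9, "TLNA": 0.7, "MBNNA": 0.20}

score_a = unified_score(read_a, example_weights)
score_b = unified_score(read_b, example_weights)
```

In this toy example the two reads differ only in the most heavily weighted metric, so `score_a` exceeds `score_b` and read A would be recommended first.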

Validation of key technologies. (A) Synteny comparison between the assembly of synthetic gaps in the A. thaliana genome generated by Gap‐Aid and the ground truth. (B) QUAST evaluation comparing original chromosomes and Gap‐Graph–reconstructed sequences in haploid and diploid rice genomes. (C) Synteny comparison between the original chromosomes and Gap‐Graph–reconstructed sequences in the diploid rice genome.

The graph‐alignment algorithm in Gap‐Graph was validated on the rice (Oryza sativa) 9311 genome and a synthetic diploid rice genome. The chromosome‐level assemblies (see ‘Validation and application on rice genomes’ for details) were aligned to the unitig‐level assembly graphs using the graph‐alignment algorithm, and the sequences in the aligned paths were compared with the original chromosomes to assess the correctness of the alignments. The QUAST evaluation and synteny analysis demonstrated that Gap‐Graph achieved highly accurate assemblies in both haploid and diploid genomes, with the reconstructed paths showing strong concordance with the original chromosomes (Figure 3C; Figure S3), and containing only a minimal number of misassemblies and indels (Figure 3B). These results demonstrate the robustness and precision of the Gap‐Graph alignment framework in reconstructing chromosome‐level genome continuity.

Going one step further, we reassembled the A. thaliana genome using HiFi and ONT UL read‐based contig‐assembly with Verkko [24] and reference‐based scaffolding with RagTag [25]. This process produced a chromosome‐level assembly containing 12 gaps, all of which were located on chromosome 2. Assisted by Gap‐Aid and Gap‐Graph, we successfully filled four and eight gaps, respectively. Synteny analysis revealed that the resulting assemblies were highly consistent with the reference A. thaliana T2T genome, except for the missing rDNA regions (Figure S4). Collectively, these findings demonstrate that both Gap‐Aid and Gap‐Graph can effectively resolve assembly gaps produced by widely used genome assembly pipelines.

2.4 Gap‐Aid and Gap‐Graph Enable T2T Assembly of Real Rice Genome (T2T‐9311) and Simulated Diploid Genome

To demonstrate the effectiveness of Gap‐Aid and Gap‐Graph in a real T2T project, we constructed a rice 9311 T2T genome by applying these tools to fill the gaps in the chromosome‐level assembly generated by a state‐of‐the‐art automatic assembly pipeline. As part of this study, we generated 89.7× coverage of HiFi reads (N50 of 18,919 bp), 181.3× coverage of ONT UL reads (N50 of 74,247 bp), and 108.5× coverage of Hi‐C reads for the 9311 genome. The chromosome‐level assembly was produced by combining HiFi‐ and ONT UL‐based contig assembly, Hi‐C‐based scaffolding, and Hi‐C heatmap‐assisted manual curation. This assembly contains two gaps and two missing telomeres. For the gap on chromosome 4, Gap‐Aid was used to identify a HiFi read capable of spanning the gap region, which was then employed to fill the gap (Figure 4B). The gap on chromosome 6 was identified as a false gap and removed, as Gap‐Graph revealed that the adjacent regions at both ends of the gap correspond to two neighboring nodes in the assembly graph (Figure 4C). The two missing telomeres were also filled using Gap‐Aid (Figure 4A, red boxes).
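
The read statistics quoted above rely on two standard definitions: N50 is the length at which reads of that length or longer contain at least half of all sequenced bases, and fold coverage is total sequenced bases divided by genome size. A minimal sketch of both follows; the function names are illustrative and the example numbers are not from the study.

```python
def n50(lengths):
    """Length L such that reads of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must not be empty")


def fold_coverage(total_bases, genome_size):
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_size
```

For example, reads of lengths 2, 3, 4, 5, and 6 total 20 bases; the two longest reads (6 and 5) already cover 11 bases, so the N50 is 5.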

Results on rice genomes. (A) Synteny comparison between pre‐9311 and T2T‐9311 assemblies. (B) Example of gap filling (chromosome 4 of the 9311 genome) using Gap‐Aid. (C) Example of gap filling (chromosome 6 of the 9311 genome) using Gap‐Graph. The assembly graph visualization shows the aligned chromosome in blue and the gap represented by two red nodes. (D) Quality comparison between pre‐9311 and T2T‐9311 assemblies.

After polishing with HiFi and Illumina reads, we obtained a 9311 T2T genome (named T2T‐9311) with a total length of 401.74 Mb, BUSCO completeness of 99.6%, and a QV score of 50.5, demonstrating higher quality than the recently published version (pre‐9311) [26] (Figure 4D). Synteny analysis was performed between the two versions (Figure 4A). Compared to pre‐9311, T2T‐9311 shows substantial improvements in sequence completeness on chromosomes 4, 8, 9, 10, and 12. Alignments with ONT UL reads further confirmed the sequence accuracy of T2T‐9311 in the regions that diverge from pre‐9311 (Figure S5). Finally, gene annotation was performed by lifting the gene coordinates from pre‐9311 to T2T‐9311.
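
To put the QV values in perspective, the sketch below converts them to per‐base error rates, assuming that QV here is the usual Phred‐scaled consensus quality, QV = −10·log10(error rate). That assumption is ours; the section above does not state how QV was computed.

```python
def qv_to_error_rate(qv):
    """Per-base error probability for a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def errors_per_megabase(qv):
    """Expected number of erroneous bases per 1,000,000 bases."""
    return 1_000_000 * qv_to_error_rate(qv)


pre_9311 = errors_per_megabase(31.55)  # roughly 700 errors per Mb
t2t_9311 = errors_per_megabase(50.5)   # roughly 9 errors per Mb
```

Under this reading, the gain from QV 31.55 to QV 50.5 corresponds to a reduction of nearly two orders of magnitude in the expected error rate.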

Furthermore, we tested the effectiveness of these two tools on haplotype‐resolved T2T assembly. We started by creating a synthetic diploid rice genome using 9311 and Nipponbare as the maternal and paternal haplotypes, respectively. The HiFi, ONT UL, and Hi‐C reads, as well as the T2T assembly of the Nipponbare genome, were all generated in our previous study [27]. The read sets for the synthetic diploid genome were generated by merging the read sets for 9311 (Figure S6A) and Nipponbare (Figure S6B). A haplotype‐resolved chromosome‐scale assembly was then generated using contig assembly with hifiasm [28], scaffolding with 3D‐DNA [29], and manual refinement guided by Hi‐C heatmaps. This assembly contained three gaps (two on the maternal and one on the paternal haplotype) and three missing telomeres (one on the maternal and two on the paternal haplotype). After gap filling using Gap‐Graph and telomere completion using Gap‐Aid, we obtained a haplotype‐resolved assembly with a total size of 787.7 Mb. QUAST evaluation shows that the assembly covers 99.45% of the ground truth sequence (the haploid 9311 and Nipponbare genomes) with only 49 misassemblies across the entire genome. The synteny analysis also demonstrated a high level of consistency between the assembly and the ground truth, with the exception of a notable translocation and a duplication on chromosome 9. To investigate these discrepancies, we aligned the ground truth sequences against themselves and detected the same structural variations, indicating that the inconsistencies were caused by alignment ambiguities within complex genomic regions rather than by assembly errors (Figure S6C).

2.5 Gap‐Aid and Gap‐Graph Resolve Challenging Gaps in Complex and Polyploid Genomes

Furthermore, we assessed the performance and practical applicability of Gap‐Aid and Gap‐Graph on larger and more complex genomes. First, we applied Gap‐Graph to the centromeric regions of the human (Homo sapiens) genome (HG002) assembled using automated algorithms, aiming to close large gaps in the satellite DNA sequences of haplotype‐resolved chromosome 10 and X assemblies. For each gap, the relevant reads were recalled with guidance from the corresponding regions of the haploid human T2T reference genome (CHM13 [30]), and a local unitig‐level assembly graph was constructed using hifiasm. By visualizing the assembly graphs with Gap‐Graph, we successfully identified two candidate paths for chromosome 10 (Figure S7A,B) and one for chromosome X (Figure S7C), which were subsequently used to fill the gaps. Both synteny analysis (Figure 5A) and QUAST evaluation (Figure 5B) confirmed that the reconstructed sequences were highly consistent with the ground truth.

Results on human and simulated triploid genomes. (A) Synteny comparison between original chromosomes and Gap‐Graph–reconstructed sequences in the human genome. (B) QUAST evaluation comparing original chromosomes and Gap‐Graph–reconstructed sequences in the human genome. (C) QUAST evaluation comparing the assembly of synthetic gaps in the simulated triploid tomato genome generated by Gap‐Aid with the ground truth. (D) k‐mer–based completeness assessment comparing original chromosomes and Gap‐Graph–reconstructed sequences aligned to the simulated triploid tomato genome. (E) Synteny comparison between the assembly of synthetic gaps in the simulated triploid tomato genome generated by Gap‐Aid and the ground truth.

Second, we tested Gap‐Aid on a simulated triploid genome generated by combining reference genomes of three tomato (Solanum lycopersicum) varieties (TS2, TS281, and Heinz1706) [31], together with their corresponding sequencing reads. To introduce a large artificial gap, a 100‐kb sequence was removed from the subtelomeric tandem repeat region of the TS2 haplotype on chromosome 1. With the aid of Gap‐Aid, a set of overlapping HiFi reads was selected to fill the gap, and the software automatically produced the gap‐filled assembly (Figure S8). Independent evaluations using QUAST (Figure 5C), k‐mer analysis (Figure 5D), and synteny (Figure 5E) confirmed that the reconstructed sequence closely matched the ground truth.

Finally, as a case study, we applied Gap‐Aid to resolve the remaining gaps in a recently published haplotype‐resolved gapless poplar (Populus euphratica) genome [32]. It is reasonable to assume that these gaps could not be resolved by existing automated gap‐filling tools, and the original authors were unable to manually fill them. With Gap‐Aid, three gaps were successfully filled using overlapping HiFi reads, generating reconstructed sequences of 30 265, 15 443, and 22 090 bp in length. The HiFi reads were then mapped back to the gap regions and their flanking sequences. For all three gaps, the coverage in the gap regions was highly consistent with that of the flanking regions and comparable to the genome‐wide average coverage, supporting the correctness of the gap‐filling (Figure S9A–C). Furthermore, we performed preliminary annotation and analysis of the filled sequences. We found that the sequence filling haplotype 1 of chromosome 13 contains a mitochondria‐related gene, LOC105132550, whereas the filled sequence on haplotype 1 of chromosome 17 mainly consists of tandem repeats with a monomer length of 505 bp.

Discussion

3

Genome assembly remains one of the most complex and technically demanding challenges in genomics and bioinformatics. Historically, extensive manual curation was required to refine assemblies generated by automated algorithms in order to produce satisfactory assemblies. In recent years, with improvements in sequencing data quality and read length, the quality of assemblies generated by automated algorithms has significantly improved. Nevertheless, for highly complex genomes or structurally intricate regions within otherwise well‐assembled genomes, labor‐intensive and technically demanding manual curation is still required. To date, there are very few software tools available to assist manual curation, which means that this process can only be carried out by experienced bioinformaticians and remains highly time‐consuming. The most widely used tool is Juicebox [33], which visualizes Hi‐C heatmaps to assist users in manually adjusting the order and orientation of contigs, thereby improving scaffolding accuracy. However, its capabilities are mainly confined to structural correction, and it does not provide strategies for resolving sequence gaps or guiding read extension.

Automated gap‐filling algorithms perform well in ordinary genomic regions but face intrinsic limits in highly repetitive or structurally complex loci such as centromeres, rDNA arrays, and segmental duplications. The abundance of near‐identical repeats prevents unique read anchoring, generating ambiguous graph connections that most pipelines terminate conservatively to avoid misassemblies. These challenges intensify in polyploid or heterozygous genomes, where homologous regions further obscure overlap evidence [7]. Consequently, researchers often rely on integrated multi‐technology strategies to resolve gap regions, which substantially increase costs and hinder progress toward true T2T genome assemblies [8].

To address these limitations, we developed Gap‐Aid and Gap‐Graph, two gap‐filling tools that can utilize different types of sequencing data (e.g., HiFi, ONT UL, Hi‐C) for semi‐automatic, user‐guided gap resolution. Gap‐Aid visualizes and extends alignments through confidence‐based selection, whereas Gap‐Graph enables path reconstruction within assembly graphs, combining human pattern recognition with computational inference. Together, these tools markedly reduce both the technical difficulty and time required to achieve near‐complete or fully T2T assemblies across haploid, diploid, and partially polyploid genomes. Despite these advantages, both tools remain constrained by the intrinsic limitations of current sequencing and assembly strategies. In particular, extremely homogeneous and tandemly amplified repeats such as rDNA arrays still pose major challenges, as even ultra‐long reads provide insufficient unique anchors for confident gap resolution in these loci. Consequently, Gap‐Aid and Gap‐Graph, like most existing approaches, are not yet capable of fully resolving such regions.

Future development of our framework will focus on incorporating variant‐level visualization into Gap‐Aid and implementing automated path recommendation in Gap‐Graph, thereby improving the efficiency of gap closure in large or structurally complex genomic regions. Beyond gap‐filling, similar interactive frameworks could support manual curation in low‐quality polyploid or metagenomic assemblies, where conventional pipelines produce misassembled or fragmented contigs. Extending this human‐in‐the‐loop paradigm to population‐scale and metagenomic projects may ultimately bridge the gap between automation and expert biological reasoning in complete genome assembly.

It is important to emphasize that Gap‐Aid and Gap‐Graph are designed as auxiliary tools specifically intended to support manual gap filling rather than replace existing automated algorithms. Their primary purpose is to reduce workload, improve efficiency, and lower the technical barrier for users, thereby enabling large‐scale T2T assemblies. Although in some cases these tools can help resolve gaps that remain unfilled by existing assembly technologies (as demonstrated in our experiments with the poplar genome), this is not their main design objective. They are designed to assist users through visualization, evidence integration, and interactive decision‐making; the success of gap filling ultimately depends on both the quality of the input data and the user's judgment in selecting correct alignments or graph paths. As sequencing technologies continue to advance, the assembly of individual T2T genomes will likely become a routine process. The next frontier will involve constructing population‐scale and multi‐species T2T assemblies, where Gap‐Aid and Gap‐Graph are expected to play a pivotal role in integrating human expertise with automated inference, paving the way toward comprehensive and interpretable genomic reconstruction at unprecedented scale.

Methods

4

Gap‐Aid: Preprocessing on Server Side

4.1

The input data on the server side include (1) the chromosome‐level assembly generated by an automatic genome assembly pipeline, (2) whole‐genome long reads for gap filling, and (3) the contigs used to generate the chromosome‐level assembly (optional; they provide a more complete set of unique k‐mers than the chromosomes alone). The process begins by generating two alignment files using either minimap2 [34] or winnowmap2 [35], as chosen by the user. One file contains alignments between the reads and chromosomes, while the other contains pairwise alignments between reads. Next, the reads from non‐gap regions are filtered out to enhance space and time efficiency. To ensure that reads from gap regions are not removed, genomic regions of a specified length (e.g., 500 kb) before and after each gap in the chromosome‐level assembly are first masked as ‘N’s. The reads are then aligned to the masked assembly. Reads with at least one high‐quality alignment (MAPQ > 10 and aligned length > 500 bp) are considered to be from non‐gap regions and are removed. Consequently, the alignments associated with the removed reads are also removed from the pairwise alignment file. Next, the two alignment files are processed to filter out conflicting alignments, as described in Section 4.2. Afterward, a series of scores is calculated to evaluate the quality of alignments, with detailed calculations provided in Section 4.3. For the ‘automatic’ mode, the candidate read sequences are generated for each gap, and reliability scores for these sequences are computed. The process for generating these candidate sequences is detailed in Section 4.4.
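The masking and read‐filtering steps can be illustrated with a minimal sketch. It assumes that the read‐to‐assembly alignments are available as PAF lines (the default minimap2 output), that the "aligned length" corresponds to the PAF alignment block length (column 11), and that gap coordinates are 0‐based and half‐open. The function names `mask_flanks` and `non_gap_reads` are hypothetical and are not part of Gap‐Aid.

```python
def mask_flanks(seq, gaps, flank):
    """Mask each gap and `flank` bases on both sides with 'N'.

    `gaps` is a list of (start, end) pairs, 0-based and half-open (assumed).
    """
    chars = list(seq)
    for start, end in gaps:
        lo = max(0, start - flank)
        hi = min(len(seq), end + flank)
        chars[lo:hi] = "N" * (hi - lo)
    return "".join(chars)


def non_gap_reads(paf_lines, min_mapq=10, min_len=500):
    """Return names of reads with at least one high-quality alignment.

    Thresholds follow the text: MAPQ > 10 and aligned length > 500 bp.
    These reads are treated as non-gap reads and can be discarded.
    """
    reads = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed lines
        aln_len = int(fields[10])  # alignment block length (assumed meaning)
        mapq = int(fields[11])
        if mapq > min_mapq and aln_len > min_len:
            reads.add(fields[0])
    return reads
```

Reads returned by `non_gap_reads` would then be dropped from both the read set and the pairwise alignment file, leaving only reads that plausibly originate from the masked gap neighborhoods.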

Gap‐Aid: Removal of Incorrect Alignments

4.2

To remove incorrect alignments between reads or between reads and shores, an algorithm is used to identify an optimal non‐conflicting subset of alignments. Formally, for a reference (either a shore or a read) $r$ and a query (a read) $q$ aligned to $r$, let $A$ represent the set of fragmented alignments between $r$ and $q$, where each $a \in A$ is an alignment between a fragment of $r$ and a fragment of $q$. The objective of this algorithm is to obtain the largest possible subset $A'$ of $A$ (largest by number of alignments) in which every pair of alignments is non‐conflicting. To define ‘non‐conflicting’ formally, let $a_i$ be an alignment of a fragment $q_i$ of $q$ to a fragment $r_i$ of $r$, and let $a_j$ be an alignment of a fragment $q_j$ of $q$ to a fragment $r_j$ of $r$. Assuming the starting coordinate of $r_i$ is smaller than or equal to that of $r_j$, we consider $a_i$ and $a_j$ to be non‐conflicting if the starting coordinate of $q_i$ is also smaller than or equal to that of $q_j$. Due to the high complexity of this optimization problem, the algorithm aims to obtain a suboptimal solution by solving a well‐known optimization problem called Longest Increasing Subsequence (LIS) [36, 37]. Specifically, the alignments $a_i$ in $A$ are sorted by the starting coordinates of their corresponding fragments $r_i$ on the reference $r$ in increasing order. This sorting allows for the construction of a sequence $S_q$ of the corresponding fragments $q_i$ on the query $q$, maintaining the same order. Next, a sequence $S_c$ of the starting coordinates of $q_i$ is constructed, preserving the order of $S_q$. The algorithm then calculates the longest (largest number of values) increasing subsequence $S_c'$ of $S_c$, where for any pair of values $x$ and $y$, $x \le y$ if $x$ precedes $y$. This step is efficiently performed using a binary search [36]. Finally, the optimal non‐conflicting subset $A'$ of alignments is obtained by collecting all alignments from $A$ whose corresponding starting coordinates of fragments $q_i$ on the query are contained in $S_c'$.
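The LIS‐based selection can be sketched as follows. This is a minimal illustration rather than the Gap‐Aid implementation: each alignment is reduced to a hypothetical `(r_start, q_start)` tuple, ties on the reference coordinate are broken by the query coordinate, and the subsequence is non‐decreasing, matching the "smaller than or equal to" condition above.

```python
from bisect import bisect_right


def non_conflicting_subset(alignments):
    """Largest subset whose query starts are non-decreasing in reference order.

    `alignments` is a list of (r_start, q_start) tuples.
    """
    ordered = sorted(alignments)      # sort by reference start
    tails = []                        # tails[k]: smallest last q_start of a chain of length k + 1
    tail_idx = []                     # index in `ordered` of that chain's last alignment
    prev = [-1] * len(ordered)        # back-pointers for path recovery
    for i, (_, q_start) in enumerate(ordered):
        k = bisect_right(tails, q_start)  # binary search; allows equal values
        if k == len(tails):
            tails.append(q_start)
            tail_idx.append(i)
        else:
            tails[k] = q_start
            tail_idx[k] = i
        prev[i] = tail_idx[k - 1] if k > 0 else -1
    # Walk the back-pointers from the end of the longest chain.
    subset = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:
        subset.append(ordered[i])
        i = prev[i]
    return subset[::-1]
```

With the binary search, the selection runs in O(n log n) time for n fragmented alignments between one reference and one query.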

Gap‐Aid: Reliability Evaluation of Alignments, Candidate Reads, and Read Sequences

4.3

To assist users in selecting reads during the sequence extension process, Gap‐Aid offers five criteria for evaluating the reliability of alignments between each pair of aligned sequences (whether between a read and a shore or between two reads). These criteria include: conflict score (CS), proportion of the largest alignment (PLA), total length of non‐conflicting alignments (TLNA), proportion of non‐conflicting alignments (PNA), and the number of matched bases in non‐conflicting alignments (MBNNA).

To calculate CS, a certain number of k‐mers are randomly selected from the reference (either a shore or a read) within the aligned regions, and the positions of the corresponding k‐mers on the query (a read) are determined. To minimize computational complexity, the positions of k‐mers on the query are derived based on the starting and ending coordinates on both the reference and query of the alignment to which the reference k‐mer belongs (see Figure S10A). For an alignment with starting and ending coordinates on the reference and query given as $r_s$, $r_e$, $q_s$, and $q_e$, respectively, and a k‐mer on the reference with coordinate $r_k$, the coordinate $q_k$ of the corresponding k‐mer on the query can be calculated as

[eqn]

where $r_l$ and $q_l$ represent the alignment lengths on the reference and query, respectively, and are computed as $r_l = r_e - r_s$ and $q_l = q_e - q_s$. After determining $r_k$ and $q_k$ for each pair of k‐mers, CS is calculated as

[eqn]
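The coordinate projection used for CS can be sketched as below. The linear‐interpolation form is an assumption inferred from the definitions of $r_l$ and $q_l$; the exact expression and the CS formula itself are not reproduced here. The function name `project_kmer` is hypothetical, and only forward‐strand alignments are considered.

```python
def project_kmer(r_k, r_s, r_e, q_s, q_e):
    """Estimate the query coordinate q_k of a reference k-mer at r_k.

    Assumes the k-mer keeps the same relative position within the
    alignment on the query as on the reference (linear interpolation).
    """
    r_l = r_e - r_s  # alignment length on the reference
    q_l = q_e - q_s  # alignment length on the query
    if r_l <= 0:
        raise ValueError("alignment must have positive length on the reference")
    return q_s + (r_k - r_s) * q_l / r_l
```

Deriving $q_k$ from the alignment endpoints in this way avoids a base‐level traversal of the alignment, which is consistent with the stated goal of minimizing computational complexity.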

PLA is defined as the length of the largest alignment divided by the length of the ideal alignment region. The ideal alignment region is illustrated in Figure S10B, and its position can be estimated based on the coordinates of the largest alignment. As shown in the figure, the ideal alignment region is defined as the area between the reference region from 1 to $r_e + (q_L - q_e)$ and the query region from $q_s - r_s$ to $q_L$, where $r_s$, $r_e$, $q_s$, and $q_e$ represent the starting and ending coordinates on the reference and query for the largest alignment, respectively, and $q_L$ is the total length of the query.

To calculate TLNA, PNA, and MBNNA, the non‐conflicting alignments must first be identified (see Figure S10C). The process begins with merging the overlapping alignments. We require that the overlaps on both reference and query are each longer than 500 bp, and that the difference between the overlap lengths on the reference and query is smaller than the smaller of the two overlap lengths. Next, the merged alignments are compared to the largest alignment, and conflicting alignments are detected. For each alignment, we calculate the distances $d_r$ and $d_q$ to the largest alignment on the reference and query, respectively. An alignment is considered a conflicting alignment if $\max(d_r, d_q) > 2\min(d_r, d_q)$. The distance between alignments is defined as the difference between the starting coordinate of the later alignment and the ending coordinate of the earlier one. After removing the conflicting alignments, the following three criteria can be calculated using the remaining non‐conflicting alignments: TLNA is the total length of non‐conflicting alignments. PNA is defined as TLNA divided by the length of the ideal alignment region. MBNNA is the sum of the matched base numbers of all non‐conflicting alignments, with each number provided by the alignment file.
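The conflict test and the three derived criteria can be sketched as follows. Several details are assumptions made for illustration: alignments are assumed to be already merged, alignment length is measured on the reference, the length of the ideal alignment region is supplied by the caller, and the inequality is applied literally even when two alignments overlap slightly. The `Aln` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Aln:
    r_s: int      # start on the reference
    r_e: int      # end on the reference
    q_s: int      # start on the query
    q_e: int      # end on the query
    matches: int  # matched bases reported by the aligner


def distance(start_a, end_a, start_b, end_b):
    """Start of the later alignment minus end of the earlier one."""
    if start_a <= start_b:
        return start_b - end_a
    return start_a - end_b


def is_conflict(aln, largest):
    """Apply the rule max(d_r, d_q) > 2 * min(d_r, d_q)."""
    d_r = distance(aln.r_s, aln.r_e, largest.r_s, largest.r_e)
    d_q = distance(aln.q_s, aln.q_e, largest.q_s, largest.q_e)
    return max(d_r, d_q) > 2 * min(d_r, d_q)


def summarize(alns, ideal_len):
    """Compute TLNA, PNA, and MBNNA from the non-conflicting alignments."""
    largest = max(alns, key=lambda a: a.r_e - a.r_s)
    kept = [a for a in alns if a is largest or not is_conflict(a, largest)]
    tlna = sum(a.r_e - a.r_s for a in kept)
    return {
        "TLNA": tlna,
        "PNA": tlna / ideal_len,
        "MBNNA": sum(a.matches for a in kept),
    }
```

The rule flags alignments whose spacing from the largest alignment differs by more than a factor of two between reference and query, which is the pattern expected when a read fragment is aligned to the wrong repeat copy.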

To further assist users in read selection, Gap‐Aid provides a unified reliability score (URS), which is the weighted sum of the five criteria mentioned earlier. To determine the optimal weights, we built a linear regression model as

[eqn]

The weights (β’s) are trained on centromeric tandem repeats of the A. thaliana genome. In addition to these five criteria, Gap‐Aid also provides a unique k‐mer based reliability score called kMAPQ for each individual alignment. A unique k‐mer is defined as a substring of length k that appears only once in the entire genome. kMAPQ is defined in our previous paper and generated by RAfilter [38]. In ‘automatic’ mode, Gap‐Aid provides a reliability score for each candidate read sequence, calculated as the average URS of the pairs of adjacent reads in the sequence.
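The way URS combines the five criteria, and the way a candidate read sequence is scored in ‘automatic’ mode, can be sketched as below. The weights passed in are placeholders, not the trained β values, and the optional intercept term is an assumption about the regression model; the text does not state whether TLNA and MBNNA are rescaled before weighting. Function names are hypothetical.

```python
CRITERIA = ("CS", "PLA", "TLNA", "PNA", "MBNNA")


def urs(scores, weights, intercept=0.0):
    """Weighted sum of the five criteria for one pair of aligned sequences.

    `scores` and `weights` are dicts keyed by criterion name.
    """
    return intercept + sum(weights[c] * scores[c] for c in CRITERIA)


def sequence_reliability(pair_scores):
    """Average URS over the adjacent read pairs of a candidate sequence."""
    if not pair_scores:
        raise ValueError("a candidate sequence needs at least one adjacent pair")
    return sum(pair_scores) / len(pair_scores)
```

For a candidate sequence shore, read 1, read 2, shore, `pair_scores` would hold the three URS values of the consecutive pairs.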

Gap‐Aid: Generation of Candidate Read Sequences

4.4

To enable the functions in ‘automatic’ mode, candidate read sequences are generated for each gap. Specifically, an overlap graph is constructed, where vertices represent reads and shores, and edges connect overlapping reads. The weight of each edge corresponds to the URS score for the alignment between the two sequences represented by the vertices. For each gap, a heuristic Breadth‐First Search (BFS) is performed on the overlap graph to identify a set of paths that contain edges with relatively high reliability scores, connecting one shore to the other. Each of these paths represents a reliable candidate read sequence. In the heuristic BFS algorithm, when visiting a vertex v, we randomly select a fixed number k (e.g., 5 or 1) of unvisited successor vertices according to the probabilities determined by the corresponding edge weights. In other words, successor vertices with higher weights are selected with higher probabilities. The algorithm terminates once the vertex representing the other shore is visited. The parameter k plays a crucial role in the algorithm's performance. For smaller gaps or higher‐performance computing devices, a larger k value is recommended, as it allows the algorithm to find more reliable paths and thus increases the probability of identifying true candidate read sequences. Conversely, for larger gaps, a smaller k value should be chosen to ensure that the termination vertex (the shore) can be reached within a reasonable time frame. When k = 1, the heuristic BFS algorithm essentially degenerates into a brute‐force path search with a greedy strategy, which may be suitable for special cases such as very large gaps with relatively few incorrect alignments (e.g., low‐repetitive regions).
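A minimal sketch of the heuristic BFS is given below. It assumes the overlap graph is stored as a dictionary mapping each vertex to its successors and their positive URS weights, and it returns a single shore‐to‐shore path per run; repeated runs with different random states would yield a set of candidate paths. The sampling scheme (weighted draws without replacement) and all names are illustrative assumptions.

```python
import random
from collections import deque


def weighted_sample(items, weights, k, rng):
    """Draw up to k distinct items, favoring larger (positive) weights."""
    items, weights = list(items), list(weights)
    picked = []
    while items and len(picked) < k:
        i = rng.choices(range(len(items)), weights=weights)[0]
        picked.append(items.pop(i))
        weights.pop(i)
    return picked


def heuristic_bfs(graph, source, target, k=5, rng=None):
    """Search from one shore to the other, expanding at most k successors per vertex.

    `graph` maps vertex -> {successor: URS weight}.
    Returns a path as a list of vertices, or None if the target is not reached.
    """
    rng = rng or random.Random()
    parent = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            path = []
            while v is not None:
                path.append(v)
                v = parent[v]
            return path[::-1]
        candidates = [u for u in graph.get(v, {}) if u not in parent]
        weights = [graph[v][u] for u in candidates]
        for u in weighted_sample(candidates, weights, k, rng):
            parent[u] = v
            queue.append(u)
    return None
```

With k = 1 the search follows a single randomly chosen successor at every step, so it either reaches the other shore quickly or fails, which matches the trade‐off between k and gap size described above.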

Gap‐Aid: Implementation on Client‐Side

4.5

The software was developed using Python, with the PyQt5, QFluentWidgets, and PyQtGraph libraries used to design the graphical user interface (GUI). These libraries handle layout organization, styling, tooltips, and user interaction features. An iterative file traversal function was implemented to automatically read the required files. To improve runtime efficiency, an index file construction mechanism was added. Python scripts generate SVG images based on input file content, which are then automatically displayed using the QtSvg module in PyQt5. The program was compiled into a Windows executable using the Nuitka package, enabling stand‐alone distribution and execution on the target platform. Additionally, by implementing remote graphical user interface presentation, the program can be equipped with a GUI on Linux systems, ensuring full functionality and usability in the Linux environment.

Gap‐Graph: Aligning the Chromosome‐Level Assembly to the Assembly Graph

4.6

The sequences of nodes in the assembly graph are aligned to the chromosome‐level assembly using minimap2, with the low‐quality alignments (MAPQ < 30) being filtered out. An optimization algorithm is then applied to determine the most probable path of nodes for each chromosome, considering both alignment quality and graph structure. For each alignment a, a score q is calculated to assess its quality. For nodes with sequence lengths smaller than 100 000, the score is calculated as

[eqn]

For larger nodes, the score is

[eqn]

where $l_b$, $l_s$, and $m$ represent the alignment length, sequence length, and the number of matched bases in the alignment, respectively.

The algorithm consists of two steps. First, it generates an initial path of nodes using a greedy strategy. The node with the alignment of the smallest coordinate and a quality score q greater than a threshold is selected as the starting point for the path. The thresholds for q are defined as q > 0.98 for l ∈ (0, 50 000), q > 0.95 for l ∈ [50 000, 100 000), q > 0.93 for l ∈ [100 000, 500 000), and q > 0.85 for l ∈ [500 000, ∞). The successor of the first node, having the highest quality score, is selected as the second node. This process is repeated until either the entire chromosome is successfully aligned or the current node has no successor. If the current node has no successor, the algorithm checks for the presence of a gap region. If a gap is detected, a new first node is selected for the genomic region beyond the gap, and the path‐searching process continues from this new node. If no gap is found, the algorithm backtracks to the previous nodes and explores alternative feasible paths.
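The length‐dependent thresholds and the choice of the starting node can be sketched as follows. The sketch assumes that l refers to the node sequence length and that each alignment is summarized as a hypothetical tuple `(node, chrom_start, node_len, q)`; the score q itself is taken as given.

```python
def q_threshold(length):
    """Minimum quality score required for a node of the given length."""
    if length < 50_000:
        return 0.98
    if length < 100_000:
        return 0.95
    if length < 500_000:
        return 0.93
    return 0.85


def pick_start(alignments):
    """Pick the passing node whose alignment has the smallest chromosome coordinate.

    `alignments` is a list of (node, chrom_start, node_len, q) tuples.
    Returns the node name, or None if no alignment passes its threshold.
    """
    passing = [a for a in alignments if a[3] > q_threshold(a[2])]
    if not passing:
        return None
    return min(passing, key=lambda a: a[1])[0]
```

The same `pick_start` logic could be reused after a gap is detected, restricted to alignments located beyond the gap.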

Second, an optimization process is applied to refine the initial path. The loss function is defined as

[eqn]

where $i$, $o$, $z$, $u$, and $n$ represent the numbers of nodes in the path with indegree > 1, with outdegree > 1, with indegree = outdegree = 0, with indegree = outdegree = 1, and the total number of nodes in the current path, and $l_j$, $l_{chr}$, and $s_j$ represent the sequence length of the $j$th node, the total sequence length of the nodes in the current path, and the quality score of the sequence alignment of the $j$th node. This loss function ensures the generation of a continuous, complete, reliable, non‐branching path in the graph. The algorithm iteratively optimizes the loss function by randomly modifying the nodes in the path until the loss score no longer decreases or the iteration limit (1000) is reached.
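The iterative refinement can be sketched as a generic randomized local search. The loss function and the path‐modification move are passed in by the caller because their exact forms are not reproduced here, and the `patience` parameter is an assumption about how "no longer decreases" is detected. All names are hypothetical.

```python
import random


def refine_path(path, loss, propose, max_iter=1000, patience=50, rng=None):
    """Randomly modify a path, keeping only changes that lower the loss.

    `loss(path)` returns a number; `propose(path, rng)` returns a modified copy.
    Stops after `max_iter` iterations or `patience` consecutive non-improving proposals.
    """
    rng = rng or random.Random()
    best, best_loss = list(path), loss(path)
    stale = 0
    for _ in range(max_iter):
        candidate = propose(best, rng)
        candidate_loss = loss(candidate)
        if candidate_loss < best_loss:
            best, best_loss, stale = candidate, candidate_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best, best_loss
```

In Gap‐Graph the proposal step would presumably swap, insert, or remove nodes subject to the graph structure; here it is left abstract so that the accept‐if‐better loop and the iteration limit of 1000 are the only parts illustrated.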

Gap‐Graph: Implementation on Client Side

4.7

The PC‐end implementation of Gap‐Graph was developed using the Electron framework, with web technologies (HTML, CSS, JavaScript) used to build the user interface. The application leverages the Sigma.js and Graphology libraries for efficient rendering and interactive visualization of complex graph structures. Sigma.js supports GPU‐based rendering, ensuring smooth performance even with large graphs containing numerous nodes and edges, thus enhancing the user experience. Additionally, the software employs thread pooling and multithreading techniques to efficiently process TB‐level files, significantly reducing processing time.

Validation of Key Technologies

4.8

In the experiments with the A. thaliana genome, all sequencing data used were downloaded from the National Genomics Data Center (NGDC) database under project number PRJCA007112.

To validate the strategy for filtering out non‐gap‐region reads, all HiFi reads were aligned to the ‘ground truth’ gap‐region sequence. A total of 1,115 reads with high‐quality alignments (MAPQ > 10 and alignment length > 60% of the read length) were identified as the ‘ground truth’ gap‐region reads.
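The ground‐truth criterion and a simple way to score the filter against it can be sketched as below. The recall and precision measures are illustrative choices for comparing the retained reads with the ‘ground truth’ gap‐region reads; the text does not specify which metrics were reported. Function names are hypothetical.

```python
def is_gap_read(aln_len, read_len, mapq, min_mapq=10, min_frac=0.6):
    """Ground-truth rule: MAPQ > 10 and alignment length > 60% of the read length."""
    return mapq > min_mapq and aln_len > min_frac * read_len


def retention_metrics(kept_reads, truth_reads):
    """Recall and precision of the reads retained by the filter."""
    kept, truth = set(kept_reads), set(truth_reads)
    true_pos = len(kept & truth)
    recall = true_pos / len(truth) if truth else 0.0
    precision = true_pos / len(kept) if kept else 0.0
    return recall, precision
```

High recall indicates that few true gap‐region reads are lost by the filter, whereas precision reflects how many non‐gap reads survive it.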

In the validation of alignment scores, we constructed a vector for each of the five types of quality scores, the unified quality score, and the ‘ground truth’. For each type of quality score, each vector entry corresponds to the score of a single alignment, while the ‘ground truth’ vector is binary, with each entry indicating whether the alignment is correct (value = 1) or incorrect (value = 0). We then calculated the correlations between these vectors. In this experiment, all alignments with ‘ground truth’ were divided into training and testing sets, containing 80% and 20% of the alignments, respectively. The training set was used to estimate the parameters of the linear regression model used to calculate the unified score, and the results presented were obtained from the testing set.
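The comparison between a score vector and the binary ‘ground truth’ vector, together with the 80%/20% split, can be sketched as follows. The use of the Pearson coefficient is an assumption, since the type of correlation is not stated, and the shuffling seed is arbitrary. Function names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def split_80_20(items, seed=0):
    """Shuffle and split items into training (80%) and testing (20%) sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(0.8 * len(items))
    return items[:cut], items[cut:]
```

In this setup the regression weights would be fitted on the training alignments only, and the correlation of each score with the ground truth would be computed on the held‐out testing alignments.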

In this section and throughout the rest of the paper, all sequence alignments were performed using minimap2 (v2.26‐r1175). QUAST (v5.2.0) was employed to compare the assemblies with the ‘ground truth’ using default parameters. Synteny analysis between the assemblies and the ‘ground truth’ was conducted using SyRI (v1.6.3) [39].

Sample Collection and Sequencing of Rice 9311 Genome

4.9

Young leaves of Oryza sativa (rice) cultivar 9311 were collected from the Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences (Guangdong Province, China), and immediately flash‐frozen in liquid nitrogen. Genomic DNA was extracted using the DNeasy Plant Mini Kit (Qiagen), according to the manufacturer's instructions. Library construction and sequencing were carried out by the Genome Center of Grandomics (Wuhan, China). For HiFi sequencing, two single‐molecule real‐time (SMRT) cells were sequenced on the PacBio Sequel II platform, generating 35.9 Gb of data using Circular Consensus Sequencing (CCS) (https://github.com/PacificBiosciences/ccs) with default parameters. For ONT UL sequencing, the library was prepared following the method described in Wang et al., and sequencing was performed on the Oxford Nanopore Technology PromethION platform, generating 72.5 Gb of ONT UL reads. The Illumina NovaSeqX‐plus platform was used to produce 43.4 Gb of Hi‐C reads and 21.4 Gb of standard short reads. The Hi‐C library was constructed and sequenced according to the protocol outlined in Rao et al., with DpnII (NEB) as the restriction enzyme used for Hi‐C library preparation.

Validation and Application on Rice Genomes

4.10

To generate the chromosome‐level assembly for the 9311 genome, hifiasm (v0.19.8‐r603) was used to generate contigs from HiFi and ONT UL reads with the parameters “-l 0 --primary”, and 3D‐DNA was used for scaffolding with default parameters. Finally, the Hi‐C map of the chromosomes was visualized, and manual curation, such as removing contigs and rearranging their order, was performed using Juicebox. The process for generating the chromosome‐level assembly of the synthetic diploid rice genome was largely the same as for the 9311 genome. The key difference is that, for a diploid genome, hifiasm generates two sets of contigs corresponding to the two haplotypes, and 3D‐DNA performs scaffolding for each set separately.

QUAST (v5.2.0) and SyRI (v1.6.3) were utilized to evaluate the assemblies and perform synteny analysis between the assemblies and the ‘ground truth,’ following the methodology described in “Validation of key technologies”.

For gene annotation, LiftOn (v1.0.5) [40] was employed to map the gene coordinates from a previously published 9311 genome onto T2T‐9311, applying strict criteria (coverage = 100% and identity > 90%). Additionally, the software identified extra gene copies not annotated in the earlier genome, using filtering parameters of coverage = 100% and identity > 99%.

Validation and Application on Complex and Polyploid Genomes

4.11

In experiments on the human genome (HG002), we focused on the large gaps in the centromeric regions of chromosome 10 (both haplotypes) and the X chromosome. HiFi reads (114.9 Gb, 36× coverage) covering these gap regions were retrieved using the read‐recalling module of TRFill [31] and used to construct a unitig‐level assembly graph with hifiasm (v0.19.5‐r587). ONT UL reads (53.7 Gb, 17× coverage) were then aligned to the unitig‐level assembly graph using GraphAligner (v1.0.20) with default parameters to generate the final GFA file. Subsequently, the unitigs corresponding to the three gap regions were manually assembled based on the assembly graph and supporting information visualized in Gap‐Graph. The quality of the assembled sequences in these centromeric regions was evaluated by comparing them to the corresponding ground truth sequences using QUAST (v5.3.0) and SyRI (v1.7.0), both with default parameters. The HG002 HiFi and ONT UL sequencing data, as well as the ground truth sequences, are available at https://github.com/marbl/HG002.

In experiments on the simulated triploid genome, the haplotype‐resolved genome sequences, HiFi reads, and ground truth were generated in our previous study [31]. This genome was constructed by combining the reference genomes of three tomato varieties (TS2, TS281, and Heinz1706), with each genome representing one haplotype of the triploid. A 100‐kb gap (positions 678,535‐778,535) was introduced in the subtelomeric tandem repeat region of chromosome 1 in the TS2 haplotype, which was subsequently filled manually using Gap‐Aid. The workflow was as follows: first, the preprocessing script pipeline_hifiasm.sh was run with default parameters to generate the files required for gap filling; then, Gap‐Aid was used for manual gap filling. The phasing accuracy of the filled sequences was evaluated by generating synteny plots with SyRI (v1.7.0) (default parameters) against the corresponding sequences of TS2, TS281, and Heinz1706. Using the TS2 sequence as a reference, the quality of the filled sequence was further assessed with QUAST (v5.3.0) and the k‐mer–based accuracy evaluation tool GEVA [41].

In experiments on the poplar genome, HiFi sequencing data and the haplotype‐resolved genome sequences were downloaded from NGDC (PRJCA029103), with HiFi data at ∼30× coverage (29.6 Gb) [30]. Three gaps were randomly selected for testing: for haplotype 1, gaps on chromosomes 13 and 17 were chosen, and for haplotype 2, a gap on chromosome 19 was selected. The preprocessing script pipeline_hifiasm.sh of Gap‐Aid was used to generate the files required for gap filling, and Gap‐Aid was then applied to complete the gap‐filling process. After filling, coverage of the filled regions was assessed using GCI [42] (v1.0) with default parameters to evaluate the quality of the assemblies. Repetitive sequences in the filled regions were analyzed using TRF [43] with parameters “2 5 7 80 10 50 2000” (the repeat units and their copy numbers are provided in Table S2). Finally, the filled sequences were subjected to coding region identification via NCBI [44] and functionally annotated using the GFAP [45] program.

Author Contributions

W.P., D.L., and D.X. planned and designed the research project. D.X. and X.Z. developed Gap‐Aid, while S.T. developed Gap‐Graph. H.W., Q.X., Y.L., and L.S. evaluated the functionality and effectiveness of the software. W.P., D.X., Y.L., and X.Z. wrote the manuscript, and W.P. and D.X. revised it.

Funding

This work was supported by the grants from the National Key R&D Program of China (2025YFC3410300); the National Natural Science Foundation of China (Grant No. 32470678); the Agricultural Science and Technology Innovation Program (CAAS‐ZDRW202503); the Youth Innovation Program of the Chinese Academy of Agricultural Sciences (Y2025QC36); the Agricultural Science and Technology Innovation Program (CAAS‐CSIAF‐202301); the Project of State Key Laboratory of Tropical Crop Breeding (NO.SKLTCBZRJJ202502); the Science and Technology Project of the Ministry of Agriculture and Rural Affairs, P.R. China; and Basic Research Programs of Shanxi Province (202303021211069).

Conflicts of Interest

The authors declare no conflict of interest.

Supporting information

Supporting File 1: advs73202‐sup‐0001‐FiguresS1‐S10.pdf.

Supporting File 2: advs73202‐sup‐0002‐TablesS1‐S2.xlsx.

Supporting File 3: advs73202‐sup‐0003‐VideoS1.mp4.

Bibliography

1. J. Wang, D. Xu, Y. L. Sang, et al., “A Telomere‐to‐Telomere Gap‐Free Reference Genome of Chionanthus Retusus Provides Insights into the Molecular Mechanism Underlying Petal Shape Changes,” Horticulture Research 11, no. 12 (2024): uhae249, 10.1093/hr/uhae249. PMID 39664691; PMC11629972.
2. S. Aganezov, S. M. Yan, D. C. Soto, et al., “A Complete Reference Genome Improves Analysis of Human Genetic Variation,” Science 376, no. 6588 (2022): eabl3533, 10.1126/science.abl3533. PMID 35357935; PMC9336181.
3. A. Rhie, S. Nurk, M. Cechova, et al., “The Complete Sequence of a Human Y Chromosome,” Nature 621, no. 7978 (2023): 344–354, 10.1038/s41586-023-06457-y. PMID 37612512; PMC10752217.
4. Z. Liu, N. Wang, Y. Su, et al., “Grapevine Pangenome Facilitates Trait Genetics and Genomic Breeding,” Nature Genetics 56, no. 12 (2024): 2804–2814, 10.1038/s41588-024-01967-5. PMID 39496880; PMC11631756.
5. E. D. Jarvis, G. Formenti, A. Rhie, et al., “Semi‐Automated Assembly of High‐Quality Diploid Human Reference Genomes,” Nature 611, no. 7936 (2022): 519–531, 10.1038/s41586-022-05325-5. PMID 36261518; PMC9668749.
6. J. Wong, L. Coombe, V. Nikolić, et al., “Linear Time Complexity De Novo Long Read Genome Assembly with GoldRush,” Nature Communications 14, no. 1 (2023): 2906, 10.1038/s41467-023-38716-x. PMID 37217507; PMC10202940.
7. F. Chen, “Plant Genomes: Toward Goals of Decoding Both Complex and Complete Sequences,” Ornamental Plant Research 2, no. 1 (2022): 24, 10.48130/OPR-2022-0024.
8. Y. Zhou, J. Zhang, X. Xiong, Z.‐M. Cheng, and F. Chen, “De Novo Assembly of Plant Complete Genomes,” Tropical Plants 1, no. 1 (2022): 7, 10.48130/TP-2022-0007.