Gap-free telomere-to-telomere genome assembly of marbled flounder (Pseudopleuronectes yokohamae)
Weiwei Zheng, Aijun Cui, Zhen Meng, Ning Zhang, Yan Jiang, Jichang Zheng, Rixiang Zhao, Wenteng Xu, Yongjiang Xu

TL;DR
The first complete genome of marbled flounder is assembled, offering insights into its evolution and traits.
Contribution
This study provides the first gap-free telomere-to-telomere genome assembly for marbled flounder.
Findings
The genome was assembled onto 24 chromosomes with 48 identified telomeres.
99.29% complete BUSCOs and over 99.89% read mapping confirm high assembly quality.
22,778 protein-coding genes and 121.02 Mb of repeating elements were identified.
Abstract
Marbled flounder (Pseudopleuronectes yokohamae) is becoming a commercially important flatfish species in Northeast Asia due to its strong environmental adaptability and excellent nutritional value. However, no high-quality marbled flounder reference genome was reported to date, which greatly limits the studies of evolutionary and functional genomics. Here, we reported the first gap-free T2T genome in flatfish (marbled flounder), with length of 582.73 Mb (contig N50: 26.29 Mb) combing short reads, PacBio HiFi long reads, ONT ultra-long reads, and Hi-C data. All of the genome sequences were assembled onto 24 chromosomes, and 48 telomeres were identified on both ends of all chromosomes. 99.29% complete BUSCOs were identified, demonstrating a high level of completeness. The average mapping ratio of short reads, PacBio HiFi reads, and ONT ultra-long reads aligned to the genome was more than…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5- —National Science and Technology Innovation 2030-Major Program (2023ZD04055),the project by the National Marine Genetic Resource Center, Ministry of Agriculture and Rural Affairs Special Project-Collec
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenomics and Phylogenetic Studies · Chromosomal and Genetic Variations · Identification and Quantification in Food
Background & Summary
Marbled flounder, Pseudopleuronectes yokohamae (FishBase ID: 1877), belonging to Pleuronectiformes, Pleuronectidae, is widely distributed in Northwest Pacific, from southern Hokkaidoof Japan to the Yellow Sea, the Gulf of Bo Hai, and the northern part of the East China Sea, including Korea^1–4^. It was once an important fishery resource, but due to overfishing, environmental pollution, and habitat degradation, its annual catch has decreased substantially from more than 500 t in 1980s to around 50 t in the 2000s^5–7^. In recent years, it has become an economically important aquaculture species in South Korea, Japan and north of China, because of its tolerance to low temperature, strong adaptability, delicate taste, and high nutritional value^8,9^.
Extensive studies of marbled flounder have been conducted, including life history and ecology^4,10^, nutrition^8^, growth^5^, reproductive biology^4,11^, genetic diversity and population structure^9,12–14^, sperm cryopreservation^15^, and stress response^16,17^. However, the lack of high-quality reference genome has greatly hindered the comprehensive genetic and functional genomic research. To date, only a contig-level marbled flounder genome with low continuity and completeness (contig N50 length of 2 kb assembled by Illumina sequencing) was reported^18^. Therefore, there is an urgent need to obtain a high-quality marbled flounder genome.
The rapid development of sequencing technologies and assembly algorithms, especially long-read sequencing technologies like PacBio high-fidelity (HiFi), and Oxford Nanopore Technologies (ONT) ultra-long sequencing, has made it possible to obtain high-quality telomere-to-telomere (T2T) genome. Several near-T2T or T2T genome assemblies have been reported in fish, such as Chinese sea bass (Lateolabrax maculatus)^19^, the giant grouper (Epinephelus lanceolatus)^20^, the tomato hind (Cephalopholis sonnerati)^21^, African catfish (Clarias gariepinus)^22^, Yadong trout (Salmo trutta)^23^, Asian icefish (Protosalanx chinensis)^24^, and zig-zag eel (Mastacembelus armatus)^25^.
In the present study, we have assembled a gap-free T2T genome of marbled flounder combining whole genome short-read sequencing, PacBio HiFi sequencing, ONT ultralong sequencing, and Hi-C sequencing data. To our knowledge, this is the first gap-free T2T genome reported in flatfish (Fig. 1). This gap-free T2T genome assembly not only provides a robust foundation for identifying the key candidate genes of economic traits, but also advances the future evolutionary genomics within the family Pleuronectidae.Fig. 1. Features of marbled flounder T2T genome assembly. Tracks from outer to inner represent the chromosomes (a), GC content (b), distribution of gene density (c), distribution of the total repetitive sequences density (d), distribution of LTRs (e), distribution of LINEs (f), distribution of DNE TEs (g).
Methods
Sample collection, library construction and sequencing
A two-year-old female marbled flounder was collected for sampling from Zongzhe Marine Technology Co., Ltd, Yantai, Shandong, China. Fresh muscle tissues were sampled and stored at −80 °C for Genomic DNA (gDNA) extraction. Ten different tissues, including muscle, brain, pituitary, gill, liver, spleen, kidney, intestine, skin, and gonad, were collected for RNA extraction. For whole genome short-read sequencing and Hi-C sequencing, the gDNA was extracted using the TIANamp Marine Animals DNA Kit (Tiangen, China). Furthermore, for ONT ultra-long sequencing and PacBio HiFi sequencing, a standard sodium dodecyl sulfate (SDS) extraction method was used for the gDNA isolation. For RNA-seq sequencing, RNA was extracted using RNA Easy Fast Tissue/Cell Kit (Tiangen, China). The quality and quantification of gDNA and RNA was determined using agarose gel electrophoresis, NanoDrop One spectrophotometer (Thermo Fisher Scientific, USA), and Qubit 3.0 Fluorometer (Life Technologies, USA) to meet the requirements for library construction.
For short-read sequencing, a 200–400 bp short-read library was constructed using MGIEasy Universal DNA Library Prep Kit V1.0 (MGI, China) following the standard protocol, and the qualified library was sequenced on the DNBSEQ-T7RS platform (MGI, China). As a result, a total of 73.96 Gb (128×) of short-read sequencing data was generated (Table 1). For PacBio HiFi sequencing, a SMRTbell library was prepared using the SMRTbell Prep Kit 3.0 (Pacific Biosciences, CA, USA) according to the manufacturer’s instruction, and the qualified library then was performed on the PacBio Revio platform (Pacific Biosciences, CA, USA) for sequencing with the CCS mode. Finally, we obtained 31.47 Gb (54×) of PacBio HiFi sequencing data with an average and N50 read length of 17.41 kb and 17.51 kb, respectively (Table 1). The Hi-C library construction included cross-link with formaldehyde, DNA digestion, end repair and biotin labelling, blunt-end ligation, and DNA purification and shearing. After quality control, the library was implemented on DNBSEQ-T7RS platform for sequencing. In total, 58.98 Gb (102×) of Hi-C sequencing data was obtained (Table 1). For ONT ultra-long sequencing, the library was produced using the Oxford Nanopore SQK-ULK001 ligation kit, and the purified library was subsequently sequenced on a PromethION sequencer (Oxford Nanopore Technologies, Oxford, UK). Finally, 25.77 Gb (44 × ) of ONT ultralong sequencing data with mean and N50 read length of 101.84 kb and 100.00 kb, respectively (Table 1). To assist in annotating the coding genes, RNA from ten different tissues was mixed into a total RNA for RNA sequencing. A pair-ended sequencing library was constructed and sequenced on DNBSEQ-T7RS platform. A total of 16.53 Gb RNA-seq data were generated (Table 1).Table 1. Statistics of the sequencing data.Library TypeSequencing PlatformTotal Bases (Gb)Reads Mean Length (bp)Reads N50 (bp)Q30 rate (%)WGS short readsDNBSEQ-T773.96150—97.92Hi-CDNBSEQ-T758.98150—92.76PacbioPacbio Revio31.4717,40717,505—ONTPromethION25.77101,848100,000—RNA-seqDNBSEQ-T716.53150—97.57
Genome assembly and telomere identification
We first evaluated the genome heterozygosity (0.71%) of marbled flounder using K-mer analysis. The primary contigs of marbled flounder genome were initially assembled using PacBio HiFi reads, ONT reads, and Hi-C data by Hifiasm (v0.19.9)^26^ with the parameter of “--n-hap 2”. The Purge Haplotigs pipeline (v1.1.3)^27^ with the parameter “-a 70--depth 250” was used to identify and remove haplotypic duplication of the primary genome. To obtain the chromosome-level genome, Hi-C reads were aligned to the contigs using BWA (v0.7.12)^28^ and filtered out low quality reads using a HiC-Pro pipeline (v3.1.0)^29^. After de-duplication, the valid pairs were used to anchor, order, and orient scaffolds onto chromosomes by juicer (v1.6)^30^ and 3D-DNA pipeline (v180992)^31^. Subsequently, the Hi-C contact map was interactively visualized and manually corrected using Juicebox (v1.91)^32^ based on the fact that the interaction within chromosomes is greater than that between chromosomes, the interaction at close range is greater than that at long range (Fig. 2a). Then, all ONT ultra-long reads were mapped to the chromosome-level genome using minimap2^33^ to extend the telomere sequences and were utilized to fill gaps by TGS-GapCloser^34^ to lengthen the contigs. After telomere extension and gap filling, Pilon (v1.23)^35^ was used to polish the genome assembly with short-read data. Finally, we obtained a gap-free T2T marbled flounder genome assembly with length of 582.73 Mb (contig N50 length of 26.29 Mb), which significantly improved the completeness and continuity compared to the published contig-level marbled flounder genome (GCA_000787555.1)^18^ (contig N50 length of 0.002 Mb) (Table 2). We further identified the occurrences of characteristic sequences (AACCCT/AGGGTT) in the telomere region by searching the whole genome using quartet (v1.2.5)^36^. As a result, a total of 48 telomeres were screened on both ends of 24 chromosomes (Fig. 2b and Table 3).Fig. 2. Hi-C heatmap and T2T genome display map of marbled flounder. (a) The Hi-C heatmap of chromosome interactions in marbled flounder. (b) Distribution map of telomeres and centromeres on the T2T marbled flounder genome. Purple semicircles indicate predicted telomeres.Table 2. Statistics of marbled flounder genome assemblies.GCA_051314325.1GCA_000787555.1^18^Total genome length (Mb)582.73547.80Total chromosome length (Mb)582.73—Number of chromosome24—Number of contigs24525,502Number of scaffolds24—Contig N50 (Mb)26.290.002Scaffold N50 (Mb)26.29—Table 3. Statistics of marbled flounder chromosomes and telomeres.ChromosomeLength(bp)Number of TelomeresNumber of gapschr130,859,92210chr229,635,12110chr329,259,40710chr428,634,31710chr528,352,01510chr627,732,59210chr727,379,01010chr827,124,85810chr927,007,59610chr1026,883,91510chr1126,290,30710chr1225,510,88210chr1324,151,38410chr1423,743,41410chr1523,742,42910chr1622,713,76510chr1721,626,43710chr1821,213,81310chr1920,557,02610chr2020,517,14410chr2119,800,74410chr2217,115,88510chr2317,079,77210chr2415,795,15710
Repeat annotation
Repetitive sequences were predicted by a strategy of combining de novo prediction and homologous alignments. de novo repeat elements were predicted using RepeatMasker (v4.0.5)^37^ based on the de novo database generated by RepeatModeler (v1.0.8)^38^ and LTR-FINDER (v1.0.6)^39^. Homologue prediction was performed using RepeatMasker (v4.0.5) and RepeatProteinMask (v4.0.5) based on the Repbase library (v21.01)^40^. Furthermore, Tandem Repeats Finder (v4.0.7)^41^ was used to detect the tandem repeats. Finally, a total of 121.02 Mb of non-redundant repetitive sequences representing approximately 20.77% of the marbled flounder genome were predicted (Fig. 1 and Table 4). Among these repeats, DNA transposons (9.49%) were the most abundant, followed by long interspersed elements (LINEs, 3.42%), long terminal repeats (LTRs, 3.21%), and short interspersed nuclear elements (SINEs, 0.65%) (Fig. 1 and Table 4).Table 4. Statistics of repetitive sequences in marbled flounder.TypeRepbase TEsTE protiensDe novoCombined TEsLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeDNA34,032,0845.843,906,7820.6731,573,7485.4255,324,4289.49LINE13,059,9712.246,635,8361.1411,490,5911.9719,928,9383.42SINE2,523,7790.4300.001,875,2050.323,801,3000.65LTR10,345,2051.773,310,2060.579,552,5291.6418,701,0283.21Satellite3,122,2220.5400.003,276,2660.566,312,9321.08Simple_repeat00.0000.0000.0000.00Other2,1170.0000.0000.002,1170.00Unknown613,1130.1110,0680.0026,311,5044.5226,745,0694.59Total56,542,4869.7013,853,8322.3882,560,65814.17121,017,05720.77
Protein-coding gene prediction and functional annotation
Gene structure prediction was performed through a combination of de novo prediction, homology-based prediction, and transcriptome-based prediction. For de novo prediction of the gene structures, GENSCAN (v1.0)^42^ and AUGUSTUS (v3.3.2)^43^ were used with default settings. For homology-based prediction, the protein sequences from 5 closely related representative fish species, including Platichthys flesus, Pleuronectes platessa, Hippoglossus hippoglossus, Limanda limanda, and H. stenolepis, were downloaded from NCBI and were then aligned to the marbled flounder T2T genome assembly to predict gene structure by miniprot(v0.11-r234)^44^ and Liftoff (v1.6.3)^45^ based on the homologous evidences. Moreover, RNA-seq data (accession number: SRR33264204) from ten mixed tissues were aligned to the starry flounder T2T genome using Hisat2 (v2.1.0)^46^ for transcriptome-based prediction. The mapped reads were assembled into transcripts using StringTie (v2.1.4)^47^. Subsequently, the merged transcripts were subjected to TransDecoder (v5.7.0, https://github.com/TransDecoder/TransDecoder) for protein-coding gene prediction. Finally, the above gene prediction results were integrated by MAKER2 (2.31.10)^48^, producing a non-redundant protein-coding gene set composed of 22,778 genes (Table 5). Comparisons of the statistical characteristics of the gene set elements between marbled flounder and other six related species exhibited a similar distribution patterns (Fig. 3).Table 5. Statistics of predicted protein-coding genes in marbled flounder.Gene setNumberAverage gene length (bp)Average CDS length (bp)Average exon per geneAverage exon length (bp)Average intron length (bp)De novoGenscan29,39613,8731,6169.09177.741,515AUGUSTUS30,2048,8731,3137.33179.171,195HomologHippoglossus hippoglossus50,94718,9472,23913.10170.921,381Platichthys flesus42,02214,2821,90811.00173.521,238Limanda limanda34,96113,2461,7849.91179.881,286Hippoglossus stenolepis45,45418,5302,25113.08172.121,348Pleuronectes platessa42,61013,9941,92710.59181.871,258LiftoffPlatichthys flesus21,81412,6501,81710.51172.891,139Pleuronectes platessa22,09812,2381,79710.38173.171,114Trans ORFRNAseq12,67115,2411,88312.09291.071,057MAKER22,77814,0421,70410.53267.221,178Fig. 3Comparisons of the statistical characteristics of the gene set elements between marbled flounder, P. flesus, P. platessa, H. hippoglossus, L. limanda, and H. stenolepis. (a) Gene length distributions. (b) CDS length distributions. (c) Exon length distributions. (d) Intron length distributions.
Functional annotation of protein-coding genes were performed by matching them with mutiple public protein databases, such as GO^49^, KEGG^50^, Pfam^51^, UniProt^52^, NR^53^, and InterPro^54^, using DIAMOND (v2.1.8)^55^ (E-value = 1e−5) or the corresponding inbuilt software. As a result, 22,386 genes, accounting for 98.28% of the predicted genes in marbled flounder, were annotated in at least one functional database (Table 6). In addition, the non-coding RNAs, including 2,703 tRNAs and 860 rRNAs, were annotated using tRNAscan-SE (v2.0.12)^56^ and BLASTN, respectively. 1,278 miRNAs and 1,318 snRNAs were annotated using Infernal (v1.1.4)^57^ based on Rfam database (Table 7).Table 6. Statistics of functional annotation of protein-coding genes in marbled flounder.NumberPercent (%)Total22,778Annotated22,38698.28NR22,99698.14SwissProt20,42687.16TrEMBL22,72396.97KOG18,82680.34TF5,58523.83InterPro21,92293.55GO16,75571.50KEGG_ALL22,63596.59KEGG_KO16,35469.79Pfam20,73288.47Unannotated3921.72Table 7Statistics of non-coding RNA in marbled flounder.TypeCopyAverage length(bp)Total length(bp)% of genomemiRNA1,27887110,8670.019026tRNA2,70375202,7330.034790rRNArRNA860121104,2670.01789318S21,1232,2460.00038528S00005.8S11541540.0000265S857119101,8670.017475snRNAsnRNA1,318147193,4040.033177CD-box18912623,9030.0041HACA-box7715311,7550.002016splicing1,042149155,7640.02672scaRNA101981,9820.00034
Gene family and phylogenetic analysis
Gene families of marbled flounder and the other seven species in the Pleuronectidae family, including H. hippoglossus, P. flesus, H. stenolepis, L. limanda, P. platessa, Scophthalmus maximus, and Paralichthys olivaceus, were identified and clustered by Orthofinder (v2.5.5)^58^. As a result, a total of 17,403 common families were obtained, which include 196 species-specific gene families (590 genes) identified in marbled flounder (Fig. 4B,C). Subsequently, 12,370 single copy genes were used for multiple sequence alignment using MAFFT (v7.525)^59^. Then, a maximum likelihood (ML) phylogenetic tree was constructed using RAxML (v8.2.12)^60^ and performed 1000 bootstrap replications. Additionally, MCMCtree program implemented in PAML (v4.9)^61^ was conducted to estimate divergence times among species. The result showed that marbled flounder began to diverge around 11.5 million years ago (Fig. 5).Fig. 4. Comparison of gene families between marbled flounder and H. hippoglossus, P. flesus, H. stenolepis, L. limanda, P. platessa, S. maximus, and P. olivaceus. (A) Statistics of homologous gene numbers across 8 species. (B) Gene family clustering across 8 species.Fig. 5. Phylogenetic relationships and divergence time estimation for marbled flounder and 7 other species in the Pleuronectidae family. S. maximus was chosen as an outgroup. All nodes were supported by 100 cycles of bootstrap resampling. Blue numbers represent the estimated divergence times with a 95% confidence interval. Divergences used for the recalibration of time estimation are marked by red dots.
Data Records
The WGS short-read sequencing data have been uploaded to NCBI Sequence Read Archive (SRA) database with the accession number SRR33323076^62^.The PacBio HiFi sequencing data have been deposited at NCBI SRA database with the accession number SRR33323077^63^. The Hi-C sequencing data have been submitted to NCBI SRA database with the accession number SRR33323078^64^. The ONT sequencing data have been deposited into NCBI SRA database with the accession number SRR33323079^65^. RNA-seq data can be obtained under the accession number SRR33264204^66^. The genome assembly has been submitted to the NCBI Genbank with accession number GCA_051314325.1^67^. The genome annotation file, gene CDS, and protein data have been deposited at Figshare^68^.
Technical Validation
Compared to the previously published marbled flounder genome, the contig N50 length of our T2T genome assembly increased from 2.00 kb to 26.29 Mb, exhibiting a substantial enhancement in contiguity of the genome assembly. The completeness of the marbled flounder genome assembly was assessed using BUSCO (v.5.1.2) with the actinopterygii_odb10 database containing 3,640 BUSCOs. The results showed that 3,614 (99.29%) complete BUSCOs were identified, which is higher than that of closely related species, such as Platichthys flesus (98.2%)^69^ and Pleuronectes platessa (97.9%)^70^, indicating a high level of integrity. Moreover, the base-level accuracy and the consensus quality value (QV) was evaluated using Merqury^71^, and the average per-base accuracy rates and QV were 0.99986 and 38.52, respectively, highlighting a high-quality assembly. Furthermore, WGS short reads, PacBio HiFi reads, and ONT ultra-long reads were aligned to the genome assembly using bwa (v0.7.17)^28^ and minimap2 (v2.17)^33^. The average mapping rates and coverage were more than 99.89% and 99.70%, respectively, highlighting a high sequences consistency.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
