The genome sequence of lesser trefoil or Irish shamrock, Trifolium dubium Sibth. (Fabaceae)
Markus Ruhsam, Peter M Hollingsworth, Ann M. Mc Cartney, Katie E. Herron, Graham M. Hughes, Maarten J. M. Christenhusz, Michael F. Fay, Ilia J. Leitch, Yoshinori Fukasawa, Rizky Dwi Satrio

TL;DR
This paper presents the genome sequence of Trifolium dubium, a plant in the legume family, as part of a larger genomic initiative.
Contribution
The study provides a high-quality genome assembly of Trifolium dubium, including chromosomal scaffolding and organellar genomes.
Findings
The genome assembly spans 679.1 megabases and is scaffolded into 15 chromosomal pseudomolecules.
The mitochondrial genomes are 133.86 kb and 182.32 kb in length, while the plastid genome is 126.22 kb.
Abstract
We present a genome assembly from an individual Trifolium dubium (lesser trefoil; Tracheophyta; Magnoliopsida; Fabales; Fabaceae) as part of a collaboration between the Darwin Tree of Life and the European Reference Genome Atlas. The genome sequence is 679.1 megabases in span. Most of the assembly is scaffolded into 15 chromosomal pseudomolecules. The two mitochondrial genomes have lengths of 133.86 kb and 182.32 kb, and the plastid genome assembly has a length of 126.22 kilobases.
| Project accession data | |
|---|---|
| Assembly identifier | drTriDubi3.1 |
| Species | Trifolium dubium |
| Specimen | drTriDubi3 |
| NCBI taxonomy ID | 97021 |
| BioProject | PRJEB59394 |
| BioSample ID | SAMEA10983579 |
| Isolate information | drTriDubi3: flower and leaf (DNA sequencing) |
| Assembly metrics | |
| Consensus quality (QV) | 67.2 |
| k-mer completeness | 100.0% |
| BUSCO | C:98.9%[S:5.4%, D:93.6%] |
| Percentage of assembly mapped to chromosomes | 99.51% |
| Sex chromosomes | None |
| Organelles | Mitochondrial genomes: 133.86 kb and 182.32 kb; plastid genome: 126.22 kb |
| Raw data accessions | |
| Pacific Biosciences SEQUEL II | ERR10841331 |
| Hi-C Illumina | ERR10851537 |
| PolyA RNA-Seq Illumina | ERR10908617 |
| Genome assembly | |
| Assembly accession | GCA_951804385.1 |
| Accession of alternate haplotype | GCA_951804395.1 |
| Span (Mb) | 679.1 |
| Number of contigs | 742 |
| Contig N50 length (Mb) | 3.0 |
| Number of scaffolds | 153 |
| Scaffold N50 length (Mb) | 46.0 |
| Longest scaffold (Mb) | 64.64 |

| Chromosome | Length (Mb) | GC% |
|---|---|---|
| 1 | 64.64 | 35.0 |
| 2 | 59.51 | 32.5 |
| 3 | 59.37 | 32.5 |
| 4 | 51.57 | 33.0 |
| 5 | 50.66 | 33.0 |
| 6 | 47.18 | 32.5 |
| 7 | 46.01 | 32.5 |
| 8 | 40.67 | 32.5 |
| 9 | 40.21 | 32.0 |
| 10 | 39.28 | 32.5 |
| 11 | 38.74 | 32.0 |
| 12 | 36.76 | 31.5 |
| 13 | 35.95 | 32.0 |
| 14 | 34.19 | 32.0 |
| 15 | 31.43 | 31.5 |
| Pltd | 0.13 | 35.0 |
| MT1 | 0.13 | 44.5 |
| MT2 | 0.18 | 45.5 |

| Software tool | Version | Source |
|---|---|---|
| BlobToolKit | 4.1.7 | |
| BUSCO | 5.3.2 | |
| HiCanu | 2.2 | |
| HiGlass | 1.11.6 | |
| Merqury | MerquryFK | |
| MitoHiFi | 2 | |
| OATK | 0.1 | |
| PretextView | 0.2 | |
| purge_dups | 1.2.3 | |
| sanger-tol/genomenote | v1.0 | |
| sanger-tol/readmapping | 1.1.0 | |
| YaHS | 1.1a.2 | |
Funding: Wellcome Trust
Species taxonomy
Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliopsida; Mesangiospermae; eudicotyledons; Gunneridae; Pentapetalae; rosids; fabids; Fabales; Fabaceae; Papilionoideae; 50 kb inversion clade; NPAAA clade; Hologalegina; IRL clade; Trifolieae; Trifolium; Trifolium dubium Sibth. (NCBI:txid97021).
Background
Lesser trefoil (Trifolium dubium Sibth.), also known as lesser hop clover or suckling clover, is a common clover species that is considered by most to represent the traditional Irish shamrock. It is native and common across Europe, north to Scandinavia and south to Morocco and Turkey, but it is also found in many temperate regions of the world as an introduced species (POWO, 2023).
Trifolium dubium is a mat-forming annual, which has up to 20 tiny yellow flowers packed in dense globular flower heads (Figure 1). Most commonly, it occurs in unimproved grassland, but is also found in other habitats such as lawns, pastures, coastal meadows, roadsides, waste places and disturbed areas. Its adaptability to different environmental conditions has contributed to its prevalence in both natural and anthropogenic landscapes across its range.
Figure 1. Photographs of Trifolium dubium (a and b are representative images for the species, but not the specimen or population used for genome sequencing; c is a representative plant from the population that was used for genome sequencing). a) https://commons.wikimedia.org/wiki/User:Rasbak b) https://commons.wikimedia.org/wiki/User:Kenraiz c) Markus Ruhsam.
There has been much discussion on the identity of the “true” shamrock, but for over a century the majority of people surveyed have considered T. dubium to be the real one (Colgan, 1892; Colgan, 1893; Nelson, 1991). Shamrock flowers from May to October in Ireland, so it is not generally in flower on St Patrick’s Day (17 March); however, leaves of T. dubium are worn on St Patrick’s Day, and have since become a floral symbol of Ireland. Trifolium dubium appears in numerous emblems of state and non-state organisations and companies across the Republic of Ireland, Northern Ireland, and beyond. Together with the harp, the shamrock is registered as an international trademark by the Government of Ireland.
The legend of the shamrock holds that St Patrick used its three-parted clover leaflets to explain to the Irish people the Christian concept of the Holy Trinity (Van Treeck & Croft, 1936), although the word “shamrock” derives from the Irish words seamair (clover) and óg (young) (Nelson, 1991).
While T. dubium is not typically cultivated as a primary crop, like most legumes it is capable of fixing atmospheric nitrogen through its symbiotic relationship with nitrogen-fixing bacteria in root nodules (Brock, 1973). This enriches the soil as well as the plants themselves, which therefore provide a good source of macro- and micronutrients and protein for livestock (Brock, 1973; Gounden et al., 2018). This species and several related species of Trifolium also produce condensed tannins (unlike the major crop clover species T. repens L. and T. pratense L.), making them of interest to breeding programmes of forage legumes, because they are less likely to cause legume bloat in ruminants (Fay & Dale, 1993).
While many cytological studies of Trifolium species have indicated that most (about 80%) are diploid based on x = 8 (with descending dysploidy giving rise to x = 7, 6 or 5 in some species; Ellison et al., 2006), counts of T. dubium have suggested it is a tetraploid, although there has been some discrepancy as to whether it is 2n = 28 or 30 (Ansari et al., 2008; Taylor et al., 1983; Vižintin et al., 2006; Zohary & Heller, 1984), or 2n = 32 (based on a chromosome count of a plant from Kent, England; Gornall and Bailey, 1993). Recent molecular cytogenetic studies of T. dubium with 2n = 30 are in agreement with the genome assembly reported here, and have provided important insights into its genetic composition and evolution (e.g. Ansari et al., 2008; Vozárová et al., 2021). Such studies have proposed that the species is an allotetraploid that likely arose from natural hybridisation between T. campestre Schreb. (2n = 14) and T. micranthum Viv. (2n = 16) (Ansari et al., 2008).
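The chromosome arithmetic behind the proposed allotetraploid origin can be sketched as follows. This is a minimal illustration only, assuming a simple model in which the hybrid receives one reduced gamete from each diploid parent and its chromosome complement then doubles; the function name is hypothetical, and the parental counts are those quoted above.

```python
def allotetraploid_2n(parent_a_2n: int, parent_b_2n: int) -> int:
    """Somatic chromosome number expected after hybridisation plus genome doubling.

    Assumption: each diploid parent contributes one reduced gamete (n = 2n / 2),
    and the hybrid's chromosome complement subsequently doubles.
    """
    if parent_a_2n % 2 or parent_b_2n % 2:
        raise ValueError("diploid somatic counts are expected to be even")
    hybrid_n = parent_a_2n // 2 + parent_b_2n // 2  # one gamete from each parent
    return 2 * hybrid_n  # whole-genome doubling


# Parental counts quoted in the text: T. campestre 2n = 14, T. micranthum 2n = 16
expected_2n = allotetraploid_2n(14, 16)
haploid_n = expected_2n // 2
print(expected_2n, haploid_n)
```

Under these assumptions the expected somatic count is 2n = 30, that is, 15 chromosomes per haploid set, which matches the 15 chromosomal pseudomolecules of the assembly.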
Whole genome sequence data are now available for at least six Trifolium species (e.g. Bickhart et al., 2022; Garg et al., 2022; Griffiths et al., 2019; Santangelo et al., 2023), and here we present the first high-quality genome for T. dubium, stemming from a collaboration involving the Darwin Tree of Life Project and the European Reference Genome Atlas pilot project. We anticipate this genome will be a valuable genomic resource for a range of future studies. These include comparative analyses focused on the evolution of allopolyploid genomes, as well as studies exploring its potential as an additional nutritional source for livestock, especially given its high condensed tannin content.
Genome sequence report
The genome was sequenced from a specimen of Trifolium dubium collected from Gorebridge, UK (55.84, –3.04). Using flow cytometry, the genome size (1C-value) was estimated to be 0.84 pg, equivalent to 820 Mb. A total of 72-fold coverage in Pacific Biosciences single-molecule HiFi long reads was generated. Primary assembly contigs were scaffolded with chromosome conformation Hi-C data. Manual assembly curation corrected 283 missing joins or mis-joins, reducing the scaffold number by 61.95%, and increasing the scaffold N50 by 14.41%.
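The conversion from a 1C-value in picograms to megabases can be illustrated with a short sketch. It assumes the widely used factor of 1 pg ≈ 978 Mb; the text does not state which factor was applied, so the constant and the function name here are assumptions.

```python
# Assumed conversion factor: 1 pg of DNA corresponds to roughly 978 Mb.
# The source does not state the factor used, so treat this as illustrative.
MB_PER_PG = 978.0


def pg_to_mb(c_value_pg: float) -> float:
    """Convert a 1C-value in picograms to megabases."""
    return c_value_pg * MB_PER_PG


estimated_mb = pg_to_mb(0.84)  # 1C-value reported for T. dubium
print(f"{estimated_mb:.1f} Mb")
```

With this factor, 0.84 pg corresponds to roughly 822 Mb, in line with the approximately 820 Mb quoted above once rounding is taken into account.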
The final assembly has a total length of 679.1 Mb in 153 sequence scaffolds with a scaffold N50 of 46.0 Mb (Table 1). The snail plot in Figure 2 provides a summary of the assembly statistics, while the distribution of assembly scaffolds on GC proportion and coverage is shown in Figure 3. The cumulative assembly plot in Figure 4 shows curves for subsets of scaffolds assigned to different phyla. Most (99.51%) of the assembly sequence was assigned to 15 chromosomal-level scaffolds. Chromosome-scale scaffolds confirmed by the Hi-C data are named in order of size (Figure 5; Table 2). The order and orientation of contigs on chromosome 1 between 37.5 Mb and 42.4 Mb is uncertain. While not fully phased, the assembly deposited is of one haplotype. Contigs corresponding to the second haplotype have also been deposited. The mitochondrial and plastid genomes were also assembled and can be found as contigs within the multifasta file of the genome submission.
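The scaffold N50 and the percentage assigned to chromosomes can be reproduced from the published figures with a short sketch. It uses the chromosome lengths from Table 2 and the total span from the snail plot caption (679,499,717 bp), and it assumes that every unplaced scaffold is shorter than the shortest chromosome, so the unplaced scaffolds only contribute to the total. The helper `nx` is a hypothetical name, not part of any pipeline used in this work.

```python
from typing import Optional, Sequence


def nx(lengths: Sequence[float], x: float, total: Optional[float] = None) -> float:
    """Return the Nx length: the length L such that sequences of length >= L
    together cover at least x% of `total` (default: the sum of `lengths`)."""
    if total is None:
        total = sum(lengths)
    threshold = total * x / 100.0
    running = 0.0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    raise ValueError("lengths do not reach the requested fraction of total")


# Chromosome lengths in Mb, as listed in Table 2.
CHROMOSOMES_MB = [64.64, 59.51, 59.37, 51.57, 50.66, 47.18, 46.01, 40.67,
                  40.21, 39.28, 38.74, 36.76, 35.95, 34.19, 31.43]
ASSEMBLY_MB = 679.499717  # total span given in the snail plot caption

# Assumption: unplaced scaffolds are all shorter than chromosome 15.
n50 = nx(CHROMOSOMES_MB, 50, ASSEMBLY_MB)
n90 = nx(CHROMOSOMES_MB, 90, ASSEMBLY_MB)
chromosome_pct = 100 * sum(CHROMOSOMES_MB) / ASSEMBLY_MB
print(n50, n90, round(chromosome_pct, 2))
```

Under that assumption the sketch returns the lengths of chromosomes 7 and 14 as N50 and N90, which agrees with the 46.0 Mb scaffold N50 and the N90 given in the snail plot caption, and the chromosome share rounds to the 99.51% quoted above.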
Table 1. Genome data for Trifolium dubium, drTriDubi3.1.
Figure 2. Genome assembly of Trifolium dubium, drTriDubi3.1: metrics. The BlobToolKit Snailplot shows N50 metrics and BUSCO gene completeness. The main plot is divided into 1,000 size-ordered bins around the circumference with each bin representing 0.1% of the 679,499,717 bp assembly. The distribution of scaffold lengths is shown in dark grey with the plot radius scaled to the longest scaffold present in the assembly (64,644,275 bp, shown in red). Orange and pale-orange arcs show the N50 and N90 scaffold lengths (46,006,535 and 34,190,264 bp), respectively. The pale grey spiral shows the cumulative scaffold count on a log scale with white scale lines showing successive orders of magnitude. The blue and pale-blue area around the outside of the plot shows the distribution of GC, AT and N percentages in the same bins as the inner plot. A summary of complete, fragmented, duplicated and missing BUSCO genes in the fabales_odb10 set is shown in the top right. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/drTriDubi3_1/dataset/drTriDubi3_1/snail.
Figure 3. Genome assembly of Trifolium dubium, drTriDubi3.1: BlobToolKit GC-coverage plot. Scaffolds are coloured by phylum. Circles are sized in proportion to scaffold length. Histograms show the distribution of scaffold length sum along each axis. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/drTriDubi3_1/dataset/drTriDubi3_1/blob.
Figure 4. Genome assembly of Trifolium dubium, drTriDubi3.1: BlobToolKit cumulative sequence plot. The grey line shows cumulative length for all scaffolds. Coloured lines show cumulative lengths of scaffolds assigned to each phylum using the buscogenes taxrule. An interactive version of this figure is available at https://blobtoolkit.genomehubs.org/view/drTriDubi3_1/dataset/drTriDubi3_1/cumulative.
Figure 5. Genome assembly of Trifolium dubium, drTriDubi3.1: Hi-C contact map of the drTriDubi3.1 assembly, visualised using HiGlass. Chromosomes are shown in order of size from left to right and top to bottom. An interactive version of this figure may be viewed at https://genome-note-higlass.tol.sanger.ac.uk/l/?d=F0MkUMXRRMqAc9HJXoQ8LA.
Table 2. Chromosomal pseudomolecules in the genome assembly of Trifolium dubium, drTriDubi3.
The estimated Quality Value (QV) of the final assembly is 67.2 with k-mer completeness of 100.0%, and the assembly has a BUSCO v5.3.2 completeness of 98.9% (single = 5.4%, duplicated = 93.6%), using the fabales_odb10 reference set (n = 5,366).
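To put the consensus quality value in context, the sketch below converts a Phred-scaled QV into an estimated per-base error rate using QV = -10 · log10(P). The function names are hypothetical and the calculation is only an interpretation aid, not part of the evaluation pipeline described in this note.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(p: float) -> float:
    """Phred-scaled quality value for a per-base error probability."""
    return -10 * math.log10(p)


p = qv_to_error_rate(67.2)  # QV reported for drTriDubi3.1
print(f"error rate ~ {p:.2e} per base, about one error per {1 / p / 1e6:.1f} Mb")
```

A QV of 67.2 therefore corresponds to an estimated error rate of roughly 2 × 10^-7 per base, or on the order of one consensus error every 5 Mb.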
Metadata for specimens, barcode results, spectra estimates, sequencing runs, contaminants and pre-curation assembly statistics are given at https://tolqc.cog.sanger.ac.uk/darwin/dicots/Trifolium_dubium/.
Methods
Sample acquisition, genome size estimation and nucleic acid extraction
Leaf and flower samples of Trifolium dubium were collected from Gorebridge, Scotland, UK (latitude 55.84, longitude –3.04) on 2021-08-11. One specimen was used for DNA sequencing (specimen ID EDTOL02342, ToLID drTriDubi3); another was used for Hi-C sequencing (specimen ID EDTOL02341, ToLID drTriDubi2); and a third specimen was used for RNA sequencing (specimen ID EDTOL02343, ToLID drTriDubi4). The specimens were collected and identified by Markus Ruhsam (Royal Botanic Garden Edinburgh) and preserved in liquid nitrogen. A voucher specimen from the same population as the sequenced plant is housed in the herbarium of the Royal Botanic Garden Edinburgh (E), available at https://data.rbge.org.uk/herb/E01152325.
The genome size was estimated by flow cytometry using the fluorochrome propidium iodide and following the ‘one-step’ method as outlined in Pellicer et al. (2021). The General Purpose Buffer (GPB) supplemented with 3% PVP and 0.08% (v/v) beta-mercaptoethanol was used for isolation of nuclei (Loureiro et al., 2007), and the internal calibration standard was Petroselinum crispum ‘Champion Moss Curled’ with an assumed 1C-value of 2,200 Mb (Obermayer et al., 2002).
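The general principle of estimating genome size against an internal standard can be sketched as a simple ratio: the sample 1C-value is the standard 1C-value multiplied by the ratio of the sample and standard fluorescence peak means. The peak values below are hypothetical numbers chosen for illustration, not measurements from this study, and the function name is an assumption.

```python
def genome_size_from_peaks(sample_peak: float,
                           standard_peak: float,
                           standard_1c_mb: float) -> float:
    """Estimate the sample 1C-value (Mb) from the ratio of fluorescence peak means.

    Assumes a linear relationship between propidium iodide fluorescence
    and nuclear DNA content for sample and standard run together.
    """
    if standard_peak <= 0:
        raise ValueError("standard peak mean must be positive")
    return standard_1c_mb * sample_peak / standard_peak


STANDARD_1C_MB = 2200.0  # Petroselinum crispum 1C-value assumed in the text

# Hypothetical peak means (arbitrary fluorescence units), for illustration only.
estimate_mb = genome_size_from_peaks(sample_peak=74.7,
                                     standard_peak=200.0,
                                     standard_1c_mb=STANDARD_1C_MB)
print(f"{estimate_mb:.1f} Mb")
```

With these illustrative peak values the estimate falls close to the approximately 820 Mb reported for T. dubium; real measurements would normally be averaged over replicate runs.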
The workflow for high molecular weight (HMW) DNA extraction at the Wellcome Sanger Institute (WSI) includes a sequence of core procedures: sample preparation, sample homogenisation, DNA extraction, fragmentation, and clean-up. In sample preparation, the drTriDubi3 sample was weighed and dissected on dry ice (Jay et al., 2023).
For sample homogenisation, flower and leaf tissue was cryogenically disrupted using the Covaris cryoPREP® Automated Dry Pulverizer (Narváez-Gómez et al., 2023). HMW DNA was extracted using the Automated Plant MagAttract v2 protocol (Todorovic et al., 2023a). HMW DNA was sheared into an average fragment size of 12–20 kb in a Megaruptor 3 system with speed setting 30 (Todorovic et al., 2023b). Sheared DNA was purified by solid-phase reversible immobilisation (Strickland et al., 2023): in brief, the method employs a 1.8X ratio of AMPure PB beads to sample to eliminate shorter fragments and concentrate the DNA. The concentration of the sheared and purified DNA was assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer with the Qubit dsDNA High Sensitivity Assay kit. Fragment size distribution was evaluated by running the sample on the FemtoPulse system.
RNA was extracted from flower tissue of drTriDubi4 in the Tree of Life Laboratory at the WSI using the RNA Extraction: Automated MagMax™ mirVana protocol (do Amaral et al., 2023). The RNA concentration was assessed using a Nanodrop spectrophotometer and a Qubit Fluorometer using the Qubit RNA Broad-Range Assay kit. RNA integrity was analysed using the Agilent RNA 6000 Pico Kit and Eukaryotic Total RNA assay.
Protocols developed by the WSI Tree of Life core laboratory are publicly available on protocols.io (Denton et al., 2023).
Sequencing
Pacific Biosciences HiFi circular consensus DNA sequencing libraries were constructed according to the manufacturers’ instructions. Poly(A) RNA-Seq libraries were constructed using the NEB Ultra II RNA Library Prep kit. DNA and RNA sequencing was performed by the Scientific Operations core at the WSI on Pacific Biosciences SEQUEL II (HiFi) and Illumina NovaSeq 6000 (RNA-Seq) instruments. Hi-C data were also generated from flower and leaf tissue of drTriDubi2 using the Arima2 kit and sequenced on the Illumina NovaSeq 6000 instrument.
Genome assembly, curation and evaluation
Assembly was carried out with HiCanu (Nurk et al., 2020) and haplotypic duplication was identified and removed with purge_dups (Guan et al., 2020). The assembly was then scaffolded with Hi-C data (Rao et al., 2014) using YaHS (Zhou et al., 2023). The assembly was checked for contamination and corrected as described previously (Howe et al., 2021). Manual curation was performed using HiGlass (Kerpedjiev et al., 2018) and PretextView (Harry, 2022). The organelle genomes were assembled using MitoHiFi (Uliano-Silva et al., 2023) and OATK (Zhou, 2023).
A Hi-C map for the final assembly was produced using bwa-mem2 (Vasimuddin et al., 2019) in the Cooler file format (Abdennur & Mirny, 2020). To assess the assembly metrics, the k-mer completeness and QV consensus quality values were calculated in Merqury (Rhie et al., 2020). This work was done using Nextflow (Di Tommaso et al., 2017) DSL2 pipelines “sanger-tol/readmapping” (Surana et al., 2023a) and “sanger-tol/genomenote” (Surana et al., 2023b). The genome was analysed within the BlobToolKit environment (Challis et al., 2020) and BUSCO scores (Manni et al., 2021; Simão et al., 2015) were calculated.
Table 3 contains a list of relevant software tool versions and sources.
Wellcome Sanger Institute – Legal and Governance
The materials that have contributed to this genome note have been supplied by a Darwin Tree of Life Partner. The submission of materials by a Darwin Tree of Life Partner is subject to the ‘Darwin Tree of Life Project Sampling Code of Practice’, which can be found in full on the Darwin Tree of Life website. By agreeing with and signing up to the Sampling Code of Practice, the Darwin Tree of Life Partner agrees they will meet the legal and ethical requirements and standards set out within this document in respect of all samples acquired for, and supplied to, the Darwin Tree of Life Project.
Further, the Wellcome Sanger Institute employs a process whereby due diligence is carried out proportionate to the nature of the materials themselves, and the circumstances under which they have been/are to be collected and provided for use. The purpose of this is to address and mitigate any potential legal and/or ethical implications of receipt and use of the materials as part of the research project, and to ensure that in doing so we align with best practice wherever possible. The overarching areas of consideration are:
• Ethical review of provenance and sourcing of the material
• Legality of collection, transfer and use (national and international)
Each transfer of samples is further undertaken according to a Research Collaboration Agreement or Material Transfer Agreement entered into by the Darwin Tree of Life Partner, Genome Research Limited (operating as the Wellcome Sanger Institute), and in some circumstances other Darwin Tree of Life collaborators.
References
- 1. Abdennur N, Mirny LA: Cooler: scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics. 2020;36(1):311–316. doi:10.1093/bioinformatics/btz540. PMID: 31290943. PMCID: PMC8205516
- 2. Ansari HA, Ellison NW, Williams WM: Molecular and cytogenetic evidence for an allotetraploid origin of Trifolium dubium (Leguminosae). Chromosoma. 2008;117(2):159–167. doi:10.1007/s00412-007-0134-4. PMID: 18058119
- 3. Bickhart DM, Koch LM, Smith TPL: Chromosome-scale assembly of the highly heterozygous genome of red clover (Trifolium pratense L.), an allogamous forage crop species. GigaByte. 2022;2022: gigabyte42. doi:10.46471/gigabyte.42. PMID: 36824517. PMCID: PMC9650271
- 4. Brock JL: Growth and nitrogen fixation of pure stands of three pasture legumes with high/low phosphate. New Zealand Journal of Agricultural Research. 1973;16(4):483–491. doi:10.1080/00288233.1973.10421093
- 5. Challis R, Richards E, Rajan J: BlobToolKit - interactive quality assessment of genome assemblies. G3 (Bethesda). 2020;10(4):1361–1374. doi:10.1534/g3.119.400908. PMID: 32071071. PMCID: PMC7144090
- 6. Colgan N: The shamrock: an attempt to fix its species. The Irish Naturalist. 1892;1(5):95–97.
- 7. Colgan N: The shamrock: a further attempt to fix its species. The Irish Naturalist. 1893;2(8):207–211.
- 8. Denton A, Yatsenko H, Jay J: Sanger Tree of Life wet laboratory protocol collection V.1. protocols.io. 2023. doi:10.17504/protocols.io.8epv5xxy6g1b/v1
