Metagenome-based vertical profiling of the Gulf of Mexico highlights its uniqueness and far-reaching effects of freshwater input
Roth E. Conrad, Despina Tsementzi, Alexandra Meziti, Janet K. Hatt, Joseph Montoya, Konstantinos T. Konstantinidis

TL;DR
This study explores microbial diversity in the Gulf of Mexico using metagenomics, revealing unique genetic adaptations and the long-reaching effects of freshwater input.
Contribution
The study provides 154 novel metagenome-assembled genomes from the Gulf of Mexico, highlighting its distinct microbial diversity and freshwater influence.
Findings
Microbial communities in the Gulf of Mexico cluster strongly by depth rather than location.
Over half of the predicted gene sequences in the Gulf of Mexico are unique to specific sampling stations.
Freshwater-derived microbes were detected at deep, offshore locations, indicating long-lasting effects of freshwater input.
Abstract
Genomic and metagenomic explorations of the oceans have identified well-structured microbial assemblages showing endemic genomic adaptations with increasing depth. However, deep water column surveys have been limited, especially of the Gulf of Mexico (GoM) basin, despite its importance for human activities. To fill this gap, we report on 19 deeply sequenced (~5 Gbp/sample) shotgun metagenomes collected along a vertical gradient, from the surface to about 2,000 m deep, at three GoM stations. Beta diversity analysis revealed strong clustering by depth, and not by station. However, a community-level pangenome style gene content analysis revealed ~54% of predicted gene sequences to be station-specific within our GoM samples. Of the 154 medium-to-high-quality MAGs recovered, 145 represent novel species compared with the NCBI genomes and Tara Oceans MAGs databases. Two of these MAGs were…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Fig 1
Fig 2
Fig 3
Fig 4
Fig 5- —Gulf of Mexico Research Initiativehttp://dx.doi.org/10.13039/100007240
- —National Science Foundationhttp://dx.doi.org/10.13039/501100008982
- —National Science Foundationhttp://dx.doi.org/10.13039/501100008982
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMicrobial Community Ecology and Physiology · Genomics and Phylogenetic Studies · Methane Hydrates and Related Phenomena
INTRODUCTION
While an increasing part of extant microbial diversity is being discovered via (mostly) culture-independent metagenomics approaches, the majority of the diversity has not been recovered yet, especially for enormous environments such as the oceans (1–5). Cataloging this diversity is important for several reasons, including to better understand how this powerful biogeochemical force may be affected in a dynamic and changing world (6–11). To this end, efforts have been undertaken in the past two decades to collect metagenomic samples that cover representative portions of the global ocean (12–19). However, since additional insights are still being gained from every new sample, more sampling is needed across space and time, particularly in regions such as the Gulf of Mexico (GoM) that are important to and heavily influenced by human activities, yet remain undersampled.
Since the first microbial sequences from the surface and deep ocean were compared, genetic differences have been reported between the two habitats (20, 21). For instance, several studies have revealed differences in genome size and amino acid composition between the photic zone and the abyss (22). With increased genome size with increasing depth, the abundance of certain gene functions like transposases and integrases has also been noted to increase with depth (23–25). The physicochemical differences between the surface and deep ocean, including light intensity, salinity, temperature, pressure, and nutrient availability, presumably drive some of these functional gene adaptations. Furthermore, relaxed purified (negative) selection caused by smaller population sizes in the deep vs. the surface ocean has been hypothesized to promote—at least in part—the larger genome size and increased mobile gene content in the deep-sea genomes (21). However, to better quantify and understand the effects of these physicochemical parameters and their distribution throughout different ocean basins, more vertical water column profiles are clearly needed, especially below the photic zone.
Specifically, few metagenomes have been sequenced from the GoM, and these were typically associated with oil spills, hydrocarbon seeps, and hypoxic zones (26–33). In this study, samples were collected during a research voyage targeting surface Trichodesmium blooms, which were identified and tracked via satellite imaging, during a period without major oil inputs. Accordingly, our study enabled the sampling of natural bloom events across spatial gradients. Additionally, we aim to assess how the unique characteristics and geography of the GoM—including substantial riverine inputs from the Mississippi River and others—influence the vertical stratification of microbial assemblages.
To provide an additional metagenomic perspective on the ocean water column, we sequenced 15 shotgun metagenomes in the northern GoM at five distinct depths ranging from the surface down to the oxygen minimum (OM) depth between 300 and 500 m (note that these oxygen minimum depths do not directly correspond to the Oxygen Minimum Zones [OMZs] identified in other ocean basins, which are characterized by a different process of formation and significantly lower oxygen concentrations). We sequenced four additional shotgun metagenomes at one station reaching below the OM depth, down to 2,100 m. We analyzed microbial community diversity and taxonomic composition as well as genetic similarity and functional gene abundance in each sample. We report the fraction of common gene sequences shared between samples compared at 40% and 70% amino acid sequence identity (roughly corresponding to family and genus levels, respectively), as well as 90% and 95% nucleotide sequence identity (species level). We used two thresholds for the species level in order to account for the most frequently observed area of discontinuity among species at 95% genome-aggregate average nucleotide identity (ANI) (34) (95% threshold used); in addition, the facts that several abundant marine taxa like SAR11 show high intraspecies diversity (area of discontinuity is at 90%–91% ANI [35]) and fast-evolving genes in the genome may sometimes show lower identity values than 95% (90% threshold used). We also report gene copy counts and relative abundance of predicted functions in each sample and describe the functional gene content that correlates well with ocean density (which correlates with temperature and depth). Finally, we present novel GoM-associated metagenome-assembled genomes (MAGs) analyzed with the Microbial Genome Atlas (MiGA) alongside their relative abundance distribution across our sample set.
RESULTS
Study stations
Water column samples were collected during conductivity, temperature, and depth rosette (CTD) casts from the May 2012 R/V Endeavor cruise EN509, and shotgun metagenomes were sequenced from the surface (~3 m), the mixed layer (ML; 15–25 m), the deep chlorophyll maximum (DCM; 70–90 m), below the DCM but above the oxygen minimum (OM) depth (aOM; 100–150 m), and the OM (200–400 m) from three stations (2, 5, and 8) in the northwest GoM (Fig. 1A). Four additional metagenomes were sequenced from below the OM depth (Deep; 600, 1,000, 1,470, and 2,107 m) from station 5. Station 2 was near the edge of the Texas-Louisiana Shelf about 50 miles southeast of the mouth of the Mississippi River. Stations 8 and 5 were over the TX-LA Slope about 190 and 270 mi. southeast of Galveston, TX, with station 8 nearer the TX-LA Shelf and station 5 nearer the edge of the TX-LA slope. Fluorescence, oxygen, salinity, and temperature measurements were similar at all stations, except salinity was lower (29.6 vs. 36 practical salinity unit [PSU]), and fluorescence was increased (0.9 vs. 0.07 mg/m^3^) at the surface of station 2, whereas the fluorescence peak of the DCM layer was reduced (0.3 vs. 0.8 mg/m^3^) compared to stations 5 and 8 (Fig. 1B through E). Temperature and salinity were most similar for all samples from the ML and DCM, whereas salinity was variable across surface samples, and temperature and salinity were variable across aOM and OM depth samples (Fig. 1E).
Overview of samples sequenced for this study. CTD casts were made at three stations in the GoM (A), and temperature, salinity, oxygen, and fluorescence measurements were recorded along the depth gradient of the water column for Stations 2 (B), 5 (C), and 8 (D). Temperature-salinity diagram of metagenome sample collection points with isopycnic lines expressing density as σt = (ρ−1,000)[kg/m3] (E). Stations are marked with diamonds (Station 2), circles (Station 5), or squares (Station 8). Colors denote the depth layer (1–9) of the metagenome sample collection. Sequencing effort (F), the number of contigs assembled (G), total assembly length in megabase pairs (H), assembly N50 (I), predicted CDS (J), unassembled reads mapping back to the assembled contigs (K), and unassembled reads mapping to rMAGs (L) are reported for each metagenome sample analyzed.
Metagenome statistics
Sequencing effort, except for sample EN56, which was unintentionally (variation during library loading) sequenced about 3.4 times more deeply, ranged from 3 to 5 Gbp (giga base pairs) for all samples (Fig. 1F) and covered an average of 58% of the estimated sequence diversity of the microbial communities sampled (Fig. 2C; File S1). The Nonpareil diversity (Npc) covered by sequencing effort decreased with depth (Pearson’s r = −0.32; Fig. 2C; File S1, GoM DivCov tab), due presumably to increased genome sizes and greater microbial diversity observed in deeper samples (see also below). The number of contigs assembled, the overall assembly length in Mbp (mega base pairs), and the predicted coding sequences (CDS) were moderately correlated with sequencing effort (Pearson’s r ~ 0.6), whereas assembly N50 was only weakly correlated (Pearson’s r = 0.31; Fig. 1F through J; File S1, GoM Effort tab). Excluding sample EN56, an average of approximately 64,500 contigs and 195,000 CDS were recovered per sample. The 340% increase in sequencing effort for sample EN56 yielded over 144,000 more contigs and 460,000 more CDS on average compared to the other samples (224%–237% increase; Fig. 1F through J). Nonetheless, reads used in the assembly and the diversity covered by sequencing effort remained within the 22%–63% (Fig. 1K) and 47%–77% (Fig. 2C) range of the other samples, respectively, which showed that although greater sequencing effort increased DNA and gene recovery, the amount of recovered but unassembled DNA sequence and rare biosphere sequences also increased (File S1). Only 22%–50% of the sequenced reads were used by the assembled contigs across samples, and only 9%–30% were mapped back to the representative metagenome assembled genomes (rMAGs) binned from those contigs (Fig. 1K and L; File S1), revealing that an average of 52% of the total diversity captured by sequencing was represented by the assembled sequences and predicted CDS (rMAGs are representatives of the 95% ANI-based clusters created by redundant MAGs during dereplication; see Materials and Methods for further details).
Community-level diversity metrics. Average genome size estimates from MicrobeCensus (A), alpha diversity estimates from Nonpareil (B), diversity covered with sequencing effort estimates by Nonpareil (C), and GC content of unassembled (D) vs assembled (E) reads are reported for each sample. The taxonomic distribution of the top 20 most abundant genera in each sample as reported by Kraken with Bracken (F). Note the overwhelming fraction of unclassified reads (reads with no match to the current databases). Beta diversity, as assessed by the Simka Jaccard distance, is shown as a PCoA (G) or hierarchical clustering (H). Note strong sample grouping by depth layer.
While the reads represented by assembled contigs from the surface and ML samples were in the upper range of all samples (56%–63%), the reads represented by rMAGs from samples below the DCM were significantly greater (P < 0.05) than in the DCM and above samples (Fig. 1K and L; File S1, GoM Effort tab includes details of the statistical comparisons performed), which suggested that the deeper water communities are better covered by rMAGs in our analysis. Correspondingly, there were 73 more rMAGs recovered from the aOM and below (68 rMAGs surface – DCM vs. 141 rMAGs aOM – Deep; File S4, MAG-Sample tab), which seemed to explain this discrepancy in rMAG read representation above and below the DCM and indicated better MAG recovery from samples below the DCM. These results are presumably attributable to the deep communities having a similar number of abundant members to the surface communities, which are thus recoverable by genome binning, while also having more rare members, and thus higher overall diversity. Furthermore, 65 of the 141 rMAGs recovered from below the DCM originated from sample EN56, and thus, a better MAG yield was also associated with the increased sequencing depth from this sample. Even so, after excluding sample EN56, there is still an 8 MAG difference below the DCM, and the mean difference of rMAG recovery per sample is still significant (76 rMAGs below DCM – 68 rMAGs above; P < 0.05; File S4, MAG-Sample tab), which yielded a 13.4% increase in community representation by rMAGs based on read mapping results below the DCM (File S1, GOM Effort tab). Regardless, the G+C% content of the assembled contigs remained similar (mean difference < 1%) to that of the unassembled reads, suggesting that the assembly is likely not biased toward any specific proportion of the community and is likely representative, at least in terms of G+C% (Fig. 2D and E; File S1).
Microbial community metrics
Apart from sample EN21, which was an outlier, average genome size had the strongest correlation with density (Pearson’s r = 0.86), followed by temperature (Pearson’s r = −0.80) and depth (Pearson’s r = 0.67), which indicated that microbial genome size increased in denser, colder, and deeper water masses (Fig. 2A; File S1, GoM GenomeSize tab). Likewise, G+C% content also increased (excluding EN21) most strongly with density (Pearson’s r = 0.88), followed by temperature (Pearson’s r = −0.87) and depth (Pearson’s r = 0.58; File S1, GoM %GC tab). While alpha diversity was only weakly correlated with depth across all samples (Pearson’s r = 0.37), alpha diversity correlation with depth was stronger when each sampling station was considered separately (Pearson’s r = 0.60, 0.68, and 0.77 for Stations 2, 8, and 5), which also indicated increased alpha diversity in denser, colder, and deeper water masses (Fig. 2B; Fig. S1; File S1, GoM AlphaDepth tab), and implied increased genome size is associated with increased G+C% content and alpha diversity, which corroborated previous studies (21).
Based on taxonomic annotations of the metagenomic reads against publicly available genomes (Kraken2 and Bracken results), the largest proportion of the microbial community was taxonomically classified from the surface and ML samples taken from Stations 5 and 8 (~20% vs <10%), and no more than 25% of the microbial community was classified in any sample (Fig. 2F), revealing that the GoM water column communities are underrepresented in available genome databases. While samples EN21, EN84, and EN56 appeared the least similar to the other samples in their depth layers in terms of taxonomic composition (Fig. 2F through H), short-read, kmer-based beta-diversity clustering showed more similarity by depth layer than by sampling station (Fig. 2G and H). This depth-layer similarity held true even when GoM samples were compared with publicly available metagenome samples from other ocean basins (Fig. S2; File S1, Ocean Overview tab). Interestingly, alpha diversity was significantly lower in the GoM compared to the North and South Atlantic and Pacific samples (P < 0.05; File S1, Ocean Basin tab includes details of the statistical comparisons performed).
Members of the Prochlorococcus and Pelagibacter genera had the greatest abundances across all samples (12.5% and 5.3% max, cumulative relative abundance, respectively), with the greatest abundance in samples from the DCM to the surface (> 1% relative abundance DCM and above vs. < 1% below; Fig. 2F; File S2, Genus tab). Prochlorococcus was mostly absent below the DCM (< 0.1%), whereas some Pelagibacter species remained prominent members of deeper water communities, although their relative abundance was much lower below the DCM (0.%–0.7%; Fig. 2F; File S2). The relative abundance of Prochlorococcus and Pelagibacter species was greatly reduced in sample EN21 compared to other samples above the DCM (<0.2%). In fact, the taxonomic abundance profiles for the surface and ML samples at Station 2 looked quite different compared to Stations 5 and 8, but the DCM, aOM, and OM profiles looked highly similar across all stations (Fig. 2F; File S2), indicating a signal from coastal proximity and/or that the freshwater riverine output from the Mississippi has a large influence on microbial assemblages of the ML and surface at this station.
Members of the Pseudoalteromonas and Alteromonas genera were the next most abundant across all samples, with samples EN51 and EN84 showing the largest proportions of Pseudoalteromonas members (3.9% and 2.2%), and samples EN59, EN57, and EN52 showing the greatest proportions of Alteromonas (1.5%, 0.8%, and 1.1%; Fig. 2F; File S2). Apart from sample EN84, which showed a high relative abundance of Pseudoalteromonas (2.2%), Pseudoalteromonas and Alteromonas species were greatly reduced in samples from Stations 2 and 8 (< 0.1%). They were also reduced in the DCM, aOM, and OM communities. Pseudoalteromonas abundance was greatest at the surface of Station 5.
The taxonomic profile of sample EN56, which was sequenced 3.4 times higher than other samples, was distinct from the other deep-water samples and all samples in general. Sample EN56 harbored a substantial proportion of members from the Fontibacterium (formerly Candidatus Fonsibacter; 4.1%), Nanopelagicus (1.1%), and Polynucleobacter (0.8%) genera whose relative abundance was much lower in other samples (<0.01%) (Fig. S3 and S5). Consistent with these Kraken-based results, we recovered several rMAGs from sample EN56 matching these species, such as Fontibacterium commune and Candidatus Nanopelagicus limnes (File S4). Sample EN56 was also the only sample below the ML with a high relative abundance of Synechococcus (0.3%), which was predominant in all surface and ML samples (0.3%–0.9%) but greatly reduced in DCM, aOM, OM, and deep samples (<0.05%). Sample EN56 also had the largest fraction of classified reads of any sample below the ML (13.7% classified vs 7.8%–10.3% in the DCM, 4.6%-7.6% aOM, ~5% OM and deep; Fig. 2F; File S2).
Gene sequence similarity by depth and by sample. Gene clustering results are partitioned into (A and D) genes shared by all samples in the category (shared genes), (B and E) genes shared by two or more samples but not all (flexible genes), and (C and F) genes found in only a single sample (station-specific genes) in that category. Panels A–C display gene clustering results at 95% nucleotide sequence identity (blue bars), 90% nucleotide sequence identity (gold bars), 70% amino acid sequence identity (green bars), and 40% amino acid sequence identity (red bars). Panels D–F display gene clustering results from 40% amino acid sequence identity, but for genes predicted from assemblies of down-sampled metagenomes at 25% subsample (blue bars), 50% subsample (gold bars), 75% subsample (green bars), and the full sample (red bars – same as A–C red bars). The various sample groupings are listed at the bottom along the x-axis with horizontal groupings by depth in the first six positions, followed by vertical groupings by sampling station and all stations. Note the larger y-axis values in C and F, indicating that the majority of assembled genes were station-specific. CDS were predicted with Prodigal from all assembled contigs for each sample and clustered with MMSeqs2.
Vibrio abundance was greatest at the surface of Station 5 (2.0%) and also the ML, aOM, and deep samples EN52, 54, 58, and 59 (0.1%–0.3%), and members of this genus were present but reduced in the remaining samples (0.03%–0.07%; Fig. 2F; File S2). Members of the Paenibacillus genus were consistently abundant community members across all samples (0.13%–0.25%). Similarly, members of the Nitrosopelagicus (Nitrososphaerota, formerly Thaumarchaeota) genus were also consistently present at low abundance across all samples (0.001%–0.003%) except in sample EN56. Members of the Nitrosopumulus (Nitrososphaerota) genus became abundant at and below the DCM (0.001%–0.005% at the surface to 0.12%–0.43% DCM and below). Relative abundance of Acinetobacter was highest at Station 8 surface to aOM and Station 2 ML (0.09%–0.28%) but maintained a presence in all samples (0.04%–0.08%) with lowest abundance in sample EN56. Thioglobus species were also consistently present in microbial communities across all samples (0.01%–0.26%) with greatest abundance in the aOM and OM of all three stations, lower abundance in the surface and ML, and the lowest abundance at the surface of Station 2 (sample EN21) near the coast and the Mississippi River outlet. Interestingly, the relative abundance of the bacteriophage Eurybiavirus was significantly higher in the surface to DCM samples (0.02%–0.50% vs. < 0.01% below the DCM, P < 0.05; File S1, Eurybiavirus tab includes details of the statistical comparisons performed) with the greatest relative abundance in the ML at Station 5. See Fig. 2 and File S2 for more extensive taxonomic details.
Functional gene content analysis
To assess sequence conservation and diversity at each station (vertically) or within each depth layer (horizontally) captured in our GoM metagenome samples, we clustered CDS predicted from assembled contigs at 90% and 95% nucleotide sequence identity and 40% and 70% amino acid sequence identity (Fig. 3A through C) for various sample groupings at each depth layer (surface, ML, DCM, aOM, OM, and deep), each station (2, 5, 8), and all stations (surface OM and all depths). We chose 40%, 70%, 90%, and 95% cutoffs because they are commonly used thresholds that roughly correspond to distinct taxonomic ranks (above genus, genus, and species) and provide a more and less conservative perspective of the data. If a gene cluster contained a CDS sequence from all samples in a grouping, it was counted as a shared gene; if it contained CDS sequences from two or more (but not all) samples in a grouping, it was counted as a flexible gene; and if it was a singleton (cluster of 1 CDS sequence only) or contained CDS sequences from only a single sample, it was counted as a station-specific gene. This analysis showed that up to 35% of genes (OM layer 90% nuc and 40% and 70% aa clusters) were shared horizontally across all three samples taken from the same depth layer across the different stations (mean ~24%) but only up to 4% of gene sequences were shared vertically across the top five depth layers taken from the same station (mean ~3%; Fig. 3A; File S3). The surface and deep layers showed the fewest horizontally shared gene sequences, likely due to samples EN21 and EN56 appearing as outliers in other analyses. The aOM layer, which contained sample EN84, was also an outlier and showed slightly fewer shared genes within this depth layer than the ML, DCM, or OM layers (Fig. 3A; File S3). Interestingly, this analysis highlighted that 67% of genes across all samples contained nucleotide CDS sequence that are station-specific at the 95% sequence identity level (Fig. 3C; File S3), revealing the vast sequence diversity the oceans harbor and that our sequencing efforts are far from discovering all of it. Indeed, even clustering CDS sequences at 40% amino acid sequence identity, 60% of genes were found to be unique to a specific sample (i.e., station-specific), and the greatest percentage of shared genes between any two samples was 56% (shared + flexible genes) from the 90% nucleotide sequence identity gene clusters (Fig. 3A and B; File S3).
To assess how much the sequencing effort or lack of complete coverage of sequencing diversity influenced the shared proportion of genes between samples, we subsampled each metagenome by randomly selecting without replacement 25%, 50%, and 75% of the total reads. We then assembled the subsampled metagenomes, predicted CDS, and clustered the genes at 40% amino acid identity within each subsampled set separately (Fig. 3D and F; File S3, Subsampling tab). This analysis showed that the greatest difference, which was found between the 25% subsampled and full set (100% sample), was only an 8.71% reduction in shared genes. This result suggested that the ratio of shared to specific genes scales with sequencing effort and that we should expect similar results to those reported above even if we sequenced more deeply (but perhaps not as deep as to covering >99% of the estimated sequence diversity of the samples, which nonetheless was estimated to require an average of 92 ± 31 Gbp to achieve based on Nonpareil analysis). This also aligned with our observations from sample EN56, which showed that station-specific gene sequence recovery also scales with sequencing effort.
Functional analysis based on COG categories and classes showed that the functional distribution of genes at broad category levels was similar (~1% variance) across all stations and depths except for the Metabolism 1 class and the E category (amino acid transport and metabolism), which had variance of 9.21% and 2% (Fig. S5; File S3, COG tab). The Mobile (X category) class genes had the least variance but also made up the smallest fraction of genes (2% or less). Samples EN21 and EN56 had the greatest proportion of hypothetical (“n/a” category), conserved hypothetical (S category – function unknown), and Mobile (X category) class genes, and the smallest proportion of Metabolism 1 (C, G, E, F, H, and I categories) and Metabolism 2 (Q and P categories) class genes out of all samples (Fig. S1; File S3, COG tab). Of the Metabolism genes, samples EN21 and EN56 had the smallest proportions of categories P (inorganic ion transport and metabolism), I (lipid transport and metabolism), E (amino acid transport and metabolism), and C (energy production and conversion) genes out of all samples. Finally, there were approximately 50,000 genes per sample on average (~22% of predicted CDS) that did not receive an annotation and were not represented in the COG analysis. Sample EN21 had the greatest proportion of unrepresented genes at 38%, which is 8% more than the second-ranking samples in unannotated gene counts (File S3, COG tab).
To probe deeper, we focused on gene sequences that were shared between all samples from the surface to the deep at Station 5 and that received a non-hypothetical functional annotation. We mapped reads from each metagenome to each individual gene sequence originating from that metagenome and calculated a normalized relative abundance (see methods) for each gene in each sample (Fig. 4, “Single Genes” as blue circles). Next, we summed the relative abundance of all genes assigned the same gene function for each sample (Fig. 4 “Summed Genes” as red circles). We computed correlations of summed genes against density, which was strongly correlated with temperature and depth (File S3, Gene Counts, Abundance, and Correlation tabs). The 7,508 shared gene sequences from Station 5 were assigned to 2,544 gene functions. Of these, 1,108 functions showed strong positive correlation (Pearson’s r ≥ 0.5) with density. An additional 339 functions showed strong negative correlation (Pearson’s r ≤ 0.5) with density, and the remaining 1,097 functions showed weak or no correlation (−0.5 < Pearson’s r < 0.5). This analysis showed 86 distinct dehydrogenase genes and 27 oxidoreductase genes that were positively correlated with depth (Pearson’s r ≥ 0.5; Fig. 4C and I). Likewise, it showed 21 distinct dehydrogenase genes and three oxidoreductase genes that were negatively correlated with depth (Pearson’s r ≤ −0.5; Fig. 4F). Additionally, there were 62 positively and 9 negatively correlated CoA-related genes, different 50s ribosomal proteins (Fig. 4A and D), and phosphofructokinases (Fig. 4B and E), as well as beta-lactamases and transposases that showed increased abundance in colder, deeper, water masses (Fig. 4G and H). Qualitatively, increased relative abundance of a function commonly coincided with increased genes of the function in the community (i.e., more genomes or species had distinct copies of the same function), but there were also examples of one or only a few gene sequences (or alleles) greatly increasing in abundance (i.e., particular community members carrying this gene became more abundant). We did not report further here on differentially present/absent gene sequences between the surface and the deep because relative abundance of these genes is below detection in some samples and includes over 430,000 gene sequences from Station 5 alone, where 304,072 gene sequences (69% of all CDS from station 5) were specific to a single sample (Fig. 3; File S3, Gene Similarity tab), many of which were of unknown/hypothetical function (Fig. S5; File S3, COG tab).
Selected gene functions that correlated with density at Station 5. Normalized relative abundance (y-axis) is plotted against the sample’s density (σt, x-axis) for select gene functions (A–I), showing values of all individual genes assigned the stated function (blue circles) and the sum of those genes (red circles) for each sample. The corresponding best fit linear trend line with 95% CI (shaded region), and correlation for each category (single genes or summed genes) is also plotted. Note that there were 9 samples collected at Station five from the surface (density = 23.58, temperature = 27.18°C, depth = 3 m, far left points on x-axis) to the deep (density = 27.73, temperature = 4.25°C, depth = 2,107 m, far right points on x-axis) and that depth and temperature correlated with density. For each gene, normalized relative abundance was calculated as TAD80/GEQ, where TAD80 is the 80% truncated average sequencing depth (20% trimmed mean), and GEQ is genome equivalents as reported by MicrobeCensus. Functional annotations for each gene sequence were propagated from the representative sequence of the 40% amino acid gene cluster it was assigned.
Description of rMAGs
In total, we recovered 209 rMAGs from our GoM metagenome samples, of which 154 were good quality with completeness ≥ 50% and contamination ≤ 10%. Of these, 145 shared < 95% AAI compared to MiGA’s NCBI Prok and Tara Oceans MAGs databases, indicating that they likely represent novel species. About half of these MAGs have a match of >95% ANI against the GTDB reference genomes (n = 72), revealing that related genomes have been recovered by previous studies but remain not-yet-named (File S4, rMAGs tabs). Interestingly, 168 of 209 rMAGs (80%) were assembled from only a single sample (File S4, Genomospecies tab), and 67 of 209 rMAGs (32%) were detectable (relative abundance > 0) in only a single sample (File S4, Detection tab). Furthermore, 59 out of the 67 rMAGs detected in a single sample were detected in sample EN56, and eight were detected in sample EN21. The remaining rMAGs exhibited seven distinct patterns in relative abundance across the sample set (Fig. 5). Some rMAGs were detectable in the surface and ML or only the DCM, whereas others were detected in the surface, ML, and DCM. Similarly, some rMAGs were detected in only the aOM and OM, whereas some were detected in the aOM, OM, and deep or the DCM, aOM, and OM. Only four rMAGs were detected in the deep only. No rMAGs were detectable in all depth layers or all samples, but a couple were detectable in both deep and shallow waters, revealing remarkable versatility across the water column. The latter MAGs included Alteromonas macleodii (97% AAI to the most closely related genome available), and Desulfuromonas sp. (40% AAI to Desulfuromonas soudanensis), which showed relatively high abundance at both the surface (0–200 m) and the deep (>1,000 m) (Fig. 5).
rMAG relative abundance in each sample. Relative abundance calculated as TAD80/GEQ. Genome equivalents (GEQ) computed with MicrobeCensus. Top 50 rMAGs with the greatest cumulative abundance are reported on the rows. The MAG number links to additional information in File S3. Taxonomy assignment is the closest matching genome in MIGA’s NCBI Prok database, and the percentage is the AAI of the rMAG to the closest match. Color and shape indicate a sample of rMAG origin. Columns are metagenome samples. Note that apart from sample EN21, samples cluster by depth, and that Fontibacterium commune is detectable, and quite abundant, only in sample EN56. Hierarchical clustering computed with the scipy.cluster.hierarchy.linkage() function from the SciPy package in Python with method=’average’ to calculate cluster distance using the unweighted pair group method with arithmetic mean (UPGMA) algorithm.
DISCUSSION
Collectively, our results show that a large proportion of microbial diversity in the GoM is unknown to the genomes and genes available in the databases, at least at the species level, and even the rMAGs assigned to known genera or families appear to represent predominantly novel species. Only a maximum of 25% of the microbial community was taxonomically classifiable to represent a known (named) species at the read level (Fig. 2F), further corroborating that the GoM water column communities are under-represented in cultured- and culture-independent databases. Therefore, sequencing efforts to characterize even environments that are thought to be well-characterized by now, such as the oceans, remain worthwhile toward cataloging the extant diversity on the planet. It might be the case that the GoM harbors disproportionately more genome diversity than other oceans due to not only its proximity to major human populations and activities but also its uncommon effects of oil seeps and major rivers such as the Mississippi River. The fact that the GoM metagenomes show a different alpha diversity than other previously characterized ocean basins, such as the North and South Atlantic and Pacific oceans (P < 0.05; File S1, Ocean Basin tab) is also consistent with this interpretation.
Another highlight of our analysis is the high number of gene sequences specific to only a single sample. While it is expected to find more shared genes across similar depths (horizontally) than between depths (vertically), the amount of vertically shared genes is rather low, and the proportion of station-specific gene sequences is correspondingly high. For example, our analysis showed that 67% of genes across all samples represent unique sequences at the 95% nucleotide identity level (Fig. 3C; File S3) and that the greatest percentage of shared genes between any two samples was only about 56% at this identity level (Fig. 3A and B; File S3). Our projection is that even with 5 or 10 times greater sequencing effort, the amount of station-specific gene diversity would decrease by less than 10%. Consistent with these results, the 340% increase in sequencing effort for sample EN56 yielded relative to other samples did not result in proportions of sequence recovery or shared gene sequence outside the range of other samples, which also suggested that station-specific gene sequences scale with sequencing effort, presumably due to increased recovery of sequence from the rare biosphere (File S1). These results somewhat echo the large genome-specific genes (e.g., >50% of total genes detected) revealed by the comparative analysis of genomes of isolates of the same species (36), revealing that the concept of the pangenome may apply equally well to whole communities, not only to the diversity within a species. Our results indicate, in addition, that the auxiliary genes constituting the majority of a species’ pangenome are largely not shared among pangenomes of different species in the GoM; otherwise, we would have observed relatively smaller (or more saturated) community pangenomes. Perhaps these results are attributable, at least in part, to the planktonic nature of marine microbes since the parameters of population mixing and migration are controlled by currents, where similar species occupying the same niche space drift apart in separate water parcels and diverge by neutral evolutionary processes. It is also possible that the influence of these currents, water masses, seeps, and river inputs might be more exaggerated by the unique characteristics of the GoM relative to other ocean basins.
It is also notable that the effect of the Mississippi River (or others) on the GoM could be so pronounced that sample EN56, 270 miles from the coast and 600 m deep, had a small but clearly detectable freshwater signal based on the relative abundance of microbial community members. This sample had a taxonomic profile distinct from the other deep-water (and all) samples with the top three most abundant species detected being Fontibacterium commune (4.1%), Candidatus Nanopelagicus limnes (0.8%), and Polynucleobacter acidiphobus (0.7%), all known to inhabit various freshwater or low-brackish conditions and not previously identified at higher salinity ([37–39)]; File S2, Species tab). The relative abundance of these species in other samples was much lower (0.02% ± 0.01% Fontibacterium, and < 0.001% others). The fourth most prominent member of the EN56 community represented a novel species related (at genus level) to Paenibacillus larvae, a honeybee parasite (40), but the relative abundance of this species was consistent with other samples (0.16% ± 0.03%). Sample EN56 was also the only sample below the ML with a high relative abundance of the coastal Synechococcus (0.3%), which was predominant in surface and ML samples (0.3%–0.9%) but reduced in DCM, aOM, OM, and deep samples (<0.05%), consistent with the hypothesis of a deep freshwater influence. Finally, the salinity measured for sample EN56 was in line with the other deep-water samples (~34.9 PSU; File S1), which is far from the low-brackish and freshwater conditions these species have been identified in previously. Of course, with peculiarities such as this, bottle leakage or some process of surface water mixture or sinking could also occur, but we did not note any evidence of bottle linkage, and the community signal from station EN56 was quite distinct from our surface samples. This signal was also not observed in any other GoM metagenome to indicate contamination during library preparation and/or sequencing (the samples were also filtered at sea, not ashore; hence, there is a low chance of contamination by freshwater). Thus, this signal is highly unlikely to be attributed to contamination. Far-reaching freshwater intrusion events have been observed previously based on physicochemical data (41, 42). To our knowledge, this is the first time that such intrusion events have been documented based on molecular data. Furthermore, it is important to note that the freshwater MAGs recruit a relatively small fraction of the total metagenome (about 7%). Thus, it is unlikely that these organisms reproduce (grow) in situ; rather, they were likely found at our sampling site as part of sinking organic material or represent relic DNA from a recent intrusion event. Therefore, these results could represent an interesting line of future investigation on the impacts that such freshwater intrusion events may have on autochthonous microbial assemblages and their diversity.
Another sample, EN21, approximately 50 miles from the Mississippi River outlet, exhibited a freshwater signal as well, although its signal was more evident in the physicochemical measurements than in the identification of freshwater-associated species. The surface at Station 2 showed lower salinity (29.6 vs. 36 PSU) and increased fluorescence (0.9 vs. 0.07 mg/m^3^) with a reduced DCM layer fluorescence peak (0.3 vs. 0.8 mg/m^3^), and the taxonomic profiles of the surface and ML were different, containing a greater proportion of unclassified reads and reduced relative abundance of Prochlorococcus ([Fig. 1B through D and 2F](#F1 F2)) compared to Stations 5 and 8. The surface sample at Station 2 also had a reduced relative abundance of Pelagibacter species. This finding reveals a brackish water community in the surface ocean surrounding the Mississippi River (and likely others) that is largely unknown in current databases and needs to be explored further in future research. However, since the prominent freshwater species detected in sample EN56 are not reflected in samples EN21 or 22, this brackish water community identified at Station 2 does not explain why we detected freshwater species so far offshore in salt water, and below the OM in sample EN56 at Station 5.
We also find it interesting that the MAGs from samples below the DCM seemed to bin easier or at least represent a greater proportion of the community, although the deep metagenomes harbored larger genomes and gene/sequence diversity. Two possible explanations for this are that the surface harbors species that have high intra-species sequence diversity, or such species make a larger fraction of the total community, such as Prochlorococcus and Pelagibacter, and that the surface and the deep communities harbor a similar number of relatively abundant (and thus recoverable by genome binning) taxa, although the deep communities also appear to have more rare taxa, as discussed above. Consistent with the former explanation, our recent work based on single-cell amplified genomes (SAGs) has revealed high intra-species diversity for deep-sea SAR11 genomospecies, for example, intra-species ANI ranging between 91% and 100% vs. 96% and 100% for most, well-sequenced species (35), confirming earlier observations by us (20) and others (43, 44) about the intra-species sequence diversity of these taxa. High intra-species diversity is not handled well by most, if not all, pieces of software for assembly and/or binning, for example, assemblers are tuned to merge sequences that are 97%–98% identical at the nucleotide identity or higher (45).
Several of our results also corroborated previous findings, but with interesting additions and/or expansions. For example, we observed an increase in genome size and the abundance of certain genes with depth as noted previously (21). However, we also detected a distinct signal in a sample located between the OM and the deep layer (sample EN56, mentioned above). Our data also suggest that genome size and other parameters correlate more strongly with density than with depth or temperature alone, indicating that water mass boundaries may play a greater role than pressure or temperature. Additionally, some gene functions that increase with depth are paralleled by similar but distinct functions that decrease with depth—for instance, aldehyde dehydrogenase versus alcohol dehydrogenase, or diphosphate-dependent phosphofructokinase versus 6-phosphofructokinase (Fig. 4C vs. F and B vs. E). Functional analysis based on COG categories and classes showed that the Metabolism 1 class and the E category (amino acid transport and metabolism) had substantial variance with depth. While similar reports of gene functions increasing in the deep have been reported previously (23–25), we find it interesting that many of these gene functions have similar alternate gene functions that are more abundant at the surface, which could indicate possible sequence adaptation to the unique physicochemical properties of the surface vs. deep waters and/or functional differentiation (Fig. 4). Another interesting observation is that increased abundance of a gene frequently corresponded to more distinct copies of those genes found in the community (i.e., being community-wide) versus an increase in the abundance of one or few species carrying that gene (or a specific allele, defined at the >95% nucleotide identity level). A few previous studies suggest that these gene differences could represent important physiological adaptations associated with ocean depth changes in temperature and pressure, such as differential presence of ribosomal proteins involved in catalyzing peptide bond formation (46, 47), or differences in lactate dehydrogenase function (48).
To conclude, there remains much to be learned about the ocean microbiome, and bioinformatics sequence analysis alone is not enough. The overwhelming number of unknown taxa and gene functions requires specific and dedicated efforts to uncover these functions; however, there is still much to be gained from continued sequencing efforts. Understanding genetic differences between species and between divergent populations of the same species requires data on population tracking, migration, and recombination. Sampling efforts should target distinct water masses and the boundaries between them as well as incorporate ocean current models into sampling schemes. Our work here provides a detailed quantitative analysis of microbial communities in the GoM water column sampled while tracking Trichodesmium blooms tracked by satellite imaging on the surface during a period without major oil inputs. It also uncovered some interesting signals to explore and compare against in future research endeavors.
MATERIALS AND METHODS
Sample collection, DNA extraction, and sequencing
Samples were collected from three depth profiles of the GoM on 29 May 2012 aboard the R/V Endeavor (cruise EN509; https://www.bco-dmo.org/dataset/4067) at Stations 2, 5, and 8, representing surface (Sample IDs: EN21, EN51, and EN81), mixed layer (Sample IDs: EN22, EN52, and EN82), deep chlorophyll maxima (Sample IDs: EN23, EN53, and EN83), above oxygen minimum depth (Sample IDs: EN24, EN54, and EN84), oxygen minimum depth (Sample IDs: EN25, EN55, and EN85), and bathypelagic water masses (EN56, EN57, EN58, and EN59; see File S1). Collections were made using 10.7L Niskin bottles on a rosette containing a conductivity–temperature–depth profiler (Sea-Bird SBE 911plus). Typically, two Niskin bottles were used for each sample.
Water samples (~20 L per sample) were pre-filtered through a nylon disk filter (47 mm, 1.6 μm porosity, Whatman, GE Healthcare Bio-sciences, Pittsburgh, PA, USA), and biomass was collected on a glass fiber disc filter (47 mm, 0.2 μm pore size, Vendor) via a peristaltic pump. Disc filters were immediately frozen after collection at −80°C, and DNA extractions were performed back in the laboratory.
DNA was extracted from the disc filters using an enzymatic lysis and phenol protocol as previously described (23). Briefly, the cells were lysed with the addition of lysozyme (1 mg/μL final concentration in 5 mL of lysis buffer per filter) and incubation for 30 min at 37°C. Addition of Proteinase K (1 mg/100 μL lysis buffer with 100 μL 20% SDS) was followed by a 2 h incubation at 55°C. DNA was extracted by phenol:chloroform extraction, followed by ethanol precipitation and wash. DNA was purified using the Ampure XP-Beads (Beckman Coulter). DNA sequencing was performed using a Nextera XT DNA Sample Prep kit, and a paired-end strategy on an Illumina MiSeq sequencer available at the Molecular Core of Georgia Tech.
Metagenome QC, assembly, and binning
Paired Illumina reads for each metagenome sample were trimmed and quality filtered with BBDuk with parameters “qtrim=w,3 trimq=17 minlength=70 tbo=true tossjunk=t cardinalityout=t.” The trimmed and filtered read set for each sample (TRIM) was also normalized (NORM) with BBNorm with parameters “target=30 min=5 prefilter=t.” BBDuk and BBNorm versions were part of the BBMap v38.93 release from 09/21/2021 (49). For each sample, both the TRIM and NORM read sets were assembled separately with IDBA-UD v1.1.3 with default parameters (50). Assembled contigs <1,000 base pairs in length were removed. For each sample, both the TRIM and NORM assemblies were binned separately with MaxBin v2.2.7 with default parameters and MetaBAT v2.12.1 with default parameters (51, 52). This generated four overlapping MAG sets for each metagenome sample (maxbin_trim, maxbin_norm, metabat_trim, metabat_norm), which were dereplicated using the “derep” workflow from MiGA v1.0.0 to select the highest quality MAG from each 95% ANI cluster within each sample (53). The dereplicated MAGs from each sample were then further dereplicated across all samples, creating the GoM representative MAG set (rMAGs). The quality and taxonomic classification of each rMAG was assessed using the “quality” and “classify” workflows from MiGA as well as the “lineage” workflow from CheckM v1.1.3 with default settings and the “classify” workflow from GTDB v1.7.0 with default settings (54, 55). Metagenome data provided in File S1 and rMAG data provided in File S4. The TRIM read sets and contig assemblies were used for all other downstream analyses.
Predicted CDS and functional annotation
Genes were predicted for all assembled contigs using Prodigal v2.6.3 with setting “-m meta” for metagenomes (56). Functional gene annotations for the relative in situ abundance and density correlation analysis were performed with MicrobeAnnotator v1.0.4 with default settings, Diamond v2.0.1, Kofam, Swissprot, Trembl, and RefSeq databases created February 10, 2021 (57, 58). Functional gene annotations for the high-level COG class and category analysis were performed using EggNog mapper v2.1.9 with default settings with the EggNOG 5 database (59, 60).
Metagenome sample diversity
Alpha diversity and sequence coverage were estimated with Nonpareil v3.401 with settings “-T alignment -f fastq -x 100,000” using only the forward read pair from each sample (61). Beta diversity analysis was performed with Simka v1.5.3 with default settings (62). Community taxonomic distribution was estimated with Kraken v2.1.2 and Bracken v2.5 with default settings (63, 64). Gene similarity between metagenome samples was assessed by gene clustering at 95% and 90% nucleotide sequence identity and 70% and 40% amino acid sequence identity using MMSeqs2 v13.45111 with settings “--cov-mode 1 -c 0.5 --cluster-mode 2 --cluster-reassign,” and “--min-seq-id” of 0.95, 0.90, 0.70, or 0.40, respectively (65). A gene was considered shared if a CDS from all samples in the grouping was found within a gene cluster, flexible if it was found in two or more samples in the grouping, and station-specific if it was found in only one sample in the grouping. Groupings are described in Fig. 3 and in the text.
Relative normalized abundance estimates
Average genome size and genome equivalents (GEQ) were estimated with MicrobeCensus v1.1.0, with the setting “-n 100,000,000” to use all reads (66). A separate database was used for assembled contigs from each sample to map reads for the functional gene normalized abundance estimates, and a single database was used containing all rMAGs to competitively map reads for rMAG relative normalized abundance estimates. Databases were created using the makeblastdb command from Magic-Blast, and reads were mapped to the databases with Magic-BLAST v1.5.0 with settings “-infmt fasta -paired -no_unaligned -splice F -outfmt tabular -parse_deflines T” (67). Read mapping results were filtered to retain only the single best hit for each read and to remove alignments with less than 90% match length (calculated as alignment length/read length) and any reads less than 70 base pairs in length. Relative normalized abundance was calculated as TAD80/GEQ, where TAD80 is the truncated average sequencing depth discarding 10% of base pair positions with the highest and lowest read mapping depth (retaining 80%) before computing the average depth (depth at each base pair position/length of positions) to remove any outlier positions that may skew the mean. TAD80, which is the sequencing depth, is normalized by GEQ.
Additional analysis and figures
Apart from the supplemental statistics presented in the Excel files, all other analyses and figures were generated using in-house Python scripts prepared for this study. Scripts were written for and executed with Python version 3.7.10+ and used the following packages: Matplotlib, Seaborn, Scipy, Statsmodels, Pandas, Numpy, and Seawater (68–75). All code has been deposited in the GitHub repository created for this manuscript.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Biller SJ, Berube PM, Dooley K, Williams M, Satinsky BM, Hackl T, Hogle SL, Coe A, Bergauer K, Bouman HA, Browning TJ, De Corte D, Hassler C, Hulston D, Jacquot JE, Maas EW, Reinthaler T, Sintes E, Yokokawa T, Chisholm SW. 2018. Marine microbial metagenomes sampled across space and time. Sci Data 5:180176. doi:10.1038/sdata.2018.17630179232 PMC 6122167 · doi ↗ · pubmed ↗
- 2Thompson LR, Sanders JG, Mc Donald D, Amir A, Ladau J, Locey KJ, Prill RJ, Tripathi A, Gibbons SM, Ackermann G, et al.. 2017. A communal catalogue reveals Earth’s multiscale microbial diversity. Earth Microbiome Project C 551:457–463. doi:10.1038/nature 24621 PMC 619267829088705 · doi ↗ · pubmed ↗
- 3Pedrós-Alió C, Manrubia S. 2016. The vast unknown microbial biosphere. Proc Natl Acad Sci USA 113:6585–6587. doi:10.1073/pnas.160610511327302946 PMC 4914187 · doi ↗ · pubmed ↗
- 4Abreu A, Bourgois E, Gristwood A, Troublé R, Acinas SG, Bork P, Boss E, Bowler C, Budinich M, Chaffron S, et al.. 2022. Priorities for ocean microbiome research. Nat Microbiol 7:937–947. doi:10.1038/s 41564-022-01145-535773399 · doi ↗ · pubmed ↗
- 5Logares R. 2024. Decoding populations in the ocean microbiome. Microbiome 12:67. doi:10.1186/s 40168-024-01778-038561814 PMC 10983722 · doi ↗ · pubmed ↗
- 6Cavicchioli R, Ripple WJ, Timmis KN, Azam F, Bakken LR, Baylis M, Behrenfeld MJ, Boetius A, Boyd PW, Classen AT, et al.. 2019. Scientists’ warning to humanity: microorganisms and climate change. Nat Rev Microbiol 17:569–586. doi:10.1038/s 41579-019-0222-531213707 PMC 7136171 · doi ↗ · pubmed ↗
- 7Ameri A. 2014. Marine microbial natural products. Jundishapur J Nat Pharm Prod 9:e 24716. doi:10.17795/jjnpp-2471625625055 PMC 4302407 · doi ↗ · pubmed ↗
- 8Fenical WB. 2020. Marine microbial natural products: the evolution of a new field of science. J Antibiot (Tokyo) 73:481–487. doi:10.1038/s 41429-020-0331-432713942 · doi ↗ · pubmed ↗
