Human xenobiotic metabolism proteins have full-length and split homologs in the gut microbiome

Matthew Rendina; Peter J Turnbaugh; Patrick H Bradley

PMC · DOI:10.1093/g3journal/jkaf131·June 7, 2025

Human xenobiotic metabolism proteins have full-length and split homologs in the gut microbiome

Matthew Rendina, Peter J Turnbaugh, Patrick H Bradley

PDF

Open Access

TL;DR

This paper shows that gut microbes have versions of human proteins involved in drug metabolism, including full-length and split forms, which may help explain how microbes influence drug effects.

Contribution

The study introduces a pipeline to detect both full-length and split homologs of human xenobiotic metabolism proteins in the gut microbiome.

Findings

01

Human xenobiotic metabolism proteins with full-length homologs in gut microbes include short-chain and aldo-keto reductases.

02

Split homologs in microbes are often involved in central metabolism of nucleobase-containing compounds.

03

Twelve drugs may be metabolized by gut microbial split homologs, including 6-mercaptopurine and 5-fluorouracil.

Abstract

Xenobiotics, including pharmaceutical drugs, can be metabolized by both host and microbiota, in some cases by homologous enzymes. We conducted a systematic search for all known human proteins with gut microbial homologs. Because gene fusion and fission can obscure homology detection, we built a pipeline to identify not only full-length homologs, but also cases where microbial homologs were split across multiple adjacent genes in the same neighborhood or operon (“split homologs”). We found that human proteins with full-length gut microbial homologs disproportionately participate in xenobiotic metabolism. While this included many different enzyme classes, short-chain and aldo-keto reductases were the most frequently detected, especially in prevalent gut microbes, while cytochrome P450 homologs were largely restricted to lower-prevalence facultative anaerobes. In contrast, human proteins…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes3

DPYD CYP4F3 XDH

Proteins3

Species3

gut metagenome(species)Mus musculus(house mouse · species)Homo sapiens(human · species)

Chemicals3

6-mercaptopurine

5-fluorouracil

nucleobase

Figures4

Click any figure to enlarge with its caption.

Fig. 1 — Search strategy. a) Diagram showing microbial gene and protein catalog. Human gut microbial genomes, represented as different shades of orange, are collected in UHGG. Genes (orange outlines, multiple shades) encode protein sequences (solid orange blocks above genes, multiple shades). These have been clustered at 90% amino acid identity (dashed lines) to form the UHGP-90 protein family catalog (solid orange blocks, single shade). To save computational time, we search using this UHGP-90 catalog of gut microbial protein families as a search set, then relate these families back to individual UHGG genomes to obtain information about gene order and proximity. b) BLASTP is used to identify cases where microbial UHGP-90 proteins jointly or individually align to human proteins. Proteins that jointly align and are encoded by nearby features on the same genome are termed “split homologs.”

Fig. 2 — Filtering criteria. Steps 1 and 2 involve finding human proteins (“best human hit”) that align to at least two-thirds of a microbial protein (“microbe coverage”). These steps are common to both pipelines (“all homologs”). For full-length homologs, alignments are then filtered based on whether they individually cover ≥70% of a human protein (“human coverage”). Finally, alignments more than 3 SDs above the mean amino acid identity (here, 55%) are excluded (“contamination”). For split homologs, microbial proteins aligning to a human protein must be encoded in the same genome (“same genome”), cover ≥70% of the human protein (“joint human coverage”), be at most 3 features apart (“feature distance”), and as before, have <55% amino acid identity (“contamination”). Note that in the “split homologs” section, step 3 is re-run after steps 4, 5, and 6. See Methods for details.

Fig. 3 — Gut microbial split homologs aligned to human proteins. Each panel shows alignments of sets of gut microbial sequences (see Table 2) to a single human protein. Each microbial gene is a line segment, colored by phylum and bounded by black dots. Microbial genes from the same genome are on the same y-axis position. Only microbial sequences that were found to be part of a neighborhood were retained, and only human proteins with more neighborhood than full-length orthologs are shown. Because both the number of microbial homologs per human protein and the lengths of the human protein sequences vary, coordinates have been transformed to between 0 and 1.

Fig. 4 — Phylogenetic distribution of full-length gut bacterial xenobiotic homologs. The inset is a midpoint-rooted tree of all bacteria included in UHGG v1.0. Numbered ring segments indicate selected bacterial families (see Tables 2 and 3). The first complete ring shows the phylum-level classification. Note that “Firmicutes” contains Bacilli, “Firmicutes_A” contains Clostridia, and “Firmicutes_C” contains Negativicutes. Successive ring tracks mark the species where full-length homologs were identified (colored lines). From inner to outer, these are: short-chain reductases (“sdr,” dark green); aldo-keto reductases (“aldo,” blue–green); aldehyde dehydrogenases (“aldh,” purple); the GDXG group of lipases (“gdxg,” brown); quinone oxidoreductases (“quin,” salmon); type B carboxylesterases (“ester,” orange), arylamine and arylacetamide metabolism (“aryl,” dark gray); flavin-containing mono-oxygenases (“flavin,” red), glutathione-S-transferases (“gst,” violet), cytochromes (“cyto,” light blue), and UDP-glucuronosyl-transferases (“udp,” dark brown). Ring tracks are plotted in order of Faith's phylogenetic diversity (PD), descending from inside to outside. Numbered families: 1. Lachnospiraceae, 2. Bacteroidaceae, 3. Oscillospiraceae, 4. Acutalibacteriaceae, 5. Ruminococcaceae, 6. Enterobacteriaceae, 7. Burkholderiaceae, 8. Lactobacillaceae, 9. Mycobacteriaceae, and 10. Paenibacillaceae.

Tables4

Table 1.. Table showing all human proteins with more split than full-length homologs (nSplit, nFull) in gut bacteria.

Entry	Entry name	Description	nFull	nSplit
P54886	P5CS_HUMAN	Delta-1-pyrroline-5-carboxylate synthase	27	293
P22102	PUR2_HUMAN	Trifunctional purine biosynthetic protein adenosine-3	0	200
P22234	PUR6_HUMAN	Bifunctional phosphoribosylaminoimidazole carboxylase/phosphoribosylaminoimidazole succinocarboxamide synthetase	0	145
P27708	PYR1_HUMAN	Multifunctional protein CAD	0	120
Q9NZB8	MOCS1_HUMAN	Molybdenum cofactor biosynthesis protein 1	0	75
P11586	C1TC_HUMAN	C-1-tetrahydrofolate synthase, cytoplasmic	0	64
O14841	OPLA_HUMAN	5-oxoprolinase	7	54
O95479	G6PE_HUMAN	GDH/6PGL endoplasmic bifunctional protein	7	47
Q6UB35	C1TM_HUMAN	Monofunctional C1-tetrahydrofolate synthase, mitochondrial	4	47
Q06278	AOXA_HUMAN	Aldehyde oxidase	3	46
P47989	XDH_HUMAN	XDH/oxidase	0	38
O95340	PAPS2_HUMAN	Bifunctional 3′-phosphoadenosine 5'-phosphosulfate synthase 2	0	28
O43252	PAPS1_HUMAN	Bifunctional 3'-phosphoadenosine 5'-phosphosulfate synthase 1	0	27
Q9NQX3	GEPH_HUMAN	Gephyrin	24	26
Q12882	DPYD_HUMAN	Dihydropyrimidine dehydrogenase	0	24
P11172	UMPS_HUMAN	Uridine 5'-monophosphate synthase	0	22
Q709F0	ACD11_HUMAN	Acyl-CoA dehydrogenase family member 11	0	12
Q86Z14	KLOTB_HUMAN	Beta-klotho	0	11
Q7Z3D6	GLUCM_HUMAN	D-glutamate cyclase, mitochondrial	0	6
Q9H9Y6	RPA2_HUMAN	DNA-directed RNA polymerase I subunit RPA2	2	5
Q9UEF7	KLOT_HUMAN	Klotho	0	5
P51659	DHB4_HUMAN	Peroxisomal multifunctional enzyme type 2	0	1
Q8IYQ7	THNS1_HUMAN	Threonine synthase-like 1	0	1

Table 2.. Number of unique species per family in UHGG (rows) where at least 1 homolog in a particular xenobiotic enzyme class (columns) was detected.

Phylum	Family	sdr	aldo	aldh	gdxg	quin	ester	aryl	flavin	gst	cyto	udp
Firmicutes_A	Lachnospiraceae	192	178	170	129	35	105	27	4	0	1	0
Bacteroidota	Bacteroidaceae	116	91	26	18	26	67	4	0	0	0	0
Proteobacteria	Enterobacteriaceae	108	105	108	63	107	55	62	8	107	1	19
Firmicutes	Lactobacillaceae	59	59	49	11	48	3	0	0	0	0	0
Firmicutes_A	Ruminococcaceae	54	47	34	26	10	21	5	0	0	0	0
Firmicutes_A	Acutalibacteraceae	45	35	31	21	18	14	2	0	0	0	0
Firmicutes_A	Oscillospiraceae	43	36	37	18	11	17	11	1	1	0	1
Firmicutes_A	Clostridiaceae	31	22	30	12	5	2	0	8	0	3	1
Campylobacterota	Campylobacteraceae	25	6	21	1	0	1	0	0	0	6	0
Proteobacteria	Burkholderiaceae	22	16	22	9	13	12	2	7	21	4	0
Firmicutes	Enterococcaceae	20	20	18	12	20	5	3	1	1	0	0
Actinobacteriota	Mycobacteriaceae	19	20	20	10	20	17	2	9	0	2	0
Firmicutes_I	Paenibacillaceae	19	19	18	12	19	12	6	2	0	7	3
Firmicutes	Staphylococcaceae	14	14	14	3	14	10	14	1	0	0	3
Firmicutes	Bacillaceae_A	11	11	11	4	10	3	4	4	0	3	3
Proteobacteria	Moraxellaceae	11	9	11	9	11	0	1	8	11	0	0
Bacteroidota	Tannerellaceae	11	11	10	2	7	5	2	0	0	0	4
Proteobacteria	Pseudomonadaceae	10	9	10	6	10	1	2	7	10	1	3
Firmicutes	Amphibacillaceae	7	7	7	0	7	1	3	1	0	6	0
Firmicutes	Bacillaceae	4	4	4	0	4	4	2	4	0	4	4
Firmicutes	Bacillaceae_G	4	4	4	3	4	0	4	4	0	3	4
Proteobacteria	Xanthobacteraceae	3	3	3	3	3	1	2	3	3	3	1

Table 3.. Number of unique bacterial homologs (i.e. distinct UHGP-90 protein families) of proteins in a particular xenobiotic class (columns) detected per bacterial family in UHGG (rows).

Phylum	Family	sdr	aldo	aldh	gdxg	quin	ester	aryl	flavin	gst	cyto	udp
Firmicutes_A	Lachnospiraceae	3,672	863	558	216	54	247	32	4	0	1	0
Bacteroidota	Bacteroidaceae	1,961	475	64	22	28	168	5	0	0	0	0
Proteobacteria	Enterobacteriaceae	1,503	167	595	57	259	42	40	7	157	1	15
Firmicutes	Lactobacillaceae	569	301	75	12	176	3	0	0	0	0	0
Firmicutes_A	Ruminococcaceae	1,345	233	103	63	20	81	9	0	0	0	0
Firmicutes_A	Acutalibacteraceae	1,120	270	122	67	42	80	8	0	0	0	0
Firmicutes_A	Oscillospiraceae	1,571	255	165	48	26	52	18	1	1	0	1
Firmicutes_A	Clostridiaceae	326	58	91	12	6	2	0	12	0	2	1
Campylobacterota	Campylobacteraceae	87	16	12	1	0	1	0	0	0	4	0
Proteobacteria	Burkholderiaceae	687	179	231	17	59	42	3	8	115	5	0
Firmicutes	Enterococcaceae	218	75	33	11	74	4	2	1	1	0	0
Actinobacteriota	Mycobacteriaceae	259	35	118	19	52	19	2	11	0	22	0
Firmicutes_I	Paenibacillaceae	589	66	64	19	69	15	6	2	0	10	3
Firmicutes	Staphylococcaceae	169	40	68	3	27	9	14	1	0	0	2
Firmicutes	Bacillaceae_A	257	31	110	5	40	3	4	4	0	6	3
Proteobacteria	Moraxellaceae	200	8	99	18	31	0	1	13	27	0	0
Bacteroidota	Tannerellaceae	519	117	30	3	8	21	1	0	0	0	1
Proteobacteria	Pseudomonadaceae	356	22	165	9	74	1	2	16	58	1	3
Firmicutes	Amphibacillaceae	202	38	70	0	33	1	3	1	0	8	0
Firmicutes	Bacillaceae	79	7	20	0	6	4	3	4	0	6	3
Firmicutes	Bacillaceae_G	37	3	11	4	22	0	6	4	0	2	3
Proteobacteria	Xanthobacteraceae	225	11	44	4	35	1	3	10	62	9	1

Table 4.. Table showing all identified drug-metabolizing enzymes with more partial (nPart) than full-length (nFull) gut microbial homologs, together with the drug(s) they metabolize (From).

Enzyme	Description	nPart	nFull	From
AOX1	Aldehyde oxidase	46	3	Crizotinib, allopurinol, ziprasidone, vortioxetine, aciclovir, pyrazinamide, capmatinib, nicotine iminium ion, thioguanine
XDH	Xanthine dehydrogenase/oxidase	38	0	Doxorubicin, 1-methylxanthine, theophylline, allopurinol, pyrazinoic acid, pyrazinamide, mercaptopurine, thioxanthine
DPYD	Dihydropyrimidine dehydrogenase	26	24	Fluorouracil
UMPS	Uridine 5'-monophosphate synthase	22	0	Fluorouracil

Funding1

—National Institutes of Health10.13039/100000002

Keywords

microbiomexenobioticsmetagenomicsgenomicsgutpharmaceuticalhomologybioinformaticspipelinebacteria

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGut microbiota and health · Diet and metabolism studies · Metabolism and Genetic Disorders

Full text

Introduction

Hundreds of small molecules, including drugs, can be metabolized by both human cells and also the trillions of microorganisms that colonize the gastrointestinal tract (the gut microbiota) (Koppel et al. 2018; Spanogiannopoulos et al. 2022). In some cases (e.g. digoxin), drug metabolism by the microbiome can contribute to observed differences in pharmacodynamics across patients. When drugs have a narrow therapeutic window, microbial metabolism can be especially relevant, as even small differences in concentration can lead to large changes in toxicity or efficacy (Dobkin et al. 1983). However, cases of microbial drug metabolism can be difficult to identify and are time- and labor-intensive to characterize; for example, metabolism of digoxin by gut microbes was first reported in 1981 (Lindenbaum et al. 1981), but the gene responsible was not identified until 2013 (Haiser et al. 2013). This has led to interest in using bioinformatic methods that use chemical similarity together with homology to find gut microbial proteins that may metabolize a given small molecule, such as GutBug (Malwe et al. 2023), MicrobeFDT (Guthrie et al. 2019), and SIMMER (Bustion et al. 2023).

Drug-metabolizing proteins may be part of general “xenobiotic” systems that transform or detoxify natural products, often with broad specificity. In humans, these systems include cytochrome P450 proteins and glutathione-S-transferases. Many drugs are derived from natural products that could be encountered in the environment. These natural products typically exhibit high structural diversity, both because they are ammunition in “arms races” between competitors, and because of other constraints on natural product enzyme evolution (Noda-Garcia and Tawfik 2020). We therefore might expect to find xenobiotic metabolic genes with less specific substrate requirements in both hosts and microbiome. In contrast, other drug-metabolizing proteins have primary roles in central metabolism. These proteins typically have narrower specificity and metabolize drugs that are structurally similar to their natural substrates, regardless of whether they are found in nature, such as nucleoside analogs. Since many central metabolic proteins are evolutionarily ancient, one might also expect to find cases of direct homology between host and microbiome drug-metabolizing proteins. This may be especially true for drugs like chemotherapeutic or immunomodulatory antimetabolites, as these target conserved parts of metabolism.

Most approaches to detecting microbe-host homology have focused on single genes. However, horizontal transfer, multidomain protein architectures, and gene fusions can complicate the detection of such homologs (Stolzer et al. 2015; Darby et al. 2017; Méheust et al. 2018). Fusion and transfer have often been observed together, as in the case of a Drosophila ananassae innate immune gene that ultimately derived from two bacteriophage toxin genes (Tarnopol et al. 2025), or a carotenoid biosynthetic gene that was fused and transferred into nonphotosynthetic protists (Rius et al. 2023). Indeed, a recent systematic study found that out of 33 bacterial operons that were transferred into specific eukaryotic lineages, 20 appeared as fusions in the recipient (Kogay et al. 2025). Furthermore, early eukaryotes also appear to contain a set of nearly 300 “symbiogenetic” or “S-genes,” which are fusions of prokaryotic components: of these, S-genes with bacterial components were especially enriched for metabolic processes (Méheust et al. 2018). Gene fusions in eukaryotes may be common because they ensure coexpression in the absence of multicistronic operons, and possibly also because they prevent metabolic intermediates, which may be toxic, from diffusing away in the eukaryote's larger cell volume (Méheust et al. 2018; Kogay et al. 2025).

Taken together, this implies that gut microbial genomes may contain direct homologs to human drug metabolism genes. Some of these may be part of more general systems, while others may be central metabolic genes that happen to also metabolize designed substrate analogs. While individual cases have been identified, the full extent of such homology has remained unknown, as did whether particular systems or enzymes are particularly likely to be shared across hosts and microbes. Furthermore, in some cases, these microbial homologs may be encoded by multiple adjacent open reading frames; such “split” homologs would be missed by a one-to-one homology search. Leveraging recently published collections of metagenome-assembled genomes (MAGs) from the human gut microbiome (Almeida et al. 2021), we therefore aimed to comprehensively identify gut microbial proteins that are either full-length or “split” human homologs, then determine, based on curated human annotations, which of these were most likely to participate in xenobiotic metabolism, and which specific roles those proteins are most likely to play.

Materials and methods

Identification of split and full-length homologs: overview

Our approach for identifying gut microbial homologs is described in Fig. 1. Briefly, we conducted a BLASTP homology search between gut microbial protein families and the human reference proteome. Our gut microbial sequences came from the Universal Human Gut Proteome database (UHGP) (Almeida et al. 2021), which contains predicted protein sequences from >200 K isolate and MAGs. Because UHGP contains a very large number of nonidentical protein sequences (>170 M), we used a derivative of UHGP clustered at 90% amino acid identity (UHGP-90), which retains most of the sequence diversity at less than one-tenth the size (14 M protein families). Our human proteome consisted of all 20.6 K reviewed human proteins downloaded from UniProt on 2023/9/13; because we wanted to compare our results against multiple databases, we did not limit our initial search to only known drug metabolism genes.

Search strategy. a) Diagram showing microbial gene and protein catalog. Human gut microbial genomes, represented as different shades of orange, are collected in UHGG. Genes (orange outlines, multiple shades) encode protein sequences (solid orange blocks above genes, multiple shades). These have been clustered at 90% amino acid identity (dashed lines) to form the UHGP-90 protein family catalog (solid orange blocks, single shade). To save computational time, we search using this UHGP-90 catalog of gut microbial protein families as a search set, then relate these families back to individual UHGG genomes to obtain information about gene order and proximity. b) BLASTP is used to identify cases where microbial UHGP-90 proteins jointly or individually align to human proteins. Proteins that jointly align and are encoded by nearby features on the same genome are termed “split homologs.”

We then identified cases where a human protein aligned to a gut microbial protein along ≥67% of its length. To find full-length homologs, we kept the best microbial match per genome that also aligned to ≥70% of the human protein. To find split homologs, we identified sets of gut microbial proteins that were jointly, but not individually, homologous to ≥70% of the human protein, and that were also encoded by adjacent or near-adjacent genes on the same strand of the same gut microbial assembly (see Figs. 1 and 2, and Methods). We picked these cutoffs based on previous studies that systematically identified orthologs [for example, the O’Brien et al. INPARANOID algorithm uses ≥50% (Remm et al. 2001), the Halachev et al. (2011) analysis of bacterial and archaeal orthologs uses ≥70%, and the Rubin et al. (2000)) analysis of model eukaryotes uses ≥80%]. In addition, we performed a sensitivity analysis where we varied these coverage thresholds: we tested microbial protein cutoffs of 50%, 67%, and 75%, and human protein cutoffs of 60%, 70%, and 80%, for a total of 9 parameter sets. For the pathway enrichment and xenobiotic class analysis, we used the results from the most (50%/60%) and least stringent (75%/80%) cutoffs, in addition to our default (67%/70%).

Filtering criteria. Steps 1 and 2 involve finding human proteins (“best human hit”) that align to at least two-thirds of a microbial protein (“microbe coverage”). These steps are common to both pipelines (“all homologs”). For full-length homologs, alignments are then filtered based on whether they individually cover ≥70% of a human protein (“human coverage”). Finally, alignments more than 3 SDs above the mean amino acid identity (here, 55%) are excluded (“contamination”). For split homologs, microbial proteins aligning to a human protein must be encoded in the same genome (“same genome”), cover ≥70% of the human protein (“joint human coverage”), be at most 3 features apart (“feature distance”), and as before, have <55% amino acid identity (“contamination”). Note that in the “split homologs” section, step 3 is re-run after steps 4, 5, and 6. See Methods for details.

Identification of split and full-length homologs: pipeline details

Our initial BLASTP (Altschul et al. 1990) search of all 13.9 M proteins in the UHGP-90 database against our reference human proteome from UniProt yielded 8.5 M potential matches. We then performed the following filtering steps:

Best human hit: for each UHGP-90 protein, retain only the human protein with the highest bitscore;Microbe coverage: retain only alignments covering at least 67% of the prokaryotic UHGP-90 protein sequence; additionally, retain only UHGP-90 sequences that are at least 80 amino acids long.

The pipeline diverged after this point for full-length and split homologs. For full-length homologs, we were interested in individual microbial proteins where an alignment covered most of the human protein, and where this alignment was unlikely to be due to contamination in the microbial genomes. We therefore performed the following filtering steps:

F3. Human coverage: each alignment must cover at least 70% of the human sequence;F4. Contamination: filter out any alignments whose amino acid percent ID was more than 3 SDs above the mean (mean: 32.3%; SD: 7.6%; cutoff: 55.2%).

For split homologs, we sought to determine which UHGP-90 families were encoded by neighboring features in the same genome, and where this was unlikely to be the result of contamination, as above. Because multiple genomes could encode the same UHGP-90 family, we had to first expand our results, then filter, as follows:

S3. Same genome: first, determine which individual UHGG genomes encoded multiple UHGP-90 families aligning to each individual human protein. Then, retain only those alignments, repeated for each genome that encoded the UHGP-90 protein;S4. Joint human coverage: for each UHGG genome and for each human protein, determine whether the alignments between the human protein and the UHGP-90 proteins from that genome could jointly, but not individually, cover at least 70% of the human sequence;S5. Feature distance: for each UHGG genome and for each human protein, compute the minimum distance (in feature numbers) between each UHGP-90 protein aligning to that human protein. Retain only sets of alignments with at least 2 different UHGP-90 proteins that are 3 or fewer features apart, and that are additionally all on the same strand and contig. Then, repeat the human coverage step to ensure that the remaining alignments still jointly cover ≥70% of the human protein, as some have been removed;S6. Contamination: filter out any alignments whose amino acid percent ID was more than 3 SDs above the mean observed for all full-length homologs, as above (mean: 32.3%; cutoff: 55.2%); repeat the human coverage step again and report results.

Filtering and analysis steps were carried out in a Snakemake (Mölder et al. 2021) pipeline, using Pandas (The Pandas Development Team 2020), Polars (Vink et al. 2023), and R with Tidyverse (Wickham et al. 2019; R Core Team 2024).

As mentioned above, we also performed a sensitivity analysis to examine the impact of varying the microbe coverage and human coverage parameters. We set microbe coverage to 50%, 67%, or 75%, and set human coverage to 60%, 70%, or 80%, for a total of 9 parameter combinations. The results are given in Supplementary Tables 1 and 2.

Enrichment analysis

Gene Ontology (GO) annotations (Ashburner et al. 2000; Aleksander et al. 2023) downloaded from UniProt on 2024/10/24 (The UniProt Consortium 2021) were used to determine subcellular localization (“cellular component”) and function (“biological process”) for human proteins. Proteins that were no longer present in the 2024/10/24 version of the database were dropped (this applied to only 4 proteins, which were all pseudogene products). Proteins whose cellular component annotations matched the regular expression “[Mm]itochondr” were retained as mitochondrially localized. Enrichment analysis was carried out using TopGO (Alexa and Rahnenfuhrer 2024) using Fisher's exact test on GO biological process terms, with the resulting p-values corrected for multiple testing using the Benjamini–Hochberg method (Hochberg and Benjamini 1995).

We also repeated the analysis with 2 different subsets of GO biological process annotations based on their evidence codes, downloaded from GO on 2025/03/26 (Carbon and Mungall 2025). The first subset (“noIEA”) excluded all annotations with the evidence code “IEA” (Inferred from Electronic Annotation), which are not manually reviewed; the second subset (“expOnly”) included only annotations with the following evidence codes, where a nonhigh-throughput experiment directly implicated a gene in a function:

“EXP”: Inferred from Experiment“IDA”: Inferred from Direct Assay“IPI”: Inferred from Physical Interaction“IMP”: Inferred from Mutant Phenotype“IGI”: Inferred from Genetic Interaction“IEP”: Inferred from Expression Pattern

The corrected P-values for the unrestricted (“all”), “noIEA,” and “expOnly” sets are listed in Supplementary Tables 3 and 4. We also examined how varying the microbe and human coverage thresholds affected these results, with the full results in Supplementary Tables 5 and 6.

Analyzing subclasses of xenobiotic enzyme families

Xenobiotic enzyme families were defined using protein family annotations in UniProt, using regular expression matches for the following:

Aldo-keto reductases (“akr”): “Aldo/keto reductase family”;UDP-glucuronosyltransferases (“udp”): “UDP-glycosyltransferase family” (note: all human members of this family except cerebroside synthase were annotated as UDP-glucuronosyltransferases);Glutathione S-transferases (“gst”): “GST superfamily” (note: this included the alpha, zeta, sigma, pi, mu, theta, omega, and kappa families);Arylamine N-acetyltransferases (“aryl”): “Arylamine N-acetyltransferase family”;GDXG-like hydrolases (“gdxg”): “'GDXG” lipolytic enzyme family”;Cytochrome P450 s (“cyto”): “Cytochrome P450 family”;Type B carboxylesterases (“ester”): “Type-B carboxylesterase/lipase family”;Flavin mono-oxygenases (“flavin”): “Flavin monoamine oxidase family|FMO family” (note: this included the FIG1 subfamily);Short-chain reductases (“sdr”): “Short-chain dehydrogenases/reductases (SDR)”;Quinone oxidoreductases (“quin”): “Quinone oxidoreductase subfamily”.

Full-length homologs were partitioned into one of the above classes. Next, for each class, the phylogenetic diversity of species containing at least 1 full-length homolog was calculated using Faith's PD (Faith 1992). The phylogeny used was the maximum-likelihood tree of the 4,616 species in UHGG (Almeida et al. 2021) generated via IQ-TREE (Nguyen et al. 2015), which we midpoint-rooted using APE (Paradis et al. 2004). Faith's PD was calculated using Picante (Kembel et al. 2010). Xenobiotic classes were visualized in descending order of PD. Results were visualized using the R package ggtree (Yu et al. 2017).

Identification of gut homologs of drug-metabolizing enzymes

Pathway, gene, relationship, and chemical annotations were downloaded from PharmGKB (2024/10/12) (Whirl-Carrillo et al. 2021). HUGO Gene Nomenclature Committee (HGNC) identifiers (Seal et al. 2023) in PharmGKB were mapped, using data downloaded from HGNC (2024/10/01), to UniProt IDs. Chemicals of interest in PharmGKB were defined as having the chemical classes “Drug,” “Drug Class,” “Prodrug,” or “Metabolite” (this refers to drug metabolites, not endogenous substrates or “Biological Intermediates,” which were excluded). This was necessary because certain endogenous human metabolites were annotated in pathways, but only peripherally related to drug metabolism, e.g. homocysteine in the methotrexate metabolism pathway (present because methotrexate targets folate biosynthesis, which in turn is linked to homocysteine via the S-adenosyl-methionine cycle). Reactions from all PharmGKB pathways were then filtered such that the reactant and products were different, the reaction type was not “Transport,” the “Controller” (typically an enzyme or regulator) was known, and either the reactant or product (or both) was a chemical of interest as defined above.

Identification of other xenobiotic enzyme types in PharmGKB

In addition to the xenobiotic enzyme classes we defined above, we also considered 2 other classes of enzymes:

Nucleobase metabolism genes: genes annotated to the GO term “nucleobase-containing compound metabolic process (GO: 0006139),” or any term under it, but excluding genes annotated to the following GO terms or any terms below them as these tended to contain mostly enzymes that use nucleobases as electron or energy carriers instead of being involved in, e.g. nucleobase biosynthesis or salvage:

○“Acyl-CoA metabolic process (GO: 0006637)”;
○“Coenzyme A metabolic process (GO: 0015936)”;
○“FMN metabolic process (GO: 0046444)”;
○“FAD metabolic process (GO: 0046443)”;
○“Pyridine nucleotide metabolic process (GO: 0019362).” Oxidoreductases: genes annotated to the GO term “oxidoreductase activity (GO: 0016491).”

Identification of DesE homologs

The study that showed DesE from Clostridium bolteae metabolizes the ketone group of pharmaceutical steroids (Qian et al. 2022) listed its GenBank (Sayers et al. 2021) protein accession as EDP16280.1. To determine whether this protein was represented in our full-length homologs, we first retrieved this accession and used it to perform an MMSeqs2 (Steinegger and Söding 2017) search in “easy-search” mode against all UHGP-90 protein sequences. The best-hit protein (GUT_GENOME228173_01934, 97.7% identity) was then used to filter our list of full-length homologs. GUT_GENOME228173_01934 was found to be a full-length homolog of the human protein PECR (UniProt ID Q9BY49), and was distributed in 6 Lachnospiraceae species; these included Clostridium_M bolteae, where it was discovered, and 2 other species in the genus Clostridium_M.

Results

Thousands of human proteins have either full-length or split homologs in the gut microbiome

Overall, we found that homology between the human and gut microbial proteomes was not rare: 12.5% of human proteins (2,569) had at least 1 gut microbial homolog. Furthermore, while the majority had full-length homologs, a sizable minority (407) had at least 1 split homolog. In fact, 23 human proteins had more split than full-length homologs, and 16 had no full-length homologs at all (Fig. 3 and Table 1), meaning that they could not have been found by a conventional one-to-one homology search. These numbers, and the relative ratio of full-length to split homologs, are in line with previous estimates of eukaryotic gene families that likely descended from fusions of prokaryotic proteins or domains. Méheust et al. (2018) identified 282 such families associated with eukaryogenesis, 19 of which were both widely distributed in eukaryotes and also “operon-like” in bacteria, in that they appeared in an annotated operon in at least 1 bacterial genome. Kogay et al. (2025), using a different strategy, found 309 single genes and 49 operons that had likely been horizontally transferred into eukaryotic lineages.

Gut microbial split homologs aligned to human proteins. Each panel shows alignments of sets of gut microbial sequences (see Table 2) to a single human protein. Each microbial gene is a line segment, colored by phylum and bounded by black dots. Microbial genes from the same genome are on the same y-axis position. Only microbial sequences that were found to be part of a neighborhood were retained, and only human proteins with more neighborhood than full-length orthologs are shown. Because both the number of microbial homologs per human protein and the lengths of the human protein sequences vary, coordinates have been transformed to between 0 and 1.

A sensitivity analysis also showed that our results were robust to the main adjustable parameters, namely, the alignment coverage thresholds for the microbial and human proteins. Considering the proteins with mostly full-length homologs, a majority would have been found using all 9 tested parameter sets (2,011, or 78%); almost all were found with at least 6 (2,523, or 98%) (Supplementary Table 2). An additional 694 proteins would have been found using at least 1 other parameter set, but none of these were found in more than 3. Similarly, all 23 of the proteins with more split than full-length homologs had the same pattern in at least 6 out of 9 combinations; only 6 additional proteins passed this test in at least 1 parameter set, and none in more than 3 out of 9 (Supplementary Table 1).

Human proteins with full-length vs split homologs differ in function and subcellular localization

After identifying human proteins with full-length and split homologs, we used GO enrichment (Ashburner et al. 2000; Aleksander et al. 2023) to ask whether these 2 groups could be differentiated in localization and function (Supplementary Tables 3 and 4). Considering proteins with mostly full-length homologs, we observed that many of the enriched pathways were mitochondrial, such as “tricarboxylic acid cycle” ( $[eqn]$ ), “fatty acid beta-oxidation” ( $[eqn]$ ), and “carnitine metabolic process” ( $[eqn]$ ). Because the eukaryotic mitochondrion descends from a bacterial ancestor, we might expect human proteins with gut bacterial homologs to localize to the mitochondrion. Indeed, human proteins with full-length homologs were much more likely to localize to the mitochondrion (odds ratio 4.8, 95% CI [4.3, 5.5], $[eqn]$ , Fisher's exact test). Further, this enrichment increased the more frequently the full-length homologs were detected (Supplementary Fig. 1). This set of enrichments aligns strongly with previous work that identified a set of nuclear gene families present in the last eukaryotic common ancestor that had mainly Alphaproteobacterial origins, mitochondrial localization, and roles in energy production (Pittis and Gabaldón 2016).

Remarkably, the most-enriched term among proteins with full-length homologs was “xenobiotic metabolic process” ( $[eqn]$ ). There was an equally strong enrichment when considering only nonmitochondrial genes. Further, the enrichment was not driven by a single enzyme family. We observed homologs of short-chain and aldo-keto reductases, carboxylesterases, arylamine N-acetyltransferases and arylacetamide deacetylases, glutathione-S-transferases, flavin mono-oxygenases, UDP-glucuronosyl transferases and cytochrome P450 family members, among others. “Xenobiotic metabolic process” remained the absolute strongest enrichment when we excluded unreviewed or nonexperimental evidence codes (Supplementary Table 3) and even when alignment coverage parameters were relaxed or made more stringent (Supplementary Table 5).

We next compared these results with the proteins that had mainly split homologs. In contrast, these were not at all enriched for the “xenobiotic metabolic process” GO term ( $[eqn]$ ), but rather for a smaller number of central pathways, namely purine, pyrimidine, and cofactor (folate and molybdopterin) metabolism ( $[eqn]$ ) (Supplementary Table 4). Proteins with split microbial homologs in these pathways included multifunctional, multidomain proteins like GART, which carries out 3 steps in purine biosynthesis, and CAD, which performs the first 3 steps of de novo pyrimidine biosynthesis. Excluding unreviewed or nonexperimental annotations slightly changed some of the specific terms that were significant (e.g. “pyrimidine nucleoside monophosphate biosynthetic process” vs “pyrimidine nucleoside diphosphate biosynthetic process”), but purine, pyrimidine, and cofactor metabolic processes all remained enriched (Supplementary Table 4). Our sensitivity analysis further showed that purine and pyrimidine pathway enrichments were consistently found across all alignment coverage thresholds tested, and folate and molybdopterin metabolic processes were enriched in 2 out of 3 (Supplementary Table 6). Full-length homologs were also significantly enriched for some of these pathways (Supplementary Table 3), but less so than the general “xenobiotic metabolism” term, indicating that proteins with split homologs are a more functionally specific group. Proteins in these central pathways, however, still make important contributors to drug metabolism. Nucleoside and folate analogs, in particular, are common antiviral, antibiotic, and chemotherapeutic agents.

Further, when we examined the subcellular distribution of human proteins with mainly split homologs, the fraction localizing to the mitochondria was more modest, and did not differ significantly from the base rate (odds ratio 2.0, 95% CI [0.23, 8.8], $[eqn]$ ). If anything, the proteins with the most split homologs were the least likely to be mitochondrial (Supplementary Fig. 1). Proteins with split homologs therefore appear to participate in different biological processes (cytosolic, primarily central metabolism) than full-length homologs (mitochondrial, energy production, both xenobiotic and central metabolism).

Reductases and hydrolases dramatically outnumber cytochromes and UDP-glucuronosyltransferases in gut microbes

The above analysis indicates the presence of full-length homologs of xenobiotic metabolism enzymes in gut microbes. However, it does not tell us about their phylogenetic distribution, which is important because gut microbial clades vary in their prevalence and average abundance across orders of magnitude (Abdill et al. 2025). We therefore identified eleven enzyme families with at least some members known to participate in human xenobiotic metabolism, then determined which gut microbial species contained homologs of these families (Fig. 4 and Table 2), as well as how many distinct bacterial proteins were identified (Table 3).

Phylogenetic distribution of full-length gut bacterial xenobiotic homologs. The inset is a midpoint-rooted tree of all bacteria included in UHGG v1.0. Numbered ring segments indicate selected bacterial families (see Tables 2 and 3). The first complete ring shows the phylum-level classification. Note that “Firmicutes” contains Bacilli, “Firmicutes_A” contains Clostridia, and “Firmicutes_C” contains Negativicutes. Successive ring tracks mark the species where full-length homologs were identified (colored lines). From inner to outer, these are: short-chain reductases (“sdr,” dark green); aldo-keto reductases (“aldo,” blue–green); aldehyde dehydrogenases (“aldh,” purple); the GDXG group of lipases (“gdxg,” brown); quinone oxidoreductases (“quin,” salmon); type B carboxylesterases (“ester,” orange), arylamine and arylacetamide metabolism (“aryl,” dark gray); flavin-containing mono-oxygenases (“flavin,” red), glutathione-S-transferases (“gst,” violet), cytochromes (“cyto,” light blue), and UDP-glucuronosyl-transferases (“udp,” dark brown). Ring tracks are plotted in order of Faith's phylogenetic diversity (PD), descending from inside to outside. Numbered families: 1. Lachnospiraceae, 2. Bacteroidaceae, 3. Oscillospiraceae, 4. Acutalibacteriaceae, 5. Ruminococcaceae, 6. Enterobacteriaceae, 7. Burkholderiaceae, 8. Lactobacillaceae, 9. Mycobacteriaceae, and 10. Paenibacillaceae.

While cytochrome P450s are one of the most important and well-studied xenobiotic detoxification systems in humans, we saw relatively few homologs in gut microbes, finding them only in 1.4% of species. Using the most relaxed thresholds for alignments only raised this to 1.6%, suggesting that this is not a detection issue. Cytochrome P450 homologs were also mainly restricted to facultative anaerobes, which makes sense given the oxygen-dependent mechanisms of these proteins. Furthermore, the species with the most homologs were in low-prevalence families like the Paenibacillaceae (Table 2), which in a recent large meta-analysis was only found in 1.6% of human gut microbiome samples (Abdill et al. 2025). This suggests that while cytochrome P450 homologs can be found, they may be less likely to be relevant to the adult human gut, with the important caveat that gut oxygenation can also vary across development and in disease (Zong et al. 2024). Flavin-dependent mono-oxygenases and UDP-glucuronosyltransferases had similarly sparse distributions and were mostly found in lower-abundance gut microbes.

Two enzyme types had intermediate distributions. First, glutathione S-transferase homologs were much more commonly detected than cytochrome P450s, but were almost exclusively found in Proteobacterial facultative anaerobes, like the Enterobacteriaceae and Burkholderiaceae. However, while these are typically low-abundance, Enterobacteriaceae are prevalent, and Proteobacteria can rise to high levels in certain individuals and situations, making them major contributors to functional variability (Bradley and Pollard 2017). This suggests that gut microbial GST activity might also be especially variable across individuals. Second, arylamine acetylases were most observed in facultative anaerobes (Enterobacteriaceae and Staphylococcaceae), yet were also detected in certain Lachnospiraceae, the most prevalent gut microbial family worldwide (Abdill et al. 2025). Substrates for these genes include the antihypertensive vasodilator hydralazine (Spinasse et al. 2014) and the antitubercular isoniazid (Butcher et al. 2002). Interestingly, it has been previously shown that isoniazid is also metabolized by strains of M. tuberculosis by an arylamine acetylase homologous to human NAT2 (Payton et al. 1999). Finally, arylamine acetylases are also responsible for both increasing and decreasing the carcinogenicity of certain environmental pollutants, suggesting that gut microbes could also modulate these risks (Hein et al. 1993).

In contrast, we detected thousands of short-chain and aldo-keto reductases in common gut microbes, like Lachnospiraceae, Enterobacteriaceae, and Bacteroidaceae (Table 3). In humans, both classes of enzymes act on a wide range of substrates; notably, certain members can participate in the reduction of steroid-like and polycyclic molecules, including bile acid intermediates (Kavanagh et al. 2008; Penning et al. 2019). Gut microbes are known for their ability to transform primary bile acids into secondary bile acids, and this metabolism has well-studied consequences for immune and metabolic signaling in the host (Collins et al. 2023). Additionally, the Lachnospiraceae member C. bolteae was recently found to directly metabolize the steroids nabumetone, hydrocortisone, and tacrolimus via the gene desE (Qian et al. 2022). We detected that the UHGP-90 protein with the best hit to desE, GUT_GENOME228173_01934, which is found in several Lachnospiraceae (Supplementary Table 7), appeared to be a full-length homolog of the human protein PECR, a trans-2-enoyl-CoA reductase that is a member of the SDR family. The prominence of reductases in the most common gut microbes aligns with previous observations that reduction reactions are especially common ways for gut microbes to transform xenobiotics, potentially because of the need for alternative electron acceptors in the absence of molecular oxygen (Spanogiannopoulos et al. 2016; Koppel et al. 2017).

Homologs of 2 other redox-active enzyme classes, aldehyde dehydrogenases and quinone oxidoreductases, were also observed frequently in Lachnospiraceae, but the highest number of distinct bacterial homologs was found in facultative anaerobes like Enterobacteriaceae, Lactobacillaceae, or Burkholderiaceae. Enzymes in these families, of course, play roles in both central and xenobiotic metabolism, complicating their interpretation. For example, aldehyde dehydrogenase oxidizes acetaldehyde to acetate (or the reverse, in microbial ethanol production). However, aldehyde dehydrogenase enzymes can have a variety of other substrates (e.g. lactaldehyde; Chen et al. 1987) and are also involved in the detoxification of drugs like cyclophosphamide (Emadi et al. 2009).

Finally, homologs of type B carboxylesterases and the “GDXG” group of lipases (which include hormone-sensitive lipases, arylacetamide deacetylases, and neutral cholesterol esterases) were also found frequently, especially in Lachnospiraceae and, in the former case, Bacteroidaceae. In humans, in addition to deactivating drugs like flutamide (Kobayashi et al. 2012) and indiplon (Shimizu et al. 2014), these hydrolases bioactivate a large number of prodrugs, including enalapril (Thomsen et al. 2014) and irinotecan (Xu et al. 2002). One caveat is that while alignment coverage thresholds had little impact on the detection of most enzyme classes or their predicted phylogenetic distribution, GDXG lipases, along with UDP-glucuronosyl transferases, were both detected in substantially fewer species at the most stringent alignment coverage threshold we tested (Supplementary Table 8).

Overall, this analysis shows that while several systems used by humans to detoxify pharmaceutical, dietary, and environmental compounds do have at least some analog in the gut microbiome, certain enzyme families are much better represented in the most prevalent gut microbes. Specifically, these include redox-active enzymes, especially short-chain and aldo-keto reductases, and hydrolases, including lipases and carboxylesterases.

Identifying split and full-length homologs of specific drug-metabolizing genes

In many cases, we know the specific substrates on which drug-metabolizing enzymes act. We therefore used the database PharmGKB (Whirl-Carillo et al. 2012, 2021) to identify cases where human proteins with gut homologs were known to be involved in the metabolism of either a pharmaceutical drug or one of its downstream metabolites.

Out of 154 proteins in PharmGKB with reviewed entries in UniProt, we found that a large majority (126/154, 82%) had at least 1 full-length or split homolog. 97% of these (122/126) had more full-length than split gut homologs; this set of proteins metabolized 215 drugs in total (Supplementary Table 9). Consistent with the sparse distribution we observed above, cytochromes were the most under-represented category in this list, with only 8 genes found to have gut microbial homologs compared to 22 in PharmGKB as a whole. In contrast, 12 out of 13 drug-metabolizing UDP-glucuronosyl-transferases and all 9 aldo-keto reductases were found to have full-length gut homologs.

Interestingly, despite the strong enrichment we observed for xenobiotic metabolism among proteins with full-length homologs, <50% of these proteins (56/122) fell into 1 of the 10 classes listed above. While many different types of enzymes were represented among the remainder, a plurality of 36 were annotated in GO as metabolizing nucleobase-containing compounds. This is consistent with the observation that this process was enriched among both full-length and split homologs. Furthermore, nucleobase-containing analogs are some of the most common human chemotherapeutic, immunomodulatory, and antiviral drugs, and their metabolism is also well-studied, as variants that affect their metabolism have large consequences for health. Finally, of the remaining 30 genes, 20 were annotated in GO as oxidoreductases, further underscoring the importance of redox-active genes in gut microbial metabolism.

When we instead kept only cases with more split than full-length homologs, we found 4 genes involved in the metabolism of 12 drugs (Table 4). Again, 3 of these genes were involved in nucleotide metabolism, and many of these drugs were antimetabolite chemotherapeutics such as thioguanine, doxorubicin, and mercaptopurine. We noted that 2 out of 4 genes metabolized 5-fluorouracil (dihydropyrimidine dehydrogenase, or DPYD; and uridine monophosphate synthase, or UMPS). These results indicate that both full-length and split gut homologs may play roles in the microbial transformation of nucleoside analogs.

The genes DPYD and xanthine dehydrogenase (XDH) had among the most split homologs. In the case of XDH, human gut microbes have been shown to catabolize purines such as hypoxanthine, a typical substrate of human XDH, leading to alterations in the host purine pool (Kasahara et al. 2023). In addition, at least some microbial XDHs can also perform reduction reactions (Rabinowitz and Barker 1956), allowing them to convert urate to xanthine. Urate degradation by C. sporogenes is abolished in an XDH deletion, and other XDH-containing gut microbes have been shown in labeling experiments to reduce uric acid to xanthine (Liu et al. 2023). This potential conservation of function supports a possible role for XDH homologs in the metabolism of purine analog drugs, such as azathioprine (AZA), a chemotherapeutic and immunosuppressive drug that is given orally. Indeed, it has recently been shown in an in vivo preclinical model that Blautia wexlerae weakens the therapeutic effect of AZA, likely by metabolizing its active metabolite, 6-mercaptopurine (6MP) into the inactive form, 6-TX; furthermore, this metabolism can be interrupted with the XDH inhibitor allopurinol (Yan et al. 2023).

Recent publications also support that both the human DPYD protein and its bacterial counterparts, PreT and PreA (encoded by the preTA operon) can inactivate the chemotherapeutic drug 5-fluorouracil (5-FU) in vivo. While 5-FU itself is not given orally, the orally available prodrug, capecitabine, can also be activated to 5-FU by host liver enzymes as well as select gut bacterial strains (Spanogiannopoulos et al. 2022). Indeed, in a mouse model of colorectal cancer treatment with 5-FU, mice monocolonized with a preTA knockout strain of E. coli had better survival than those monocolonized with a preTA overexpression strain (Spanogiannopoulos et al. 2022).

Discussion

We conducted a systematic survey of homology between the human and gut microbial proteomes. This analysis included both full-length and “split” homologs. We found that around 1 in 10 human proteins (2.6K) had at least some homolog in the gut microbial proteome, and that 23 human proteins had primarily split homologs.

Among human proteins with full-length homologs, xenobiotic metabolism was the most enriched process, and many different xenobiotic enzyme classes were found in gut genomes. However, the most predominant systems in humans (cytochrome P450s, glutathione-S-transferases) were relatively rare in the gut microbiome. Instead, reductases and hydrolases were the most common, especially among the most prevalent microorganisms. This is consistent with previous observations about types of drug metabolism engaged in by the gut microbiome (Spanogiannopoulos et al. 2016; Koppel et al. 2017), and builds on these observations by enumerating specific classes of enzymes that are likely to contribute. Proteins involved in central metabolism, especially of nucleoside-containing compounds, were also commonly found as both full-length and split homologs in the gut microbiome. Finally, in 2 cases (6MP and 5-FU), the gut microbial split homologs we identified have been shown to metabolize pharmaceutical nucleoside analogs in mouse models (Spanogiannopoulos et al. 2022; Yan et al. 2023).

One limitation of this work is that we have only considered the gut microbiome. This community was our focus because microbial biomass is highest in the gut (Sender et al. 2016), because orally ingested drugs are absorbed in the intestine (Price and Patel 2024), and because the gut is closely connected to the liver, the primary site of drug metabolism in humans (Hsu and Schnabl 2023). However, split orthologs in skin microbes may also be relevant for topically applied drugs, and similarly for oral microbes and drugs delivered as rinses.

While our focus was on drug and xenobiotic metabolism, our results include matches to the entire canonical human proteome. Furthermore, the approach we used to generate a map of host-microbial homology, especially one that includes split homologs, is novel. Our code and results may therefore be helpful to microbiome researchers more broadly, and with this in mind, our code and results are both available as publicly available resources (see Data Availability). The pipeline we developed, written using Snakemake (Mölder et al. 2021), requires the following inputs: (1) protein sequences from the host proteome, (2) protein family sequences from the microbiome, (3) a mapping of microbial gene IDs to protein family IDs, (4) genome feature files for each microbial genome, (5) taxonomic classifications for each genome in the microbiome, and (6) EggNOG (Huerta-Cepas et al. 2019) annotations for microbiome protein families (obtained using eggNOG-mapper; Cantalapiedra et al. 2021). By default, the pipeline downloads the consensus human proteome from UniProt (The UniProt Consortium 2021), and downloads the human gut microbial genome and proteome catalogs from MGnify Genomes, with proteins clustered at 90% identity [2]. However, the pipeline could readily be adapted for an alternative microbiome and/or host: for example, the MGnify Genomes (Gurbich et al. 2023) database includes oral and vaginal human microbiome catalogs in the same format as UHGG/UHGP, as well as gut microbiome catalogs for zebrafish and honeybee.

A technical limitation of this work is that sequencing and annotation errors can give rise to in silico, artefactual gene “fusions” or “fissions.” We believe that the way that UHGP-90 protein clusters were constructed would favor such “fusions.” UHGP-90 protein clusters were constructed using MMSeqs2's “linclust” algorithm (Steinegger and Söding 2017) in target-coverage mode, meaning that the representative sequence for a cluster must cover 80% of each member sequence, but not necessarily vice versa. This has the advantage that protein fragments or artefactual “fissions” would seldom be chosen as representative sequences, but also means artefactual “fusions” would be chosen more often. Since we use the representative sequences in this pipeline, this effect would therefore bias us away from detecting split homologs.

Of course, it is important to emphasize that homologs may differ in substrate specificity. This is especially true over long evolutionary distances (e.g. between humans and microorganisms) and for enzymes whose substrate specificity is broad (e.g. many xenobiotic metabolism genes). Follow-up experiments would therefore be necessary to establish whether specific substrates are shared between host and microbial homologs. Advances in computational structural biology, such as improvements to high-throughput ligand docking tools (Corso et al. 2023; Abramson et al. 2024; Krishna et al. 2024), may also help prioritize homologs that could contribute to parallel drug metabolism between host and microbiome. We speculate that interactions between central metabolic enzymes and substrate analogs, such as the chemotherapeutics 6MP and 5-FU, may be especially likely to translate: these enzymes are more evolutionarily constrained than broad-spectrum xenobiotic enzymes (Noda-Garcia and Tawfik 2020), and the corresponding drugs bind in the active site, which is typically highly conserved.

While we have focused on proteins that metabolize drugs, the protein targets of drugs could also be conserved, potentially causing off-target effects on the microbiome. Such unintended effects of pharmaceuticals on gut microorganisms are not rare: a study of more than 1,000 marketed drugs found that nearly a quarter inhibited the growth of at least 1 of 40 representative gut isolates (Maier et al. 2018). As above, we would expect host-microbiome homology to be especially relevant when considering proteins targeted by substrate analogs. Indeed, a study of the chemotherapeutic 5-FU showed that it had large effects on gut microbial growth (Spanogiannopoulos et al. 2022), and antimetabolites as a class were also enriched for antimicrobial effects in the study above (Maier et al. 2018).

Having identified strong candidates for homologs of human genes in gut microbes, a final remaining question concerns their evolutionary history. Which of these homologs are most likely descendants of ancestral sequences present in the last universal ancestor, and which might be better explained by horizontal gene transfer? Transfers from bacteria into early eukaryotes (Méheust et al. 2018), especially from the ancestors of modern organelles (Timmis et al. 2004), as well as transfers from modern eukaryotes into bacteria (Mondino et al. 2020), could lead to both full-length and split homology. One route forward would be to construct phylogenetic trees for the homologous proteins we identify here, and then compare these to the corresponding species trees. Of course, ancient events are intrinsically difficult to resolve. However, recent studies (e.g. Coleman et al. 2021) suggest that the gain in resolution afforded by the post-sequencing wealth of microbial genome data, combined with continued advances in phylogenetic methods (e.g. algorithms that improve gene- to species-tree reconciliation for multidomain genes (Stolzer et al. 2015), which may be particularly relevant for split homologs) may allow us to distinguish between different potential histories.

Supplementary Material

jkaf131_Supplementary_Data

Bibliography80

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Abdill RJ, Graham SP, Rubinetti V, Ahmadian M, Hicks P, Chetty A, Mc Donald D, Ferretti P, Gibbons E, Rossi M, et al 2025. Integration of 168,000 samples reveals global patterns of the human gut microbiome. Cell. 188(4):1100–1118.e 17. doi:10.1016/j.cell.2024.12.017.39848248 PMC 11848717 · doi ↗ · pubmed ↗
2Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, Ronneberger O, Willmore L, Ballard AJ, Bambrick J, et al 2024. Accurate structure prediction of biomolecular interactions with Alpha Fold 3. Nature. 630(8016):493–500. doi:10.1038/s 41586-024-07487-w.38718835 PMC 11168924 · doi ↗ · pubmed ↗
3Aleksander SA, Balhoff J, Carbon S, Cherry JM, Drabkin HJ, Ebert D, Feuermann M, Gaudet P, Harris NL, Hill DP, et al 2023. The gene ontology knowledgebase in 2023. Genetics. 224(1):iyad 031. doi:10.1093/genetics/iyad 031.36866529 PMC 10158837 · doi ↗ · pubmed ↗
4Alexa A, Rahnenfuhrer J. 2024. top GO: Enrichment analysis for Gene Ontology. Version 2.54.0. doi:10.18129/B 9.bioc.top GO. · doi ↗
5Almeida A, Nayfach S, Boland M, Strozzi F, Beracochea M, Shi ZJ, Pollard KS, Sakharova E, Parks DH, Hugenholtz P, et al 2021. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat Biotechnol. 39(1):105–114. doi:10.1038/s 41587-020-0603-3.32690973 PMC 7801254 · doi ↗ · pubmed ↗
6Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol. 215(3):403–410. doi:10.1016/S 0022-2836(05)80360-2.2231712 · doi ↗ · pubmed ↗
7Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al 2000. Gene ontology: tool for the unification of biology. Nat Genet. 25(1):25–29. doi:10.1038/75556.10802651 PMC 3037419 · doi ↗ · pubmed ↗
8Bradley PH, Pollard KS. 2017. Proteobacteria explain significant functional variability in the human gut microbiome. Microbiome. 5(1):36. doi:10.1186/s 40168-017-0244-z.28330508 PMC 5363007 · doi ↗ · pubmed ↗