ASMC: investigating the amino acid diversity of enzyme active sites

Thomas Bailly; Eddy Elisée; David Vallenet

PMC · DOI:10.1093/bioinformatics/btaf307·May 15, 2025

ASMC: investigating the amino acid diversity of enzyme active sites

Thomas Bailly, Eddy Elisée, David Vallenet

PDF

Open Access

TL;DR

This paper introduces an updated version of the ASMC workflow, which improves enzyme active site analysis for better clustering and easier use.

Contribution

The paper presents a redesigned ASMC workflow with improved methods for predicting enzyme active site clusters.

Findings

01

The new ASMC version includes updated pocket prediction and clustering methods for better results.

02

The workflow now uses a single programming language (Python) for easier maintenance and installation.

03

Evaluation on protein families showed improved performance compared to the original version.

Abstract

The analysis of enzyme active sites is essential for understanding their activity in terms of catalyzed reaction and substrate specificity, providing insights for engineering to obtain targeted properties or modify the substrate scope. In 2010, a first version of the Active Site Modeling and Clustering (ASMC) workflow was published. ASMC predicts isofunctional clusters from enzyme families, based on structural modeling and clustering of active sites. Since then, structure- and sequence-based methods have developed considerably. We present here a redesign of the ASMC workflow. This new major version includes recent pocket prediction, structural alignment and clustering methods, as well as a refined amino acid distance matrix, thereby improving the relevance of results and reducing the need for laborious manual analysis to obtain relevant clusters. In addition, we have implemented…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Chemicals1

amino acid

Figures1

Click any figure to enlarge with its caption.

Schematic view of the updated ASMC workflow. Each colored frame represents a way to run ASMC: (A) whether the active site(s) is/are unknown or (B) known, (C) whether the 3D models are already available, (D) whether users need to re-cluster or sub-cluster a set of aligned active sites, or (E) whether the input is a multiple sequence alignment (2D-based approach). Software names are written in bold.

Funding1

—French National Research Agency10.13039/501100001665

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMicrobial Metabolic Engineering and Bioproduction · Protein Structure and Dynamics · Enzyme Structure and Function

Full text

1 Introduction

Identifying the residues involved in the reaction mechanism or substrate specificity is crucial for the study of enzyme families, as it helps define isofunctional subfamilies and provides insights for engineering enzymes to achieve targeted properties or modify their substrate scope. Over the years, various computational methods have been developed for the functionally relevant clustering of proteins, dealing with either hidden Markov models [SCI-PHY (Brown et al. 2007); GeMMA (Lee et al. 2010); FunFHMMer (Das et al. 2015)], phylogeny [AutoPhy (Ortiz-Velez et al. 2024)], active site profiles [DASP (Fetrow 2006); MISST (Harper et al. 2017)], genetic algorithms (de Lima et al. 2016), or 3D structures [TuLIP (Knutson et al. 2017)]. In turn, the original Active Site Modeling and Clustering (ASMC) method (de Melo-Minardi et al. 2010) relies on homology modeling and active site classification to identify specificity determining positions and determine isofunctional groups. Over the years, ASMC has been used by our team and collaborators to decipher the BKACE (Bastard et al. 2014), MetA/MetX (Bastard et al. 2017), CSL dioxygenase (Bastard et al. 2018), and AmDH (Mayol et al. 2019) protein families. We recently published a refined picture of the latter family (Elisée et al. 2024) in which we used a different reference structure (CfusAmDH instead of AmDH4) for the ASMC analysis, which gave a better definition of the active site. Despite being available only upon request from the authors, ASMC has been cited approximately fifty times. Therefore, we believe that an open-source version would greatly benefit the scientific community. In addition, with advancements in the development of structure- and sequence-based methods, several components of the ASMC workflow could be improved. Furthermore, the hierarchical clustering method used by ASMC [COBWEB (Fisher 1987) within WEKA software (Holmes, Donkin, and Witten)] does not consider amino acid properties. This oversight can result in an excessive number of predicted subfamilies, necessitating manual investigation to ultimately obtain relevant clusters where active sites with chemically equivalent residues are grouped.

We present here a redesign of ASMC. In this new major version, we have updated the workflow by (i) unifying the code from three to one programming language (Python) to facilitate its installation and maintenance, (ii) implementing recent methods about pocket prediction, structural alignment and clustering with a refined amino acid distance matrix, (iii) making homology modeling with several reference structure or accepting external models provided by the user, (iv) using multiple sequence alignment (MSA) input for the clustering, and other features detailed hereafter. We have evaluated ASMC on protein families mentioned above, resulting in overall better performances compared to its original version. Finally, ASMC is now freely available as an open-source software.

2 Methods and implementation

ASMC is a precision tool to classify an enzyme family highlighting the diversity of amino acid residues making up the active site.

2.1 Architecture

The ASMC workflow has been improved for each of its four constituent steps described hereafter.

3D modeling: for each target sequence of the family, one reference structure is used as a template to build homology models through MODELLER (Webb and Sali 2016)—if several reference structures are provided, the one with the best sequence identity is selected. This value should be at least 30% for a target sequence to be modeled (adjustable threshold). Prior to 3D modeling, a pairwise structural alignment between the target and its reference is generated using the SALIGN module (Madhusudhan et al. 2009). To reduce computation time, two models are built for each target and the one with the lowest DOPE score is selected. This step can be skipped if the user already has a set of 3D structures, e.g. taken from AI-based databases (Lin et al. 2023, Varadi et al. 2024), though care must be taken to ensure pocket accuracy for reliable active site identification and comparison.

Ligand-binding pocket search: based on carefully chosen reference structure(s) (preferentially holo than apo form), the potential active site pocket is detected using P2RANK (Krivák and Hoksza 2018). It uses machine learning to rapidly compute a ligandability score for pockets, defined from points distributed along the solvent-accessible protein surface. The user can then select one of the predicted pockets as reference (using P2RANK manually) or let the algorithm choose the one with the highest score.

Structural alignment: each homology model is aligned on its reference structure using US-align (Zhang et al. 2022). The resulting 3D sequence alignment is then used to retrieve the residue positions defined for the corresponding reference pocket. The same protocol is used for the 2D-based approach when an MSA is given as input.

Clustering: all active site sequences undergo an all-vs-all comparison to generate a score matrix, using a purpose-designed distance matrix for comparing active site residues (see Note 1, available as supplementary data at Bioinformatics online). The score matrix, scaled between 0 and 1, is then used by DBSCAN (Ester et al. 1996) to perform density-based spatial clustering guided by two parameters: eps, a distance measure to identify neighboring points, and min_samples, the minimum number of points for a region to be considered dense. The user can either let the algorithm define these values (see Note 2, available as supplementary data at Bioinformatics online) or play with their own.

2.2 Implementation

ASMC is Python-based (version ≥3.8), currently supported on Linux (tested on Fedora 40, CentOS 7, Ubuntu 24), and distributed on the GitHub platform as an open-source software. It requires four other easy-to-install bioinformatics tools (MODELLER, P2RANK, US-align) and some Python libraries: plotnineseqsuite (Cao et al. 2023), biopython (version ≥1.81), numpy, scikit-learn, pyyaml, pillow. Moreover, a Docker image is available to ease the installation.

The ASMC workflow can be launched in various ways depending on the user's needs (Fig. 1). Sequences and structures must be in FASTA and PDB format, respectively. Additional scripts are also available to aid in the analysis of the results.

Schematic view of the updated ASMC workflow. Each colored frame represents a way to run ASMC: (A) whether the active site(s) is/are unknown or (B) known, (C) whether the 3D models are already available, (D) whether users need to re-cluster or sub-cluster a set of aligned active sites, or (E) whether the input is a multiple sequence alignment (2D-based approach). Software names are written in bold.

2.3 Benchmarking

Protein families previously analyzed using the original ASMC software were used to assess the relevance of the updated workflow's results, in terms of cluster composition and sequence logos. These datasets comprise the BKACE (725 sequences), MetA (2661 sequences), MetX (4277 sequences), and AmDH (9886 sequences) families.

3 Results

The updated version of ASMC automatically generates clusters consistent with those published and for which manual adjustments had been necessary (see Figs 1–4, Table 1, and Note 3, available as supplementary data at Bioinformatics online). This improved performance is primarily attributed to the distance matrix we developed, which enhances clustering of active site residues by more effectively accounting for their physicochemical similarities and differences.

ASMC was also able to extract active site sequences from the “non-active” groups of BKACE, AmDHs, and MetX (Figs 1–3, available as supplementary data at Bioinformatics online), which is noteworthy given the presence of residues that are also involved in the active groups. Remarkably, ASMC succeeded in identifying the G3 group of the AmDH family (∼50 sequences among 9k+), which previously had to be manually curated. Finally, MetA clusters were sufficiently homogeneous to be separated by ASMC under specific residue positions (Fig. 4, available as supplementary data at Bioinformatics online).

This new version of ASMC offers a concise overview of the amino acid diversity of active sites by greatly reducing the number of predicted clusters while providing, unlike its original version, a re-clustering option to further subdivide and explore groups of interest. This was performed, e.g. for the largest groups in the BKACE and MetX families and revealed sub-clusters similar to those previously published, as well as new clusters with a refined amino acid profile (Figs 5 and 6, available as supplementary data at Bioinformatics online).

Experience has shown that ASMC clusters are more accurate when family entries share >30% of identity with reference structure protein sequences. Below this value, analysis should be carried out with caution and we recommend cross-checking with a MSA-based workflow to compare the results.

In terms of performance, the ASMC speed is limited by the homology modeling step, which logically depends on the number of target sequences. Reducing the number of models per target sequence from fifty to two (the default option value) had no impact on the overall results for the protein families studied, thus saving time.

4 Conclusions

ASMC has been updated to classify homologous enzyme sequences based on 2D or 3D active site residues, and to facilitate comparison of both approaches in order to evaluate the relevance of the active site profiles. Its use has been simplified and reduced to a single command line, offering numerous options for running ASMC according to the user’s needs. We believe this user-friendly version will be useful for quickly inspecting enzyme active sites and, e.g. getting clues about residues that might be critical regarding substrate specificity, with the support of docking analysis.

Supplementary Material

btaf307_Supplementary_Data

Bibliography23

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Bastard K , Isabet T, Stura EA et al Structural studies based on two lysine dioxygenases with distinct regioselectivity brings insights into enzyme specificity within the clavaminate synthase-like family. Sci Rep 2018;8:16587.30410048 10.1038/s 41598-018-34795-9PMC 6224419 · doi ↗ · pubmed ↗
2Bastard K , Perret A, Mariage A et al Parallel evolution of non-homologous isofunctional enzymes in methionine biosynthesis. Nat Chem Biol 2017;13:858–66.28581482 10.1038/nchembio.2397 · doi ↗ · pubmed ↗
3Bastard K , Smith AAT, Vergne-Vaxelaire C et al Revealing the hidden functional diversity of an enzyme family. Nat Chem Biol 2014;10:42–9.24240508 10.1038/nchembio.1387 · doi ↗ · pubmed ↗
4Brown DP , Krishnamurthy N, Sjölander K et al Automated protein subfamily identification and classification. P Lo S Comput Biol 2007;3:e 160.17708678 10.1371/journal.pcbi.0030160 PMC 1950344 · doi ↗ · pubmed ↗
5Cao T , Li Q, Huang Y et al plotnine Seq Suite: a Python package for visualizing sequence data using ggplot 2 style. BMC Genomics 2023;24:585.37789265 10.1186/s 12864-023-09677-8PMC 10546746 · doi ↗ · pubmed ↗
6Das S , Sillitoe I, Lee D et al CATH Fun FHM Mer web server: protein functional annotations using functional family assignments. Nucleic Acids Res 2015;43:W 148–53.25964299 10.1093/nar/gkv 488PMC 4489299 · doi ↗ · pubmed ↗
7de Lima EB, Meira Jr. W, de Melo-Minardi RC. Isofunctional protein subfamily detection using data integration and spectral clustering. P Lo S Comput. Biol 2016;12:e 1005001.27348631 10.1371/journal.pcbi.1005001 PMC 4922564 · doi ↗ · pubmed ↗
8de Melo-Minardi RC , Bastard K, Artiguenave F et al Identification of subfamily-specific sites based on active sites modeling and clustering. Bioinformatics 2010;26:3075–82.20980272 10.1093/bioinformatics/btq 595 · doi ↗ · pubmed ↗