ActSeek: fast and accurate search algorithm of active sites in alphafold database
Sandra Castillo, Osmo Henri Samuli Ollila

TL;DR
ActSeek is a fast and accurate tool that searches for proteins with similar active sites in the Alphafold database, helping find enzymes for biodegradable plastics and drug off-targets.
Contribution
Introduces ActSeek, a novel computer vision-inspired algorithm for searching active site similarities in structural databases.
Findings
ActSeek can identify enzymes for producing or degrading biodegradable plastics.
The tool helps find potential off-targets for common drug molecules.
ActSeek is implemented for fast and accurate mining of the Alphafold database.
Abstract
Finding proteins with specific functions by mining modern databases can potentially lead to substantial advancements in wide range of fields, from medicine and biotechnology to material science. Currently available algorithms enable mining of proteins based on their sequence or structure. However, activities of many proteins, such as enzymes and drug targets, are dictated by active site residues and their surroundings rather than the overall structure or sequence of a protein. We introduce ActSeek—a computer vision-inspired fast program—that searches structural databases for proteins with active sites similar to the seed protein. ActSeek is implemented to mine proteins with desired active site environments from the Alphafold database. The potential of ActSeek to find innovative solutions to the world’s most pressing challenges is demonstrated by finding enzymes that may be used to…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5| Method | Erlotinib | Sorafenib | Beta-blockers | Erlotinib | Sorafenib | Beta-blockers |
|---|---|---|---|---|---|---|
| ActSeek |
|
|
|
|
|
|
| BLAST | 471 | 1000 | 674 | 132 | 7 | 23 |
| FoldSeek | 1453 | 1183 | 1695 | 21 | 0 | 2 |
| Strucmotif-search | 3935 | 2479 | 10 | 58 | 8 | 4 |
| Folddisco | 194 | 233 | 1737 | 33 | 8 | 18 |
| pyScoMotif | 132 | 2045 | 6 | 21 | 10 | 6 |
| ProBis | 249 | 305 | 13 | 56 | 6 | 4 |
- —Research Council of Finland for financial support
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsComputational Drug Discovery Methods · Genetics, Bioinformatics, and Biomedical Research · Protein purification and stability
1 Introduction
Finding proteins with desired functions by mining databases is becoming increasingly attractive due to the growing amount of data and emerging computational tools. In addition to the Uniprot database (UniProt Consortium 2022) containing protein sequences, also in silico predicted protein structures are available for mining (Jumper et al. 2021, Varadi et al. 2024). These databases open new avenues for searching proteins based also on structural similarity, in addition to traditional sequence-based approaches using BLAST or similar algorithms (Altschul et al. 1990). Such algorithms are available, but their computational efficiency is often a limiting factor for large-scale structure-based searches (Meysman et al. 2015, Cia et al. 2023, van Kempen et al. 2024, Burley et al. 2025).
A significant improvement in performance was presented in the recent FoldSeek algorithm (van Kempen et al. 2024), which enabled structure-based searches over a large number of structures available in databases. However, the activities of some relevant proteins, such as enzymes and drug targets, are dictated by several residues in the proximity of active sites rather than the overall structure of a protein. Therefore, searches based on full structure may result in proteins whose overall structure resembles the seed protein, but have critical differences at the active site region. On the other hand, proteins with a similar active site to the seed protein but with some irrelevant structural differences may not be found.
Here we present ActSeek, a program that searches structural databases for proteins with a similar active site to the seed protein. Our fast algorithm, inspired by computer vision (Arun et al. 1987, Golub and Van Loan 2013), enables searches over large databases based on the coordinates of specific amino acids. Here we have implemented ActSeek to search over the Alphafold database (Varadi et al. 2024).
Finding proteins with desired active sites and functions can benefit a wide range of fields, from medicine and biotechnology to material science. Here we demonstrate the potential of ActSeek to find (i) proteins that are able to produce biodegradable plastics with enhanced properties (polyhydroxyalkanoate (PHA) synthases) (Poltronieri and Kumar 2017, Diniz et al. 2023), (ii) new proteins to degrade poly(ethylene terephthalate) (PET) based plastics (depolymerases PETase and MHETase) bearing potential in recycling applications (Maity et al. 2021), and (iii) potential off-targets for common drugs. These examples demonstrate the potential of protein mining based on active site structures in providing innovative solutions to the world’s most pressing challenges.
While other methods can find structural motifs with high speed, as ActSeek does (Steinkellner et al. 2014, Bittrich et al. 2020, Shi et al. 2024), they are limited to smaller databases such as the RCSB PDB (Berman et al. 2000) or BRENDA (Chang et al. 2021) due to their requirement of indexing the enzyme databases prior to the search. This limitation makes it difficult to search larger or customized protein databases. Therefore, we focus here on comparing ActSeek with widely used tools for protein mining in large databases, such as Blast and FoldSeek. Nevertheless, for drug target searches in already indexed Human proteome, we compare the performance of ActSeek also with Strucmotif-search (Bittrich et al. 2020), Folddisco (https://github.com/steineggerlab/folddisco), pyScoMotif (Cia et al. 2023), and ProBis (Konc and Janežič 2010) motif search algorithms.
2 Materials and methods
2.1 Defining the relevant amino acids in the active site
For a start, the user needs to define the binding site in the seed protein by providing cartesian coordinates or positions of the three most important amino acids. If less than three amino acids are given to the search, ActSeek will search all the amino acids within of specific radius from the first amino acid the user has given, and use the first 3 amino acids for the search. These amino acids will be used to align the query structure with the seed. The user can also define additional amino acids that are relevant for the active site, which will be used for ranking when selecting the best query protein. Also, neighboring amino acids in the sequence (four by default, but can be reset by the user) to the selected amino acids (binding site and additional ones) will be used for the ranking. Coordinates of C^α^ and C^β^ atoms in the amino acids will be used in the searches. The user can define whether amino acid types in the local structure of the query need to exactly correspond to the seed ones, or define a list of similar amino acids that are accepted at each position.
2.2 Pre-screening
The ActSeek algorithm first downloads the query structure from the database and checks if the three main amino acids defined by the user are present in the sequence. Before comparing local 3D configurations with the seed, ActSeek filters potential matches by comparing inter-amino acid distances. To this end, distances between the amino acids from all possible sequential combinations are calculated. Only if all inter-amino acid distances deviate <3 Å from the values in the seed structure, the structure with the given mapping will be considered further. Notably, several possible mappings may be found for a single structure. If the number of possible combinations exceeds 2000 (can be reset by the user), this number of combinations is randomly selected for testing.
2.3 Finding the optimal rotation and translation matrices
Local 3D structures that satisfy the condition for binding site inter-amino acid distances in the pre-screening step are then further compared with the seed. This is done by finding the optimal rotation and translation matrices to overlay the binding site amino acid configuration with the seed using singular value decomposition (SVD) and least square fitting (Arun et al. 1987).
2.4 Selecting the best query proteins
After performing Euclidean transformations with the optimal matrices, similarities of active sites between query and seed proteins are measured by calculating distances between relevant amino acids. To this end, we first calculate the average distances for selected amino acids (binding site and additional ones) between a query structure and the seed. Additional amino acids are counted only if the corresponding type is found in the query protein (exact match or correspondence defined by the user). In addition, the same distances are calculated for neighboring amino acids (set by user or 4 by default) around each relevant amino acid. Distances between neighboring amino acids are used regardless of the types of these amino acids in the query and seed.
The score for each possible mapping of amino acids in the given structure is then calculated as an average of the distances of mapped amino acids divided by the number of amino acids over which the average is taken (to penalize situations where additional amino acids defined by the user are not found from the query). The combination of mapped amino acids with the lowest score is then selected for the given structure. Finally, the given structure is selected if its lowest score was less than user-defined value (default 1 Å).
2.5 Other calculations
ActSeek calculates additional measures to assess the structural similarity between seed and query proteins, which can help filter the results further.
Protein alignment percentage: The alpha carbons of the query structure are aligned with those of the seed protein by optimally pairing them using dynamic programming (Sniedovich 2010). Only pairs with a distance below 2 Å are considered and used to calculate the percentage of aligned amino acids in the query structure. This value is given as a “Structural mapping percentage” in the result file.Local protein alignment: We calculate a weighted average of the number of 5, 10, 15, and 20 consecutive amino acid pairs that are aligned in both structures. This value is given by ActSeek as “Structural local similarity” in the result file. This measure indicates similarities of substructures that are aligned with the seed protein, without considering the rest of the protein that might differ significantly.Cavity Comparison: To compare active site cavities between query and seed proteins, cavities are first recognized using the pyKVFinder (Guerra et al. 2021) algorithm. Cavities between seed and query are then compared based on the above-described protein alignment. We also provide the percentage of amino acids aligned in the cavity compared to the total amino acids. The result file provides these measures as “Cavity distances,” “Cavity mapping (case: seed),” and “Cavity mapping percentage.” This feature is optional because it reduces the program’s performance during extensive searches.
2.6 Implementation and running
The algorithm supports multiprocessing, and it is designed for computer clusters, though it can run on a single computer for smaller searches. Each running node in the cluster with 24 CPUs would process 10 000 proteins from the database in 2–10 min, with the exact duration depending on the average number of possible mappings to test. A search in a database containing 200 million proteins would take <6 h using 100 nodes. ActSeek is written using Python. The code can be found at https://github.com/vttresearch/ActSeek
2.7 Blast searches and FoldSeek searches
We used Uniprot Knowledgebase TrEMBLE (UniProt Consortium 2022), containing around 253 million sequences, as a database for the Blast search. The Evalue threshold was set to 10. We used the Blastp algorithm (Altschul et al. 1990) for the protein search. For the FoldSeek (van Kempen et al. 2024) search, we used the Alphafold/UniProt50 database (Varadi et al. 2024) and ran it with the default parameters.
2.8 Structural motif search algorithm searches
All tested structural motif search software tools for identifying potential drug off-targets in the human proteome were used with default parameters. Strucmotif-search (Bittrich et al. 2020) was performed using a web server available at https://www.rcsb.org/search/advanced. Filtering was not possible for Strucmotif-search because the program provides only a score-sorted list of PDB proteins without detailed mapping information. ProBis search (Konc and Janežič 2010) was performed using the web servers available at http://ProBis.cmm.ki.si/, and results were filtered using their Z-value metric, keeping only proteins with a Z-value greater than 2. Folddisco code was downloaded from https://github.com/steineggerlab/folddisco and used locally. Results were filtered by keeping only those with all three main amino acids mapped. PyScoMotif (Cia et al. 2023) code was downloaded from https://github.com/3BioCompBio/pyScoMotif and used locally. Results were filtered by keeping only those with the RMSD value of 1.0 or lower.
2.9 ActSeek search examples
2.9.1 PHA synthases
We used the PHA synthases Class I from Chromobacterium violaceum (uniprot: Q9ZHI2) as the seed, and its three catalytic amino acids 291Cys, 447Asp, and 477His (Chek et al. 2017) as the sites for the ActSeek search. Additional amino acids defined for the search were 393Trp and 477Gly. Default values were used for other ActSeek parameters. The search was performed through the Alphafold database (Varadi et al. 2024) for proteins that are annotated with -hydrolase fold in the Uniprot database (UniProt Consortium 2022) (total 2 567 710 enzymes) because all PHA synthases are expected to have such a fold. We filtered the data based on local protein alignment score because we did not anticipate significant structural divergence in this case. When the mapping included only the three main catalytic amino acids (additional amino acids not found), we set the threshold for local protein alignment score to 0.4. When additional amino acids were included in the mapping, we lowered the threshold to 0.1. After filtering, we also removed any proteins that had been deleted from UniProt during the course of our study.
2.9.2 PETase
We used poly(ethylene terephthalate) (PET) depolymerase from Piscinibacter sakaiensis (Ideonella sakaiensis) as the PETase seed (Uniprot: A0A0K8P6T7), and its three catalytic amino acids 160Ser, 206Asp, and 237His (Joo et al. 2018) as the sites for the ActSeek search. Additional amino acids defined for the search were 87Tyr and 185Trp. Also, phenylalanine was accepted for position 87 and tyrosine for position 185. Five neighboring amino acids around each selected amino acid were used for the ranking (instead of the default value of four). The search was performed for proteins annotated with -hydrolase fold in the Uniprot database (UniProt Consortium 2022) (total 2 567 710 enzymes) as PETases are known to have the -hydrolase fold (Joo et al. 2018).
2.9.3 MHETase
We used protein that catalyzes the hydrolysis of mono(2-hydroxyethyl) terephthalate (MHET) as the MHETase seed (uniprot: A0A0K8P8E7), and its amino acids 225Ser, 492Asp, and 528His as the sites for the search (Knott et al. 2020). Additional amino acids for the search were 224Cys and 529Cys. Default values were used for other ActSeek parameters. The search was performed in proteins annotated with -hydrolase fold in the Uniprot database (UniProt Consortium 2022) (total 2 567 710 enzymes), as MHETases are known to have the -hydrolase fold.
2.9.4 Erlotinib targets
We performed a search using the target of the drug Erlotinib as a seed (Epidermal growth factor receptor, Uniprot P00533). For the search, we used important amino acids for the binding of the drug, such as 718Leu, 790Thr, and 855Asp (Eigenbrot 2008) without additional ones. Also, methionine or isoleucine was accepted at position 790. After the search, we filtered the results by removing hits with the local protein alignment score lower than 0.4. We searched all the human proteins listed in Uniprot (20,417 proteins).
2.9.5 Sorafenib targets
We performed a search using the Sorafenib target B-Raf kinase as a seed (Uniprot P15056), and its amino acids 529Thr, 532Cys, and 536Ser as the sites used for the search without any additional amino acids. Also, valine was accepted at position 529 and asparagine at position 536. We searched all the human proteins listed in Uniprot (20,417 proteins).
2.9.6 Beta-blocker targets
We used the native receptor of these drugs, the beta2 adrenergic receptor (Uniprot id: P07550), as the seed. The main amino acids used for the search were 113Asp, 290Phe and 312Asn and additional ones were 114Val and 289Phe (Rasmussen et al. 2007), which are known to be involved in the drug binding. We allowed amino acids of the same type (i.e., uncharged polar, charged, hydrophobic, and aromatic) to be interchangeable during the search. We searched all the human proteins listed in Uniprot (20,417 proteins).
3 Results
3.1 ActSeek algorithm
In contrast to BLAST (Altschul et al. 1990) and FoldSeek (van Kempen et al. 2024), which are based on the overall sequence or structure of proteins, ActSeek finds proteins with similar local structure to the active site of a seed protein (Fig. 1A). Here, we implement the search from the Alphafold database (Varadi et al. 2024), yet the algorithm can be applied to other databases with 3D structures of proteins under interest.
(A) Structure of an example seed protein with active site amino acids marked in red. (B) Query proteins are taken from a dataset [alphafold database (Varadi et al. 2024) in this work]. (C) Optimal rotational and translational matrices found with SVD and least square fitting (Golub and Van Loan 2013). (D) Best targets selected based on local structure.
To initiate the ActSeek search, the user needs to first provide the structure of the seed protein and define the relevant amino acids for the active site and its surroundings (Fig. 1A). Alternatively, the user can provide the coordinates of the alpha and beta carbons of the relevant amino acids. In the case that less than three amino acids are provided, ActSeek will identify all the amino acids within a defined radius from the initial amino acid provided by the user, conducting the search based solely on their spatial coordinates, irrespective of their identities. Three most relevant amino acids defined by the user, corresponding for example catalytic or binding site, will be used to compute the optimal rotational and translational matrices for structural alignment of the seed and query proteins using singular value decomposition (SVD) and least square fitting (Golub and Van Loan 2013) (Fig. 1C). After euclidean transformation using these matrices, query proteins found from a given database are ranked based on distances between amino acids in the seed and the query structures. These distances are calculated using binding site amino acids together with additional ones defined by the user, and their neighboring residues. As a result, ActSeek provides a ranked list of natural enzymes with a similar 3D arrangement of the active site residues to the seed (Fig. 1C).
To maximize the speed of the algorithm and enable searches over large datasets, query sequences are pre-screened before proceeding to computationally expensive comparison of 3D arrangements between seed and query proteins. This is done by checking that a query structure contains the three active site amino acids with mutual distances similar to the seed.
3.2 Searching new enzymes
To demonstrate the utility of ActSeek for finding new enzymes, we searched proteins that contain active sites characteristic of PHA synthases (Steinbüchel et al. 1993), and two different PET depolymerases PETase and MHETase (Yoshida et al. 2016). PHA synthases are used to produce biodegradable plastics (Poltronieri and Kumar 2017, Diniz et al. 2023), while PETase and MHETase degrade widely used plastics bearing potential in recycling applications (Maity et al. 2021, Seo et al. 2025). ActSeek results for these enzymes are compared with the sequence-based BLAST (Altschul et al. 1990) and structure-based FoldSeek (van Kempen et al. 2024) in Fig. 2A.
(A) Results for ActSeek searches over proteins in all organisms using active sites of a PHA synthase, PETase, and MHETase as seeds. (B) Results for ActSeek searches over human proteins with relevant binding pocket amino acids for eriotinib, sorafenib, and beta blockers as seeds.
For all these enzymes, ActSeek finds significant amounts of proteins that are not recognized by BLAST or FoldSeek searches (11 964 for PHA synthase, 11 115 for PETase, and 216 for MHETase). This means that a large number of enzymes containing the relevant active site, thereby potentially having the desired functions, are not found by BLAST or FoldSeek searches. On the other hand, these methods find many proteins with similar sequences or overall structures to the seed, but that does not contain the active site required for the desired reaction. These results demonstrate that ActSeek can (1) find new proteins with the desired active site that are not found by searches based on sequences or overall structures, and (2) filter out proteins that do not contain the relevant active sites from BLAST or FoldSeek results. The new natural enzymes found by ActSeek can bear novel properties that are potentially useful in a wide range of applications, from specificity for relevant substrates to functionality in harsh conditions.
Furthermore, our results demonstrate the potential of ActSeek to improve functional annotation of proteins. For example, 31 585 proteins are currently annotated as PHA synthases in the Uniprot database (UniProt Consortium 2022). However, 9080 of these do not contain the relevant active site according to the ActSeek. On the other hand, ActSeek finds 4875 proteins with the relevant active site for PHA synthesis, but that are not annotated for this function in the Uniprot database. Based on these results, ActSeek can provide new annotations for protein functions based on active sites, as well as filter incorrect annotations from databases.
3.3 Searching off-target proteins for drugs
Besides enzymes, drug binding to query proteins is another example where the local binding region is more relevant than the overall structure or sequence. Here we demonstrate that ActSeek can find potential off-target proteins for drug molecules. To this end, we searched all human proteins from the Alphafold database with binding pockets for two anti-cancer drugs, Erlotinib (Bareschino et al. 2007) and sorafenib (Yoshida et al. 2017), and for beta-blockers used to treat various cardiovascular diseases such as hypertension, cardiac arrhythmia, or myocardial infarction, and also alleviate severe anxiety (Poirier and Tobe 2014). Erlotinib and sorafenib inhibit epidermal growth factors and protein kinases, respectively, and beta-blockers block the receptor sites of adrenergic receptors. ActSeek results using the main targets of these drugs as seeds (Uniprot P00533, P15056, P07550, respectively) are compared with the sequence-based BLAST (Altschul et al. 1990), structure-based FoldSeek (van Kempen et al. 2024) in Fig. 2B and four structural motif search algorithms: Strucmotif-search (Bittrich et al. 2020), Folddisco folddisco, pyScoMotif (Cia et al. 2023), and ProBis (Konc and Janežič 2010).
ActSeek finds relatively few proteins with suitable active sites for binding of drugs studied here (190 for Erlotinib, 14 for Sorafenib, and 29 for beta blocker, the list of found proteins is provided in Supplementary Materials). Majority of ActSeek results are found also by Blast search based on sequences, but Blast finds also hundreds of proteins that do not contain the relevant binding site according to ActSeek, therefore being unlikely targets for the given drug (Fig. 2B). Structure based FoldSeek search finds hundreds of proteins with similar structure to the seed, but only few of them (21 for Erlotinib and two for beta-blocker off-targets) contain the relevant binding site, compromising its suitability for finding drug off-targets. Results from other structural motif search algorithms are shown in Table 1, available as supplementary data at Bioinformatics online.
For Erlotinib, ActSeek results contain 58 proteins that were not identified with sequence-based Blast search due to their low sequence similarity with the seed protein (Fig. 2). For instance, kinases with uniprot IDs of Q9H3Y6 and Q9NSY1 have <30% sequence similarity with the seed protein and only partial structural similarity, but they have relevant amino acids in the exact locations as in the seed protein (Fig. 3), which explains why they were found by ActSeek but not by Blast or FoldSeek. Surprisingly, the kinase Q9NSY1 was not returned by Strucmotif-search, Folddisco, or pyScoMotif. On the other hand, they return many proteins not found by ActSeek. After analyzing some of these proteins, we found that they are mainly spurious mappings. For example, pyScoMotif identified a kinase (UniProt ID P04049) with residues 718Leu, 790Thr, and 855Asp mapped to positions 303Leu, 119Thr, and 270Asp in the seed protein. However, structural superposition of both proteins reveals that the actual active site residues in this protein should be 19Ile, 92Thr, and 163Asp. This protein was not returned by ActSeek because the amino acid in position 19 is isoleucine instead of leucine, which was one of the requirements for the search.
Potential off-target for Erlotinib, tyrosine-protein kinase Srms (Uniprot: Q9H3Y6) (pink) superposed to the seed protein epidermal growth factor receptor (Uniprot: P00533) (light green). This protein was found by ActSeek but not by Blast or FoldSeek. The amino acids used for the search of Erlotinib off-targets are marked as blue (718Leu, 790Thr, and 855Asp). The corresponding amino acids in the tyrosine-protein kinase Srms are marked as red (236Leu, 302Thr, and 368Asp).
Previously reported off-targets for Erlotinib that may contribute to the common side-effects (Boehrer et al. 2008, Yamamoto et al. 2011) (JAK2, STK10) were identified by ActSeek, ProBis, Strucmotif-search, and Blast, but not by FoldSeek, Folddisco, or pyScoMotif.
For Sorafenib, ActSeek identified 13 potential off-targets. These include the retinal guanylyl cyclase2 (Uniprot P51841) found also by Blast, which may explain one of the reported drug’s side effects related to the retinal tear (Gaertner et al. 2014). Additionally, two of the identified drug targets are annotated as atrial natriuretic peptide receptor 1 and 2 (Uniprot P16066 and P20594) and were found only by ActSeek, pyScoMotif, and Folddisco (although they were filtered out from Folddisco due to incomplete mapping). Alignment of atrial natriuretic peptide receptor 1 with the main Sorafenib target, B-Raf kinase (Uniprot P15056), with Sorafenib docked is shown in Fig. 4. Inhibiting these receptors may cause high blood pressure and heart-related problems, which are also reported side effects of the drug (Maitland et al. 2009). However, none of these receptors have been previously reported as off-targets of the drug. Also, previously reported off-targets for Sorafenib, VEGFR 1, 2, and 3 (Wilhelm et al. 2006), are among the ActSeek results. VEGFR 1 is also found by the rest of the other structural motif search tools, VEGFR 2 was found only by Folddisco, and VEGFR 3 was found by all the tools except Folddisco.
Potential off-target for Sorafenib, atrial natriuretic peptide receptor 1 (Uniprot P16066, beige), aligned with the main target, B-Raf kinase (Uniprot P15056, light green). Amino acids involved in the binding are marked in red in the B-Raf kinase and blue in the atrial natriuretic peptide receptor 1. Sorafenib (magenta) has been docked to the B-Raf kinase structure.
For beta-blockers, 23 out of the 28 potential off-targets found by ActSeek are also identified by Blast. Successfully recognized case by Blast with relevant active site, 5-hydroxytryptamine receptor, is superposed with the seed in Fig. 5B. However, the usefulness for sequence-based search for this task is compromised by a large number of false-positive results. For example, Trace amine-associated receptor 9 (Uniprot Q96RI9) is ranked high in the Blast results (E-value: 4.2×10^−49^) and has a similar overall structure to the seed (Fig. 5A), but does not contain 290Phe and 312Asn amino acids, which are involved in the binding of the beta-blocker drug (Fig. 5C). ActSeek results for beta-blockers include different types of beta adrenergic (1, 2, and 3), serotonine receptors and dopaminergic receptors, among others. Interaction of beta-blockers with dopaminergic receptors has been previously reported (Ben-Uriah et al. 1981), yet specific receptors found by Actseek are potentially novel off-targets for beta-blockers. For instance, the olfactory receptors found by ActSeek could explain the reported side effects of beta blockers on taste and smell (Che et al. 2018). Surprisingly, beta adrenergic receptor 3 was not in the results of ProBis, and the dopaminergic receptors were not found by Structmotif-search. Folddisc and pyScoMotif found both types of receptors.
Comparison of results from ActSeek and Blast for beta-blocker off-targets. (A) Superposition of the alphafold structures for the seed beta2 adrenergic receptor (P07550), (pink), for a protein found by both ActSeek and Blast 5-hydroxytryptamine receptor 1B (P08908), (beige), and a protein found only by Blast but not ActSeek, Trace amine-associated receptor 2 (Q96RI9), (green). (B) Superposition of active site amino acids from the seed (red), and 5-hydroxytryptamine receptor 1B (blue) found by both Actseek and Blast. (C) Superposition of active site amino acids from the seed (red) and the Trace amine-associated receptor 2 (Q96RI9) (blue), which was found by Blast but not by Actseek.
3.4 Performance with respect to other enzyme mining tools
ActSeek search from a database of 2 million proteins takes ∼16 h with a single computer with one node, and <1 h using a cluster with 20 nodes. Because ActSeek does not need a preprocessing step when querying any custom database, its requirements for RAM memory (approximately 1 Gb) and diskspace (∼30 Mb) are negligible for any modern computer. In contrast, many algorithms used for structural motif searches require prior indexing of the proteins, posing significant requirements for disk space, memory, and computing time. For example, indexing 1.2 million proteins with Structmotif-search takes 36 h and half a terabyte of disk space (Bittrich et al. 2020). PyScoMotif needs 20 h and 30 min to index 200 000 PDB files using 12 parallel cores (Cia et al. 2023).
Comparisons of ActSeek results with other motif search algorithms reveal only partial overlap (see Table 1). Detailed analysis of these results demonstrates that all tested algorithms produce many spurious matches. We attribute these limitations to the absence of robust mechanisms for mitigating false positives. Most algorithms rely on metrics based on structural similarity scores or alignment standard deviations for whole proteins that are not directly tied to the actual residue mapping. As a result, they can accept matches that are generally similar to the seed, but have significant differences in the proximity of active sites, or reject proteins that have similar active site surroundings but are structurally different elsewhere. In contrast, ActSeek evaluates the structural similarity and amino acid content of the proximity of mapped residues to identify proteins with similar local environments and filter out spurious mappings.
In conclusion, ActSeek is expected to be more efficient when searching from large databases, such as the Alphafold database containing over 200 million structures (Varadi et al. 2024), than tools that require indexing.
4 Conclusions
We present here the ActSeek algorithm to rapidly search proteins with relevant active sites from the Alphafold database based on seed protein coordinates. Advantages of ActSeek to commonly used protein mining tools based on sequences (BLAST) and structures (FoldSeek) are demonstrated for finding new enzymes and drug off-targets. Drug off-target results are compared also with motif search algorithms that are limited to smaller datasets due to indexing requirements. Our results demonstrate that ActSeek can complement searches based on sequence or overall structure by finding proteins containing the relevant active site that are not recognized based on sequence or structure. On the other hand, ActSeek can refine searches by filtering out results with similar sequence or structure, but not having the relevant active site. These features increase the probability to find new proteins with desired functions, particularly when structural details of a local active site are crucial for function. This helps to focus further screening efforts on to most probable candidates.
Finding proteins with matching active site 3D conformations is particularly useful for proteins whose functions are based on specific binding of certain substances, such as enzymes and drug targets. For example, ActSeek can find natural proteins from databases with similar active sites to the seed enzyme, which can then be further screened for relevant novel functionalities, such as specificity for new substrates. Expressibility and biocompatibility of such natural proteins are expected to be, on average, better than for de novo designed proteins.
Another example of practical benefit is finding potential off-targets for drugs by searching human proteins with similar binding sites to known targets. Sequence-based search returns many false positive results for such a search, but ActSeek can filter out results that lack the relevant active sites, thereby substantially reducing the resources required for further screening.
We foresee wide range of applications for ActSeek in understanding and engineering biological systems from classification of protein functionalities for functional annotation to protein engineering, enzyme design and drug development.
We are making the results of ActSeek, Blast, and FoldSeek, Strucmotif-search, Folddisco, pyScoMotif, and ProBis available in the Supplementary Material, together with their corresponding queries.
Supplementary Material
btaf424_Supplementary_Data
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Altschul SF , Gish W, Miller W et al Basic local alignment search tool. J Mol Biol 1990;215:403–10.2231712 10.1016/S 0022-2836(05)80360-2 · doi ↗ · pubmed ↗
- 2Arun KS , Huang TS, Blostein SD. Least-squares fitting of two 3-d point sets. IEEE Trans Pattern Anal Mach Intell 1987;PAMI-9:698–700.21869429 10.1109/tpami.1987.4767965 · doi ↗ · pubmed ↗
- 3Bareschino MA , Schettino C, Troiani T et al Erlotinib in cancer treatment. Annal Oncol 2007;18:vi 35–41.10.1093/annonc/mdm 22217591829 · doi ↗ · pubmed ↗
- 4Berman HM , Westbrook J, Feng Z et al The protein data bank. Nucl Acids Res 2000;28:235–42.10592235 10.1093/nar/28.1.235PMC 102472 · doi ↗ · pubmed ↗
- 5Bittrich S , Burley SK, Rose AS. Real-time structural motif searching in proteins using an inverted index strategy. P Lo S Comput Biol 2020;16:e 1008502.33284792 10.1371/journal.pcbi.1008502 PMC 7746303 · doi ↗ · pubmed ↗
- 6Boehrer S , Adès L, Braun T et al Erlotinib exhibits antineoplastic off-target effects in AML and MDS: a preclinical study. Blood 2008;111:2170–80.17925489 10.1182/blood-2007-07-100362 · doi ↗ · pubmed ↗
- 7Burley S , Bhatt R, Bhikadiya C et al Updated resources for exploring experimentally-determined pdb structures and computed structure models at the rcsb protein data bank. Nucl Acids Res 2025;53:D 564–74.39607707 10.1093/nar/gkae 1091 PMC 11701563 · doi ↗ · pubmed ↗
- 8Castillo S , Ollila S. 2025. Actseek: Supplementary Material. 10.5281/zenodo.15621934 (11 June 2025, date last accessed). · doi ↗
