Comparative Dynamics Enables Discovery of Embedded Bacterial Ferredoxin Domains in Large Redox Enzymes
Jan A. Siess, Vikas Nanda

TL;DR
Researchers used protein dynamics to find ancient bacterial ferredoxin domains inside larger enzymes, suggesting a new way to study protein evolution.
Contribution
The study introduces comparative dynamics as a novel method to detect embedded ferredoxin domains with limited structural or sequence homology.
Findings
Fragments of bacterial ferredoxin domains were identified in larger enzymes using comparative dynamics.
Dynamical similarity reveals functional relationships even when structural or sequence homology is limited.
Protein dynamics are more evolutionarily constrained than structure, suggesting functional conservation.
Abstract
Bacterial ferredoxins are small iron–sulfur binding proteins that function as soluble electron shuttles between redox enzymes in the cell. Their simple 2×(β–α–β) fold, central metabolic function, and ubiquity across all kingdoms of life have led to the proposal that ferredoxins were likely among the earliest proteins. Today, ferredoxin‐like folds are embedded in large, multidomain enzymes, suggesting ancient gene duplication and fusion events. In some cases, these embedded domains may have scant sequence or even structural homology to soluble counterparts, challenging the use of traditional phylogenetic tools to establish evolutionary relationships. In this study, we identify fragments of bacterial ferredoxins within larger oxidoreductases by integrating comparative sequence, structure, and dynamical attributes. Dynamics are computed using an elastic network model and analyzed for…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
FIGURE 1
FIGURE 2
FIGURE 3
FIGURE 4
FIGURE 5
FIGURE 6
FIGURE 7
FIGURE 8- —National Institute of General Medical Sciences10.13039/100000057
- —NASA Astrobiology Institute10.13039/100012627
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsProtein Structure and Dynamics · Photosynthetic Processes and Mechanisms · Enzyme Structure and Function
Introduction
1
Modern oxidoreductases (EC1 enzymes) are large nanomachines essential to driving electron‐transfer reactions. They contain prosthetic groups comprising a select set of transition metals that facilitate catalysis and are present within all forms of life [1]. Despite their ubiquity, the deep‐time evolutionary origins of these enzymes remain poorly understood. It is hypothesized that modern oxidoreductases represent an assembly of smaller folds that had undergone a series of gene duplication, combination, and sequence divergence events over billions of years [2, 3, 4]. Uncovering the trace of these ancient processes is difficult due to the statistical limitations of current phylogenetic methods over such immense timescales.
Traditional sequence‐based phylogenetic methods, applied at both the gene and genome levels, have provided deep insights into redox protein evolution [5, 6, 7, 8, 9]. Yet, these methods often struggle to resolve ancient divergence events spanning billions of years [10, 11, 12]. Comparative three‐dimensional structure analysis [13, 14, 15, 16], especially when combined with functional insights from cofactor interactions [4, 17, 18, 19, 20], has also been employed to tackle this challenge. However, in large multidomain oxidoreductases, where domain fusion and insertion events obscure homology, purely sequence‐ or structure‐based approaches fail to identify distantly related domains. As a result, there is a clear need for complementary methods that can elucidate these hidden evolutionary relationships.
Recent structure‐cofactor interaction network analyses have proposed that close spatial proximity of distinct folds may represent “fossil evidence” of divergent evolution in large proteins [17, 18], arising from domain duplication and fusion followed by extensive diversification. One example is the proposed transition between the bacterial ferredoxin fold and the Rossmann‐like family domains [17]. Yet, these studies often rely on strict structural proximity cutoffs (e.g., microenvironments within a 15 Å sphere centered around the cofactor), which may overlook domain boundaries, discontinuous segments, and domain insertions. While cofactor‐driven inference has been indispensable toward inferring deep‐time protein fold evolution, a different approach is needed—one that can detect evolutionary signals even when structure and sequence similarities have been heavily eroded.
Although protein domains are often described in terms of independent folding, modular function, or evolutionary selection [21, 22], no single definition can fully encapsulate a protein domain. Traditional domain identification methods—whether homology‐based [23], ab initio [24], or structure‐focused [25]—have become more powerful with the rise of large sequence databases and both machine‐ and deep‐learning‐driven structure prediction tools [26, 27, 28, 29]. Still, discontinuous or highly modified domains remain difficult to characterize. Notable, recent, approaches that incorporate protein dynamics‐defining domains as regions undergoing collective movements‐offers a new way to investigate otherwise cryptic domain boundaries [30, 31].
Protein structure tends to evolve more slowly than protein sequence [32, 33, 34], but the functional motions of a protein—its dynamics—could be even more conserved if those motions are pivotal for its activity [35, 36]. We propose that low‐frequency, large‐amplitude normal modes play a particularly important role: if these global motions underpin function, then they may remain conserved over immense evolutionary timescales, even when sequence and static structure have diverged [37]. If this is the case, dynamics may not only delineate domain boundaries of a large multi‐domain protein, but could enable us to detect homology among remote or cryptic domains beyond the structure and sequence level.
Comparative dynamics can be pursued using strategies at differing levels of granularity, from coarse‐grained elastic network models to all‐atom molecular dynamics (MD). Anisotropic network models (ANM) reduce a protein into a mass‐spring network where each residue is represented by the position of the backbone alpha‐carbon [38]. ANM computations are rapid, but do not explicitly consider chemical properties of the amino acids. In contrast, MD simulates explicit atomistic motion, capturing side‐chain interactions and solvent effects over time. Movement is calculated through the application of an empirical forcefield on all atoms, which incorporates bonded terms like bond stretching, angle bending, torsions, as well as non‐bonded interactions. Newton's equations of motion are then integrated in discrete time steps to generate trajectories depicting detailed structural dynamics at atomic resolution [39]. However, this level of precision comes at substantial computational cost, especially for large systems or long simulation periods, where most meaningful biological processes take place [40, 41, 42, 43]. ANM, conversely, while unable to replicate the same level of detail and resolution as captured by MD, has been shown to effectively predict functionally relevant, large‐scale motions at significantly lower computational expense [44].
Normal mode analysis (NMA) [45, 46, 47] is a mathematical approach that extracts collective motions, identifying essential modes corresponding to large‐amplitude, low frequency global movements in proteins. NMA can be applied to atomistic trajectories generated by MD simulations, or through simplified coarse‐grained models such as ANM. Despite differences in computational complexity and granularity resolution‐wise between these methods, both ANM‐based and MD‐based NMA identify similar essential modes [38, 48, 49]. By focusing on these large‐scale modes, which are known to mediate key functional transitions [50, 51, 52], we can discern evolutionary signals that might otherwise be masked.
In this study, we use ANM for its computational efficiency in analyzing very large, multidomain oxidoreductases. It also provides the requisite data for conducting comparative dynamics analysis, leading to the quantification of dynamical similarity between two structures. When the term “dynamical similarity” is used, we are referring to the extent to which two proteins (or protein subdomains) exhibit comparable large‐scale motions, as determined by overlap in their lowest‐frequency normal modes. Formally, we quantify this by comparing these modes—often through dot products or cosine similarities—to see how closely the direction and spatial displacements of the essential functional motions align between different structures. Because these large‐amplitude movements directly underpin function, dynamical similarity can highlight deep evolutionary relationships that may be missed by purely sequence‐ or structure‐based comparisons.
NMA‐based approaches have been illustrated in globins, where the first few normal modes are highly conserved across the family, reflecting shared functional motions [53, 54]. Because these large‐scale, global motions often coincide with essential functional rearrangements [37, 46, 55], and are robust against random mutations [56], it would necessarily follow that evolution should be selecting for conservation of the first few normal modes. Higher‐frequency modes, by contrast, tend to differentiate subfamilies, highlighting how dynamics both unites and diversifies proteins at different evolutionary scales [57].
Bacterial ferredoxins, in particular, provide an ideal test case for this framework. These small (typically 50–80 amino acids) iron–sulfur binding proteins are among life's earliest domains [58, 59], likely serving as electron shuttles in primitive metabolic networks. Their ubiquitous 2×(β–α–β) fold is found both as an isolated domain and embedded within large multidomain oxidoreductases, where repeated gene fusion and sequence diversification events may obscure direct sequence or structural homology, making their isolation and characterization challenging.
In this study, we integrate sequence, structure, and comparative dynamics to detect ferredoxin‐like domains buried in massive oxidoreductases. We focus on a small soluble 2[4Fe‐4S] ferredoxin from G. acidurici [60] and compare its essential modes with those of several larger oxidoreductases that contain multiple 4Fe‐4S clusters forming an electron‐transfer relay. Similar to approaches that identify repeats or subdomains based on local similarity in sequence and structure [61, 62], we incorporate an ANM‐based comparison of the lowest‐frequency normal modes. Notably, we find that dynamical overlap can reveal hidden ferredoxin domains—even when structural similarity is minimal—implying that the global, low‐frequency motions may remain evolutionarily constrained (Figure 1). By jointly analyzing sequence, structure, and dynamics, we offer a robust framework for investigating the deep‐time evolution of protein domains, shedding light on how ancient building blocks persist within modern protein nanomachines [18].
Ferredoxins embedded within larger structures exhibit similar motions to smaller, single domain ferredoxins: a structural analysis of a bacterial ferredoxin from Clostridium acidurici (PDB: 2FDN, top row) and a polyferredoxin segment from Methanothermobacter wolfeii (PDB: 5T5I, bottom row), illustrating homology through dynamics. Normal mode vectors obtained from an anisotropic network model (ANM) are used to generate per‐residue displacement vectors, which are shown to highlight the dynamic behavior of each residue. Dynamics of ancient domains are conserved in the context of larger protein environments, showcasing the similarity of motion between smaller ferredoxins and the larger polyferredoxin.
Methods
2
Anisotropic Network Models (ANM)
2.1
Coarse‐grained NMA was performed using an ANM in the ProDy package (Version 2.3.1) [63]. The analysis included the main‐chain carbon‐alpha (Cα) atoms of a query ferredoxin structure and three large oxidoreductase targets, each containing putative ferredoxin‐like domains. Within the ANM framework, each Cα atom is represented as a node in a uniform mass‐spring network, where springs of constant γ = 1.00 kcal/(mol Å [2]) connect pairs of nodes separated no more than 15 Å (the cutoff distance, denoted as r _ c _) [38].
A Hessian matrix, reflecting small structural displacements around the energetic minimum on the potential energy surface, was constructed by computing the second derivative of the potential energy with respect to the Cα node coordinates [64]. Diagonalizing the Hessian matrix in ProDy yields eigenvectors (normal modes), which describe collective atomic motions, and eigenvalues, which correspond to the frequencies or energies of these modes. We selected the top 10 eigenvectors to calculate cross‐correlation matrices, which represent the correlation of movement patterns between all residue pairs within proteins.
Tiling Protocol and Dynamical Similarity Measures
2.2
Ferredoxin sequences were divided into contiguous subsequence tiles using a sliding‐window approach implemented in a custom Python script (see supplemental code and Figure S1). Each tile has a length L, where L ranges from six amino acids up to the full protein length N. The lower limit of six amino acids follows previous structure‐tiling work, where this length was shown to detect key structural motifs while maximizing sequence coverage [61]. To generate the complete set of tiles for a protein of length N, we consider each integer L in the interval 6 ≤ L ≤ N. For a specific length L, the i‐th tile T i spans a contiguous subsequence of length L. Because we shift this window of length L across the entire protein, i (the tile index) ranges from 1 to (N–L + 1). Consequently, the total number of tiles N t can be computed as:
After generating every possible tile T i of length L, we form all possible pairwise comparisons between tiles. We maintain a direct mapping between each amino acid's position in a tile and its residue index in the original structure. This residue‐based mapping allows us to retrieve the corresponding rows and columns in a precomputed cross‐correlation matrix, denoted as C. Specifically, for each pair of tiles T i and T j (both of length L), the indices of the amino acids in those tiles are used to extract a square submatrix C ij containing only the cross‐correlations among residues in T i and T j. These submatrices capture both the spatial arrangement of residues and their dynamical couplings as encoded in the full correlation matrix. Each submatrix is then flattened into two vectors ** u ** _ ** i ** _ (corresponding to T i) and ** v ** _ ** j ** _ (corresponding to T j). Once ** u ** _ ** i ** _ and ** v ** _ ** j ** _ are obtained, the cosine similarity (Figure 2A) between the two vectors measures their directional alignment. Mathematically, this is defined as:
where ** u ** _ ** i ** _ and ** v ** _ ** j ** _ is the dot product of the two vectors and ||·|| denotes the Euclidean norm. The cosine similarity ranges from −1 (perfectly opposite orientations) to 1 (identical orientations), with 0 indicating orthogonal vectors.
Overview of the methodology. (A) Depicts the three primary parameters (sequence similarity, structure similarity, dynamical similarity) and their distributions across different tile lengths and center indices. (B) The weighted sum for each element is taken as the sum of the product of the weights with each parameter. With the weighted sums calculated, the tiles are iterated over again and checked for the presence of any cysteines. If there are cysteines within an established bonding range to irons within an iron–sulfur cluster, then that tile is kept. If not, then they are discarded. This provided us with the final saw‐tooth pattern. Each of the “blades” of the sawtooth are designated as an “island,” and borders between any two islands are blocked. This is to constrain the search area when calculating the optimal path via the dynamic programming algorithm.
In addition to assessing dynamical similarity via cosine similarity, we also measure the pairwise sequence similarity between any two tiles (Figure 2A). These calculations use the Smith‐Waterman algorithm from the Bio.Align sub‐package in the Biopython API [65], with the BLOSUM62 substitution matrix and default gap‐penalty parameters. The raw alignment score is divided by the average of the two tiles' maximum possible alignment scores, yielding a normalized similarity percentage for each comparison.
Lastly, to quantify structural differences between two tiles, we compute the Frobenius distance of their respective submatrices (Figure 2A). If Q represents the query tile's submatrix and T is the target tile's submatrix, both of size m × m, we first define a difference matrix D = Q–T. The Frobenius norm of D is then computed as:
The final portion of the expression yields us with the matrix inner product denoted as ⟨D, D⟩. Because the Frobenius norm is induced by a natural matrix inner product, we can reduce the above expression to the following:
Substituting Q–T for D gives:
where ** Q ** _ ** ij ** _ and ** T ** _ ** ij ** _ are the (i,j)‐th entries of matrices ** Q ** and ** T **, respectively. The resulting Frobenius distance is a single scalar value representing how dissimilar the two submatrices are in terms of their structural arrangement. Because every comparison involves two tiles of the same length L, the residue indices that define T i and T j map onto sub‐matrices of identical dimension (m = L). This equality is an intrinsic consequence of the sliding‐window approach. As a result, the Frobenius distance is therefore computed directly on D without any rigid‐body rotation or translation; the correlation coefficients are already invariant to such motions (Figure S2). Smaller Frobenius distances indicate greater similarity between the residue‐pair coupling between the two tiles, whereas larger values signify major differences.
However, since the Frobenius distance is a metric denoting dissimilarity, we must convert it into a bounded similarity before combing it with cosine and Smith–Waterman scores. Following the continuous form of the Gower's similarity coefficient [66], the transformation is
Where d_Frobenius,max_ is the largest Frobenius distance observed across all tile pairs in the dataset. The mapping is affine and strictly monotonic, so it preserves rank order while aligning the direction of the metric with the two similarity‐based scores. After this step, all correlations can now be averaged coherently via the Bayesian objective function in Equation (9).
Determination of Optimal Weights Using Bayesian Optimization
2.3
Bayesian optimization was employed to objectively determine the best weights for each of the similarity metrics–cosine similarity, the Frobenius distance, and sequence similarity—so that a combined, weighted similarity measure would align closely with each of the individual metrics. Mathematically, for a given query x, the weighted similarity S _ weighted _ (x) can be expressed as:
where each S _ metric _ (x) (metric ∈ [cosine, Frobenius, SW]) denotes the respective similarity score, and w _ metric _ is the weight assigned to that score. The subscript “SW” refers to sequence similarity obtained via the Smith–Waterman (S sw) algorithm.
To quantify how well S weighted(x) correlates with each individual similarity measure, we define three separate correlations:
where corr(A, B) denotes the statistical correlation between two sets of values A and B. The objective function O to be maximized, is given by the average of these three individual correlations:
A Gaussian Process (GP) model was used to approximate the unknown objective function O [67]. By definition, a GP is fully specified by its mean function m(x) and covariance function k(x,x′). For simplicity, a constant value of 0 was chosen for the mean function, while the covariance was modeled by the squared exponential kernel:
where 𝜎 represents the signal variance, and 𝘭 is the length‐scale parameter controlling the smoothness of the function. Once this GP was established, the Expected Improvement (EI) criterion was employed as the acquisition function, guiding the search for optimal weight values by balancing exploration and exploitation. The EI is defined as:
where f(x*) is the best observed value of the objective function up to the current iteration, and f(x) is the model's predicted value at a new point x.
In this study, the Bayesian optimization procedure was run for 115 iterations. The initial 15 iterations randomly sampled different weight combinations to build an initial training set for the GP model. The remaining 100 iterations were guided by the acquisition function, selecting new points x (i.e., sets the new weights) that were most likely to improve upon the best current solution. At the conclusion of this process, the procedure yielded a set of optimal, normalized weights (w _ cosine _, w _ frobenius _, w _ SW ) for each query‐target structure pair. Notably, these weights were consistent across different structures in our data set, indicating that the relative contributions of the individual metrics remained stable. The final weighted similarity S_weighted(x) computed with these optimal weights was then used to determine the most favorable pathways through the islands under investigation.
Computing the Optimal Paths for Advantageous Tile Sequences
2.4
The algorithm presented here identifies a sequence of tile lengths and tile center indices—collectively referred to as the optimal path—by maximizing a weighted sum encoded within a structure matrix called the IslandTileMatrix. In this matrix, the row index 𝘪 corresponds to a discrete tile length, and the column index j corresponds to a tile center index. Each entry in the matrix therefore represents the contribution (or weight) of adopting a specific tile length 𝘪 at a specific tile center index j. Larger tile lengths generally capture more of the underlying signal and are considered advantageous, but they can potentially obscure more localized dynamics. As a result, the algorithm seeks to balance these competing effects while identifying the path through the matrix that yields the greatest cumulative weight.
To accomplish this, two supplementary matrices of the same dimension as the IslandTileMatrix are maintained throughout the algorithm: the dynamic programming matrix (DPM) and the traceback matrix (TM). The DPM is initialized such that its first row, 𝘪 = 0, receives values directly from the IslandTileMatrix. This initialization step enforces a defined starting state for the algorithm, although it is possible to restrict the range of valid tile center indices j if the application requires a narrower starting configuration. Formally, the initial condition for the DPM in Equation (12) is expressed as:
As the algorithm advances row by row (i.e., by increasing 𝘪), each grid point DPM_ ij , is updated using the maximum value from the set of neighboring columns in the previous row. Specifically, for a cell indexed by (𝘪, j), the algorithm looks up to two columns on each side—namely (𝘪 − 1, j − 2), (𝘪 − 1, j − 1), (𝘪 − 1, j), (𝘪 − 1, j + 1), (𝘪 − 1, j + 2)—in the preceding row 𝘪 − 1. It then adds the corresponding IslandTileMatrix value, IslandTileMatrix_ij, to the maximum among those potential predecessors. This process can be expressed as the recurrence relation in Equation (13):
At the same time, the TM stores the column index ĵ that produces this maximum value for each (𝘪, j). Once the algorithm has filled all rows of the DPM, the optimal path is reconstructed by locating the maximum entry in the final row and then iteratively backtracking through the TM. The resulting optimal path comprises an ordered list of (𝘪, j) pairs—each pair containing a tile length and a tile center index—that capture the globally optimal traversal through the IslandTileMatrix. This set of pairs can be denoted as follows:
Upon completing the traceback, the final sequence of tile lengths and tile center indices reflects the best balance between choosing larger tiles to capture more extensive signal and sufficiently small tiles to preserve the underlying local dynamics. All detailed implementation steps for this procedure are provided in the Supporting Information labeled “OptimalTilePathAlgorithm_Version1.0.py.”
Ranking and Mapping Isolated Tiles Back Onto Protein Structure
2.5
All three‐similarity metrics used in this study were ranked and then summed to form a single composite score (Figure 2B). Let N denote the total length of the protein structure under examination, and let each tile have length L, where 6 ≤ L ≤ N/2. Each tile in this length range was sorted in ascending order according to its composite score, and any tiles whose cosine similarity or sequence similarity fell below 50% was discarded. From the remaining candidates, the top 20% were written to temporary files, and once the tile length reached N/2, all of these partial outputs were concatenated into a single dataset. A subsequent refinement step then excluded tiles if they lacked cysteine residues altogether or if the spatial positions of those cysteine residues were inconsistent with typical iron–sulfur (Fe–S) bonding distances, specifically lying outside the range 2.18 Å ≤ r ≤ 2.35 Å [68, 69]. This final filtering measure helps avoid spurious matches in structurally unrelated motifs, such as Rossmann‐like folds or other β–α–β super‐secondary structures. Since each tile's residue indices map directly to those in the corresponding PDB file, the method produces a PyMOL script that highlights all identified matches on the protein structure, enabling straightforward visualization of these tiles in a three‐dimensional context.
Results and Discussion
3
Query and Target Selection
3.1
To identify ferredoxin subdomains within larger metalloproteins, we obtained structures and sequences from the Protein Data Bank (rcsb.org). The query structure was the 2[4Fe‐4S] ferredoxin from G. acidurici (PDB: 2FDN, N = 55) [60]. As target structures, we selected three larger metalloproteins: a formyl‐methanofuran dehydrogenase (Fwd) from Methanothermobacter wolfeii (PDB: 5T5I, N = 3622) [70], dihydropyrimidine dehydrogenase (DPD) from Sus scrofa (PDB: 1GTE, N = 2010) [71], and a methanogenic heterodisulfide reductase from Methanothermococcus thermolithotrophicus (PDB: 5ODC, N = 4011) [72]. Previous structure‐based approaches identified common ferredoxin‐like domains in Fwd, DPD, and the heterodisulfide reductase [18]. Furthermore, DPD and the heterodisulfide reductase contain variant iron–sulfur folds that share low structural similarity with ferredoxin.
For each query/target pair, we computed cross‐correlation matrices using the coarse‐grained ANM as implemented in ProDy [38, 63]. Each cross‐correlation matrix is of dimension N × N, where N denotes the total number of residues in the structure being analyzed. The entry in the 𝘪‐th row and j‐th column of this matrix reflects the dynamical correlation between the 𝘪‐th and j‐th residues. When generating a tile of length L from a query sequence, the residue indices in that tile map directly to the corresponding residue indices in the N × N cross‐correlation matrix, allowing for precise extraction of a relevant submatrix. This one‐to‐one mapping ensures that the subset of the matrix fully corresponds to the positions of amino acids in both sequence and structure. Dynamical and structural similarity between each query tile and target tiles in the larger protein was subsequently evaluated using two complementary metrics: cosine similarity and a Frobenius‐distance‐derived similarity score. By comparing these submatrices across sliding‐window tiles, the algorithm uncovered correspondences that highlight local ferredoxin‐like subdomains within broader protein folds.
Tile Length Selection in Dynamical Similarity Detection
3.2
The choice of tile length, denoted here as L, is critical for capturing the most informative range of local and global motions in dynamical similarity analyses. Shorter tiles provide higher resolution of localized dynamics but risk overlooking broader global features. Conversely, longer tiles resolve larger‐scale patterns, but they do so at the possible expense of losing key local motions. Balancing these two considerations allows for both sensitivity (the likelihood of detecting meaningful local correlations) and specificity (the avoidance of spurious matches driven by overly large tiles).
A similar dilemma has long been recognized in sequence alignments. Early heuristic programs such as FASTA introduced the idea of scanning for short k‐tuple “words,” and Altschul et al. [73] later embedded a tunable word size W in the BLAST algorithm to balance speed and sensitivity. They showed that shorter words increase sensitivity at a computational cost, whereas longer words accelerate searches but can miss important alignments. Although BLAST originally proposed a word size of four residues as a compromise [73], subsequent studies suggest that windows of 10 and 20 residues best preserve local amino acid composition while minimizing statistical noise in homology detection [74].
Likewise, in structural analyses, Parra et al. [61] pioneered a tiling approach and recommended a starting tile length of six. Tiles shorter than six residues were deemed too small for meaningful structural comparison based on alpha carbon traces, while a tile length of at least six effectively captures common structural elements such as beta strands (4–6 residues) and smaller alpha helices (6–40 residues) [75, 76]. Within this study, we retained this lower bound of L = 6 for identifying subdomains, but our upper bound was naturally set by the length of the query protein, N = 55. As a result, the largest tile considered spans the entire query (i.e., 55 residues), with different tile center indices explored along the length of the target proteins.
To determine the most suitable tile lengths for detecting meaningful dynamics, we performed a polynomial‐based non‐linear regression on the average weighted sum of our three‐similarity metrics (cosine similarity, sequence similarity, and Frobenius distance) as a function of L. We denote this function as S(L). By examining the derivative of S′(L), we identified points where changes in the average weighted sum stabilized, implying a balance between capturing local detail and broader motion. To enhance the detection of these stabilization points, the derivative was first smoothed with a Gaussian filter and then subjected to a rolling average. The optimal tile length for each path was taken as the smallest L for which the smoothed derivative approached zero, indicating diminishing returns in the average weighted sum. Because different paths—defined by their unique sequence and structural contexts—may converge at different rates, this approach was applied on a per‐path basis rather than relying on a single global threshold. An example illustrating this derivative‐based convergence in the formyl‐methanofuran dehydrogenase (Fwd) data set is provided in (Figure S3).
A Poly‐Ferredoxin Target
3.3
We considered the Fwd domain as a positive control for the comparative dynamics method (Figure S4). Fwd has 12 subunits—a dimer of heterohexamers (FwdABCDEFG). Of particular interest are the F and G subunits. These two subunits form an electron‐supplying core, with the FwdF dimer acting like a wire [70]. The FwdF subunit is a frank polyferredoxin with four ferredoxin domains binding a total of eight [4Fe–4S] clusters. This would be relatively straightforward to detect even by sequence or structure‐based methods. In contrast, the FwdG subunit contains one ferredoxin domain binding 2[4Fe–4S] clusters [70, 77, 78, 79], with a topology that diverges from the canonical fold (β–α–β half + α‐helical half). Additional iron–sulfur clusters exist in the dodecamer—such as in FwdB.
FwdF is a positive control with a previously described polyferredoxin fold [70]. After filtering for cysteines within bonding range of an iron–sulfur cluster in FwdF, we observe eight separate islands consistent with the original structural study detailing four embedded ferredoxin domains binding eight [4Fe–4S] clusters [70, 77, 78, 79]. With the exception of islands B, F, and G, the clear structural homology of the embedded domains to the β–α–β query allowed rapid convergence on optimal tile lengths and positions for each of the islands (Figure 3).
Ferredoxin domains in the FwdF subunit. The optimal path, as determined by the dynamic programming algorithm, is illustrated above via the red lines and squares in (A). The tile length and center index combinations along this path represent portions of the original structure that map back onto it, colored according to their tile island ID, as shown in (B). These optimal matches correspond to all of the ferredoxin halves contained within the subunit.
Two distinct islands (A and B) were observed in the FwdG tiling plot (Figure 4A). Both share dynamics, structure, and sequence with the β–α–β half of a bacterial ferredoxin. This is perhaps surprising, given that island A corresponds to an all alpha‐helical fold (Figure 4B). This is reflected in the optimum path through the island. Island A contains more branching paths than island B. These multiple paths at shorter tile lengths indicate local similarity between the query and target. Convergence at longer tile lengths suggests an optimal global match between the helical target and the β–α–β query. This indicates that comparative dynamics can sometimes supersede lack of clear sequence or structural homology.
Ferredoxin domains in the FwdG subunit. After calculating the weighted sum and applying the cysteine‐iron bond length constraints in (A, top), we're left with two distinct peaks which represent each half ferredoxin half. In (A, bottom), this is used as input for the dynamic programming algorithm which constructs the optimal path by maximizing the values of the weighted sum along each step. The highlighted portions of the fold in (B) are those sequences selected by the algorithm which have satisfied that requirement.
Ferredoxin Subdomains in DPD
3.4
DPD chain A contains multiple iron–sulfur binding domains, some of which are all‐helical instead of the typical β–α–β fold. Four islands are detected. Islands A and B correspond to two halves of the α‐helical domain (from position 50 to 162), and islands C and D correspond to the β–α–β fold (position 934 to 992) (Figure 5A). Optimal paths converge quickly around a tile length of 20, with the exception being island B, which fully converges around a tile length of 30. Interestingly, the optimal path for island D appears to converge to the boundary separating the two islands—indicating the best match to the query is a full 2(β–α–β) fold (Figure 5B)—whereas the optimal paths for islands A and B cover the entirety of the full α‐helical domain.
Ferredoxin domains in Chain A of DPD. After filtering the weighted sums with the cysteine‐iron bond length constraints, only four peaks remain. Each peak here references a ferredoxin half. Optimal matches obtained in (A) and mapped onto the structure in (B) pinpoint the ferredoxin domains precisely in DPD.
These non‐canonical ferredoxins are found in many other oxidoreductases. Pyruvate: ferredoxin oxidoreductase (PFOR) contains a 2(β–α–β) bacterial ferredoxin and an all‐helical quinol‐fumarate reductase ferredoxin in distinct regions of the protein [80]. Despite the structural differences, sequence similarity between the two ferredoxin domains in PFOR are substantial [80]. Instances such as this highlight the importance and utility behind using multiple similarity metrics for sequence, structure, and dynamics in comparative domain detection.
Ferredoxin Subdomains in the HdrABC–MvhAGD Complex
3.5
The heterodisulfide reductase‐NiFe hydrogenase linked enzyme (HdrABC–MvhAGD), found in the cytoplasm of most methanogenic archaea [81], is essential for the first step of methanogenesis. It reduces the heterodisulfide of coenzymes M, B, and ferredoxin through the oxidation of H_2_ [82]. The complex contains a multitude of bound iron–sulfur clusters of varying stoichiometries, including a non‐cubane [4Fe–4S] cluster and a [2Fe–2S] cluster. Further, the topologies of these ferredoxins are also varied, ranging from canonical half‐ferredoxins to all α‐helical ferredoxins (Figure 6B). These different [Fe–S] clusters all use cysteine as the first‐shell ligand and would be selected in the initial filtering step. This tests whether the dynamics‐based approach can identify [4Fe–4S] binding ferredoxins.
Optimal matches mapped onto HdrABC‐MvhAGD. (A) Ferredoxin domains in the expansive heterodisulfide reductase are outlined here in yellow. (B) Ferredoxin fold variety within the structure depicted as a cartoon. Domains here comprise of half ferredoxins (1xβαβ), truncated half‐ferredoxin pairs (1–2xβα), and all α‐helical ferredoxins.
The structure of Hdr is larger and more complex than the previous targets. Still, despite that, the algorithm quickly converged on several [4Fe–4S] ferredoxin domains (Figure 6A and Figure S5). No matches were found with [2Fe–2S] ferredoxin domains, suggesting that these plant‐type ferredoxins do not match bacterial ferredoxins at the sequence, structure, or dynamical level. Despite structural variations between corresponding chains of the two heterohexamers, the same tiles were identified.
In chains A and G, we find that the ferredoxin query maps onto portions of two Rossmann folds of Hdr, each binding one [4Fe–4S] cluster (Figure 7A,B). This observation is consistent with previous analyses suggesting a common ancestor to ferredoxin and Rossmann‐like folds [18]. The fact that these areas exhibit dynamical overlap, in addition to sequence and structural similarity, supports the hypothesis of shared ancestry. This also highlights that functionally similar units, as determined by their dynamic properties, can be used to infer homology.
Optimal matches within four islands in HdrG depict mixed Rossmann and Ferredoxin Domains. (A) Optimal matches from Islands A, B, E, and F are highlighted within the context of the greater structure. (B) Optimal matches from these islands are magnified and look like half‐ferredoxins that have been inserted into Rossmann Folds.
Dynamical vs. Structural Similarity
3.6
This study rests on the premise that, among the three key factors—sequence, structure, and dynamics—dynamics is the most conserved over deep evolutionary time as it is most directly linked to protein function. From this standpoint, we hypothesize that overlapping dynamic behavior implies functional similarity and that the long‐term conservation of these dynamics may indicate homology between structural units. To assess this quantitatively, we compared the structural (Frobenius) similarity against dynamical (cosine) similarity for all query‐target tile pairs of L = 30 residues in length (Figure 8).
Dynamical and structural similarity metrics exhibit a monotonic, curvilinear relationship. Cosine similarity is plotted against the Frobenius similarity across all tile center indices for tiles 30 residues in length. The distribution depicts a curvilinear relationship, with faster rates of change observed for cosine similarity at the lower right and for Frobenius similarity at the upper left. This pattern persists across each model's distribution, where an inflection point indicates a change in concavity, illustrating where structural similarity degrades at a rate that is incongruent with dynamical similarity.
In the resulting plot, ferredoxin‐Fwd query‐target pairs concentrate in the lower‐right portion, where both structural and dynamical similarity scores are relatively high. This pattern reflects the clear repetition of β–ɑ–β subdomains that form part of the larger Fwd protein. The relationship between structure and dynamics appears to be nonlinear: as structural similarity degrades, the corresponding dynamical similarity remains comparatively robust, suggesting that a protein's functionally related motions can persist even when homology is less pronounced (Figure S6). Hdr and DPD show generally weak similarity on both axes but follow the same nonlinear trend, supporting the idea that the dynamical features of ferredoxins embedded within larger oxidoreductases are preserved over deep time. Finally, residue ranges for each of the islands and their respective structures are highlighted in (Table S1).
Conclusion
4
Of protein sequence, structure, and dynamics, the last is most directly connected to function, and therefore evolutionary selection. Dynamical information coupled with sequence and structural attributes can identify bacterial ferredoxin regions within larger proteins. We have been able to isolate ferredoxin folds in target structures that are topologically dissimilar to the query. We find embedded ferredoxin domains within larger proteins, including all α‐helical topologies and β–α–β motifs inside Rossmann folds, which supports the inference of homology derived from conserved functional dynamics.
The success of our method depends on the appropriate selection of tile length. When applying this method to domains embedded within other proteins, we ensure that the maximum permissible tile length is bounded by the length of the parent domain or input query structure. In this work, the query structure, a 2×[4Fe–4S] bacterial ferredoxin, remained consistent while the target structures varied. Therefore, the tile length was constrained to a range between a minimum of 6 residues and a maximum of 55 residues. However, heme‐binding cytochromes are also well characterized [83, 84], and include both small, soluble electron carriers and embedded cytochrome folds within larger enzymes and protein nanowires. Given the centrality of redox catalysis in the evolution of metabolism, we seek to improve upon and extend this approach to structures beyond ferredoxins, such as globin folds or cytochrome P450s, where their tile lengths would range from 6 to the maximum length of the parent domain.
Finally, while the dynamics captured by ANM are invaluable for understanding the large‐scale, collective motions of proteins, it is important to recognize its inherent limitations. This model is based on NMA, which operates under the assumption that atomic displacements occur within a harmonic potential. This assumption restricts its ability to explore the full potential energy surface (PES). As a result, ANM primarily samples the immediate vicinity around the equilibrium state, leaving out more complex, functionally relevant motions that may occur in regions of the PES characterized by anharmonicity.
Anharmonic modes encompass movements that extend beyond this harmonic approximation and involve navigating broader, more complex areas of the PES. These areas include features such as energy barriers, non‐linear paths, and multiple local minima. It has been shown that ANM, when complemented by MD simulations, can capture a greater range of functionally relevant dynamics [49]. Furthermore, previous work has enhanced NMA by incorporating anharmonic aspects, sampling an anharmonic potential function along the directions of the eigenvectors of the lowest normal modes [85].
In this work, we have used classic ANM for our dynamics calculations because many of the structures are too large and contain metals that require specific parameters to model accurately. However, it would be worthwhile in future work to consider the impact of anharmonicity on homology detection via methods such as ANMA. Doing so would provide us with a more comprehensive understanding of the dynamics and improve the prediction of functionally relevant conformational changes.
Author Contributions
Jan A. Siess: conceptualization, data curation, formal analysis, investigation, methodology, software, validation, writing – original draft, writing – review and editing, visualization. Vikas Nanda: conceptualization, methodology, project administration, resources, writing – original draft, writing – review and editing, funding acquisition, investigation, supervision.
Conflicts of Interest
The authors declare no conflicts of interest.
Peer Review
The peer review history for this article is available at https://www.webofscience.com/api/gateway/wos/peer‐review/10.1002/prot.70004.
Supporting information
Data S1. Supporting Information.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1J. E. Goldford and D. Segrè , “Modern Views of Ancient Metabolic Networks,” Current Opinion in Systems Biology 8 (2018): 117–124.
- 2M. Bashton and C. Chothia , “The Generation of New Protein Functions by the Combination of Domains,” Structure 15, no. 1 (2007): 85–99, 10.1016/j.str.2006.11.009.17223535 · doi ↗ · pubmed ↗
- 3J. A. Gerlt and P. C. Babbitt , “Divergent Evolution of Enzymatic Function: Mechanistically Diverse Superfamilies and Functionally Distinct Suprafamilies,” Annual Review of Biochemistry 70 (2001): 209–246, 10.1146/annurev.biochem.70.1.209.11395407 · doi ↗ · pubmed ↗
- 4F. Baymann , E. Lebrun , M. Brugna , B. Schoepp‐Cothenet , M. T. Giudici‐Orticoni , and W. Nitschke , “The Redox Protein Construction Kit: Pre‐Last Universal Common Ancestor Evolution of Energy‐Conserving Enzymes,” Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 358, no. 1429 (2003): 267–274, 10.1098/rstb.2002.1184.12594934 PMC 1693098 · doi ↗ · pubmed ↗
- 5A. K. Garcia , D. F. Harris , A. J. Rivier , et al., “Nitrogenase Resurrection and the Evolution of a Singular Enzymatic Mechanism,” e Life 12 (2023): 1–18, 10.7554/e Life.85003.PMC 997727636799917 · doi ↗ · pubmed ↗
- 6A. K. Garcia , B. Kolaczkowski , and B. Kaçar , “Reconstruction of Nitrogenase Predecessors Suggests Origin From Maturase‐Like Proteins,” Genome Biology and Evolution 14, no. 3 (2022): evac 031.35179578 10.1093/gbe/evac 031PMC 8890362 · doi ↗ · pubmed ↗
- 7S. L. Schwartz , A. K. Garcia , B. Kaçar , and G. P. Fournier , “Early Nitrogenase Ancestors Encompassed Novel Active Site Diversity,” Molecular Biology and Evolution 39, no. 11 (2022): msac 226.36260513 10.1093/molbev/msac 226PMC 9641968 · doi ↗ · pubmed ↗
- 8L. A. David and E. J. Alm , “Rapid Evolutionary Innovation During an Archaean Genetic Expansion,” Nature 469, no. 7328 (2011): 93–96.21170026 10.1038/nature 09649 · doi ↗ · pubmed ↗
