Direct Phasing of Protein Crystals with Continuous Iterative Projection Algorithms and Refined Envelope Reconstruction
Yang Liu, Ruijiang Fu, Wu-Pei Su, Hongxing He

TL;DR
This paper introduces new algorithms and methods to improve the accuracy of solving protein crystal structures with limited solvent content.
Contribution
The study introduces continuous iterative projection algorithms and a refined envelope reconstruction method to enhance direct phasing of protein crystals.
Findings
The refined envelope scheme increased success rates of continuous algorithms by 45.7% and classical algorithms by 60.5%.
Integrating a genetic algorithm improved success rates by 2.5-fold and reduced phase errors by 6.83°.
The new methods enabled atomic models with backbone RMSD typically below 0.5 Å compared to PDB structures.
Abstract
Direct methods provide a model-free approach to solving the crystallographic phase problem and deliver unbiased atomic structures. However, conventional iterative projection algorithms such as Hybrid Input–Output (HIO) face two critical challenges: discontinuous density modification at the protein-solvent boundary and inaccurate molecular envelope reconstruction that fails to account for trapped solvent, particularly in crystals with solvent content approaching the lower limits of direct phasing applicability. We introduced four continuous iterative projection algorithms, including our improved continuous version, which implements smooth density modification at protein-solvent interfaces. To address envelope inaccuracy, we developed a two-step refined reconstruction scheme using sequential large-radius and small-radius Gaussian filters to identify trapped solvent molecules within…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11
Figure 12
Figure 13
Figure 14
Figure 15
Figure 16Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsEnzyme Structure and Function · Protein Structure and Dynamics · Machine Learning in Materials Science
1. Introduction
Protein crystallography remains the primary method for determining three-dimensional structures of biological macromolecules through X-ray diffraction. While diffraction experiments record structure factor amplitudes, the phase information is inherently lost, creating the phase problem. Conventional structure determination approaches, such as molecular replacement, depend on AlphaFold-predicted models [1,2,3] or homologous structures, which may introduce model bias, while experimental phasing methods require heavy atom derivatization and considerable experimental investment. In contrast, direct methods operate independently of prior structural information, offering an unbiased route to structure determination. This makes them valuable for validating predicted structures and for determining novel protein folds through ab initio approaches. Direct methods iteratively enforce physical constraints in real space, such as uniform bulk solvent density and characteristic protein density distribution, while simultaneously satisfying experimental diffraction data in reciprocal space until convergence to the true electron density is achieved.
Direct phasing methods have evolved from small-molecule techniques to macromolecular approaches. Early methods for small molecules exploited statistical relationships in diffraction data, including Sayre’s equation [4], triplet phase relationships [5], tangent formula [6,7], and minimal principle [8], implemented in software such as SHELX [9]. However, these classical techniques require atomic-resolution data better than 1.2 Å and are limited to structures containing fewer than 1000 non-hydrogen atoms [10], rendering them unsuitable for most protein crystals that diffract to resolutions around 2 Å and contain thousands of non-hydrogen atoms. Traditional methods could only provide low-resolution structural information for macromolecules [11]. The breakthrough came with recognizing that additional physical constraints must be maximally exploited, the most powerful being the nearly constant electron density of bulk solvent regions in protein crystals. This led to the development of iterative projection algorithms (IPAs) designed to leverage this constraint.
IPAs are primary computational tools for the direct solution of macromolecular structures by utilizing constraints in both real and reciprocal space. Seminal contributions include the Hybrid Input–Output (HIO) algorithm by Fienup [12] and the Difference Map (DM) algorithm by Elser [13,14], both applied to ab initio protein structure determination. Millane et al. demonstrated using HIO for phasing periodic structures [15,16,17]. Miao et al. and Marchesini et al. demonstrated using HIO for phasing non-periodic structures [18,19]. Lunin et al. explored the use of density connectivity for searching low-resolution structures [20]. Liu et al. have shown that given a reasonable molecular envelope, HIO can recover atomic-resolution protein structures [21]. Further developments include automated envelope generation for HIO by He and Su [22], the application of DM to proteins and viruses by Lo and Millane [23,24,25], workflows incorporating envelope clustering by Kingston et al. [26,27], and partial structure completion with machine-learning models by Pan et al. [28]. Our previous research introduced ab initio non-crystallographic symmetry (NCS) averaging [29], the transition-region-based THIO algorithm [30], and advanced strategies including resolution-weighted phasing [31] and genetic algorithm co-evolution [32]. Despite these advances, existing IPA-based methods remain largely restricted to structures with solvent content exceeding 65% [32]. However, statistical analysis of the Protein Data Bank (PDB) reveals that only 9.6% of deposited structures exceed this threshold (see Appendix A, Figure A1 for the distribution of solvent content). We categorize crystals as low solvent content (<45%), medium solvent content (45–55%), or high solvent content (>55%), with each category comprising approximately one-third of PDB-reported crystal structures. This distribution highlights the need for methodological improvements to extend applicability toward lower solvent content within the high-solvent range. Two fundamental limitations of current IPAs restrict their broader application to lower-solvent-content and structurally complex systems.
The first limitation arises from discontinuous density modification at the protein-solvent boundary. Traditional IPAs apply different update rules to protein regions, where density is preserved or adjusted through histogram matching [33], versus bulk solvent regions, where strong negative feedback forces density toward zero. This binary treatment imposes an abrupt mathematical discontinuity at the molecular boundary that contradicts the continuous nature of electron density, propagating errors throughout iteration and hindering convergence. The second limitation concerns inaccurately reconstructed molecular envelopes. Starting from random phases, precisely delineating the true protein boundary during early iterations is difficult. Consequently, researchers employ loose, oversized envelopes to ensure complete protein enclosure. However, this approach encompasses substantial volumes of discrete solvent molecules residing within surface cavities and internal channels. Conventional algorithms mistakenly treat these trapped solvent regions as protein density, failing to apply the powerful constant-density constraint these regions would otherwise provide, representing a waste of valuable prior information. When the overall solvent content is already limited, this inefficiency becomes detrimental to successful phase recovery.
To address these challenges, we developed a framework that enhances phase recovery capability and reliability through innovations at three levels. At the algorithmic level, we introduce continuous IPAs, including established algorithms Continuous Hybrid Input–Output (CHIO) [34], Hybrid Projection Reflection (HPR) [35], and Transition Hybrid Input–Output (THIO) [30], along with our improved variant, Modified CHIO (MCHIO). These algorithms establish transition zones between protein and solvent regions, enabling smooth density modification. At the methodological level, we propose a two-step refined envelope reconstruction scheme. This scheme employs a coarse-to-fine strategy using sequential large-radius and small-radius Gaussian filters to identify and recover solvent erroneously included within coarse envelopes. By incorporating recovered solvent into constrained regions, the scheme maximizes utilization of available information. This methodology enhances performance for both continuous and classical IPAs, including HIO, DM, and our improved versions, Modified Difference Map (MDM) and Modified Relaxed Averaged Alternating Reflections (MRAAR). At the strategic level, we performed optimization comparing three phasing strategies: conventional full-resolution, resolution-weighted progressive [31], and genetic algorithm (GA) co-evolution [32]. The GA strategy establishes information-sharing and inheritance mechanisms among multiple independent reconstruction processes, achieving enhancement in global search capability that elevates success rates.
This work makes several contributions. We provide the first comparison and validation of multiple continuous IPAs for protein direct phasing, establishing their advantages and practical performance. We introduce the two-step refined envelope reconstruction scheme as a method of elevating baseline performance across various algorithms. We develop improved iterative algorithms, including MDM, MCHIO, and MRAAR. Through comprehensive testing on 28 protein structures with diverse space groups, solvent contents ranging from 55% to 78%, resolutions spanning 1.46 to 3.2 Å, and PDB-reported R-work below 0.22, we quantify performance gains under different strategies. For structures with favorable conditions, accurate diffraction data, and suitable molecular packing, our methods successfully phase structures and extend applicability to the lower boundary of the high-solvent-content range. Linear regression analysis suggests the potential applicability boundary reaches approximately 55% solvent content, compared with traditional approaches that typically require solvent contents above 65%. This extends the accessible pool from 9.6% (structures > 65%) to potentially 32.3% (structures > 55%) of PDB structures (Appendix A, Figure A1) when data quality and structural characteristics permit, though success rates decrease substantially as solvent content approaches this lower limit.
While the methods presented extend the practical boundaries of direct phasing, we acknowledge their scope and limitations. Success remains dependent on adequate solvent content and sufficient data quality. Although the GA strategy enhances success rates, it requires greater computational resources, and its effectiveness depends on at least one individual achieving convergence. This does not eliminate the dependence of phase recovery on solvent content constraints. For crystals with solvent content approaching or below 55%, integration with higher-quality diffraction data, complementary experimental information, or additional physical constraints will likely remain necessary. Nevertheless, this work provides more powerful computational tools and mechanistic understanding for addressing challenging crystallographic problems, advancing model-free phase retrieval methods toward broader practical applications in structural biology.
2. Materials and Methods
2.1. Overall Workflow and Key Operations in Direct Phasing
Direct phasing in protein crystallography cyclically enforces constraints in both real space and reciprocal space through iterative alternation between these two domains. Figure 1 illustrates the workflow of this process. The algorithm begins with structure factor amplitudes obtained from experimentally measured diffraction intensities, where denotes the reciprocal lattice indices. Since phase information is unavailable from the experiment, an initial set of random phases is generated and combined with the observed amplitudes to construct the initial complex structure factors: .
The algorithm then enters an iterative loop where constraints are applied to improve the phase estimates. At the k-th iteration, the current electron density is first subjected to reciprocal space constraints. This begins with a Fourier transformation converting the real-space density into reciprocal space:
The experimental amplitude constraint is enforced by replacing the calculated amplitude with the observed amplitude while preserving the calculated phase , yielding an updated structure factor:
This amplitude replacement ensures that the electron density satisfies the experimental diffraction data at each iteration. The modified structure factor is then transformed back to real space through inverse Fourier transformation, producing an updated density:
where V represents the unit cell volume, and denotes the real-space coordinate vector.
Following reciprocal space operations, constraints are applied in real space through a procedure comprising two components: molecular envelope reconstruction and density modification based on the reconstructed envelope. The envelope reconstruction step aims to identify the approximate spatial region occupied by the protein molecule, the envelope domain S, from the current electron density , which may still contain inaccuracies. This is achieved by computing a weighted average density through Gaussian low-pass filtering of :
In this expression, denotes a Gaussian kernel with standard deviation that smooths local density fluctuations and emphasizes the contiguous regions of macromolecular structure. Using the estimated solvent content derived from the Matthews coefficient [36] based on unit cell parameters and protein molecular weight, a threshold value is determined from the distribution of values throughout the unit cell. Regions satisfying the condition are then assigned to the protein region S, although this initial envelope remains coarse. Depending on whether a classical or continuous iterative projection algorithm is employed, a transition region T may be defined at the protein-solvent interface. Once the envelope domain S for the current iteration has been established, which may include the transition region T for continuous algorithms, the density modification rules of the chosen algorithm are applied to to generate the updated density for the subsequent iteration. This density modification step is the core mathematical operation of the algorithm and enforces physical constraints: the protein region density should conform to expected statistical distributions, achieved through histogram matching [33], while the solvent region density should approach a constant value [37].
The iterative cycle repeats for several thousand iterations until convergence is achieved. Throughout this process, multiple quality metrics are computed to monitor progress and evaluate convergence. These include the working R-factor , the free R-factor , the density deviations in protein and bulk solvent regions, and, when a reference structure is available, the mean phase error and the intersection-over-union ratio of the envelope. Successful convergence is characterized by the reduction of both and to stable low values, accompanied by improvements in the other monitored metrics. The result of this iterative refinement is an electron density map representing the protein structure. This density map can be utilized by automated model-building software packages such as ARP/wARP version 8.0 [38,39], Buccaneer [40] from the CCP4 software suite version 9 [41], or Phenix (version 1.16-3549) AutoBuild [42,43] to construct and refine the atomic model, producing a three-dimensional protein structure.
2.2. Mathematical Framework and Continuous Modification of Iterative Projection Algorithms
The mathematical foundation of iterative projection algorithms centers on identifying a solution that satisfies two constraint sets. In crystallographic direct phasing, these are the real-space constraints , which encompass the statistical characteristics of protein density distribution and the uniform nature of solvent regions, and the reciprocal-space constraints , which represent the experimentally measured diffraction amplitudes. The objective is to refine the electron density through repeated application of projection operators and until it converges to a state satisfying both constraint sets.
2.2.1. Partitioned Update Form of Classical Iterative Projection Algorithms
Classical iterative projection algorithms such as the Hybrid Input–Output (HIO) algorithm are implemented using a partitioned update strategy adapted to protein crystals. This approach relies on the premise that at the k-th iteration, a molecular envelope domain can be estimated from the current electron density . The projection operator incorporates the following: within this envelope domain , constraints appropriate to protein density are enforced through histogram matching procedures, while in the region outside corresponding to the bulk solvent, a constant-density constraint is applied through solvent flattening operations.
The HIO algorithm [12] exemplifies this partitioned approach, with its discretized form for protein crystallography expressed as
where represents the negative feedback factor applied in the solvent region, typically assigned values between 0.7 and 0.9, with used throughout this study. In this formulation, denotes the reciprocal-space projection corresponding to the amplitude replacement operation defined in Equation (2), while represents the real-space constraint operations that include histogram matching within the protein region and solvent flattening in the solvent region. We have inserted into the partitioned form of HIO. This formulation reveals the origin of the discontinuity problem: at the envelope boundary, the density update rule abruptly transitions from inside the protein region to in the solvent region, potentially introducing a discontinuous step change in the updated density across this boundary.
2.2.2. Continuous Iterative Projection Algorithms: Introduction of a Transition Region
To address the discontinuity problem in classical algorithms, continuous iterative projection algorithms introduce a transition region at the k-th iteration, positioned at the interface between the protein region and the bulk solvent region. This partitioning divides real space into three zones: the protein core region , the interfacial transition region , and the bulk solvent region. The objective of these continuous algorithms is to implement a smooth update strategy within the transition region that provides gradual and continuous modulation between the density modification rules applied in the protein and solvent regions. We implemented and evaluated four continuous algorithms that achieve this objective through different mathematical approaches. We have updated the original forms of those algorithms to make them suitable for protein crystallography.
The Continuous Hybrid Input–Output (CHIO) algorithm [34] extends the framework of HIO by introducing intermediate feedback behavior in the transition region. Its update rule is formulated as
where the transition feedback parameter is defined as . In the original CHIO formulation [34], the parameter is typically set to 0.4, which yields and results in negative feedback that is stronger in the transition region than in the bulk solvent region where .
The Hybrid Projection Reflection (HPR) algorithm [35] employs a reflection-projection framework and can be expressed in partitioned form as
where is an algorithm-specific parameter set to 0.588 in the original formulation [35]. In typical partitioned implementations, the HPR algorithm applies the same feedback factor uniformly across both the transition region and the bulk solvent region, with its continuity manifesting through consistent treatment of the entire region outside the protein core .
Based on our analysis of CHIO and HPR algorithms, we developed the modified CHIO (MCHIO) algorithm to address a limitation. We observed that the relatively strong feedback in CHIO’s transition region, with , might be excessive when contains portions of protein side-chain density, potentially leading to inappropriate suppression of structural features. To address this, we propose a modified update rule:
where represents the adjusted feedback factor for the transition region. To achieve more appropriate behavior that is gentler than bulk solvent treatment while effectively leveraging the solvent constraint, we set throughout this study. This value is chosen to be lower than both and , thereby implementing a gentler, more conservative density modification in the transition region compared with the bulk solvent. This adjustment aims to avoid excessive suppression of potential protein density features that might be incorrectly assigned to the transition region , while still effectively applying the solvent-like constraint.
The Transition Hybrid Input–Output (THIO) algorithm, introduced in our previous research [30], implements a density-weighted assignment approach within the transition region. Its updated formulation is
where the weight function takes values in and is computed based on the weighted average density at each grid point using a small Gaussian kernel radius of approximately Å. This weight represents the local probability that a given position belongs to the protein rather than solvent. Through this linear interpolation scheme, THIO achieves continuous and physically motivated density modulation throughout the transition region, with the update rule smoothly varying based on local density characteristics.
The fundamental concept unifying all these continuous algorithms is the mathematical refinement of electron density update behavior across the molecular boundary to achieve smoother transitions. This enhancement is designed to improve numerical stability and increase the likelihood of convergence when dealing with ambiguous boundaries and limited solvent content. Figure 2 provides a schematic comparison of the update rules employed by these algorithms, illustrating their distinctive characteristics and the progressive refinement of boundary treatment strategies.
2.2.3. Classical Iterative Projection Algorithms and Improved Variants
The two-step refined envelope reconstruction scheme demonstrated in the subsequent section exhibits universal applicability, providing performance enhancement for both continuous and classical iterative projection algorithms. To establish a comprehensive algorithmic framework, we evaluated classical iterative projection algorithms, including HIO, DM, ASR, and RAAR, and developed improved variants, MDM and MRAAR, optimized for protein crystallography. The HIO algorithm has been presented in the previous section, following the update rule in Equation (5).
The Difference Map (DM) algorithm [13] generates two candidate solutions at the k-th iteration:
These solutions satisfy real-space constraint set and reciprocal-space constraint set , respectively. The DM iteration formula is
where is the relaxation parameter (typically 0.5 to 1.0, set to 0.75 in this study) controlling convergence speed and search capability. Convergence is achieved when the two candidate solutions become equal. Considering partitioned updates for protein region and solvent region, the DM formula can be expressed as
Experimental results revealed that DM exhibits low success rates for protein phase retrieval, primarily due to insufficient constraint enforcement within the protein region during iteration. To address this, we propose an improved variant that explicitly applies constraint operators and to the protein region:
This improved formulation, designated Modified DM (MDM), strengthens constraint application in the protein region and enhances phase retrieval performance.
The Averaged Successive Reflections (ASR) algorithm [44] employs reflection operators and , where I is the identity operator. The ASR update rule is
Substituting the reflection operators and considering partitioned updates yields
The Relaxed Averaged Alternating Reflections (RAAR) algorithm [45] introduces a relaxation parameter (typically 0.95) with the update rule:
In partitioned form, this becomes
Both ASR and RAAR exhibited low success rates in our experiments, attributable to insufficient constraint enforcement in the protein region. We developed an improved variant by setting , which reduces RAAR to the ASR formulation, then explicitly applying operators to the protein region:
This improved algorithm, designated Modified RAAR (MRAAR) or Modified ASR (MASR), enhances performance. This formulation is mathematically equivalent to the Hybrid Difference Map - formula 1 (HDM-f1) algorithm developed in our concurrent work [46], which was derived from a different starting point through modification of the DM framework, representing a convergence of approaches toward optimal algorithm design. The improved classical algorithms, when combined with the two-step refined envelope reconstruction scheme, expand the available toolkit for direct phasing of protein crystals.
2.3. Envelope Reconstruction Strategies: From One-Step Coarse Design to Two-Step Refined Design
The effectiveness of iterative projection algorithms, their capacity to enforce constraints within solvent regions, depends on the accuracy of the reconstructed protein molecular envelope. Traditional envelope reconstruction employs a single-step approach in which the k-th iteration’s electron density is convolved with a Gaussian kernel, usually featuring a large radius such as Å, to produce a smoothed weighted average density map . Based on the estimated solvent content derived from the Matthews coefficient [36], a threshold value is determined from the statistical distribution of values. Grid points satisfying the condition are classified as belonging to the protein region, while the remaining volume is the solvent region. This method offers computational simplicity and can rapidly delineate the general molecular shape during early iteration stages. However, it suffers from limitations that become problematic for challenging structures. The large-scale smoothing operation erases fine structural details at the molecular surface, resulting in boundaries that are rough and imprecise. The resulting coarse envelope tends to be over-expanded to ensure complete enclosure of the protein molecule, which incorporates volumes of solvent molecules that are situated within surface crevices, pockets, and internal channels. In classical iterative algorithms, these incorrectly assigned solvent regions are treated as protein density throughout the modification process, failing to exploit the constant-density constraint that these regions would otherwise provide. This inefficient use of available information becomes detrimental when the overall solvent content is limited.
To overcome these deficiencies and maximize the utilization of all available solvent constraints within the crystal unit cell, we developed a two-step refined envelope reconstruction scheme. The concept underlying this scheme is a strategy that proceeds from coarse localization to fine-scale local refinement. This approach provides a more physically rational foundation for defining the transition region in continuous algorithms and functions as a universal technique capable of enhancing the performance of all iterative projection algorithms.
The first step focuses on coarse envelope generation following an approach similar to the traditional single-step method. A Gaussian kernel with a large radius, set to Å in this study, is applied to smooth the k-th iteration’s electron density , yielding a globally smoothed density distribution . Using the estimated solvent content , a threshold value is determined to define an initial coarse envelope that serves as the protein candidate region:
The objective of this initial step is to capture the protein’s center of mass and overall molecular shape while avoiding artificial fragmentation of the envelope that could arise from localized density fluctuations. By employing a large smoothing kernel, this step ensures robust identification of the main protein volume even when the current density estimate contains noise or systematic errors.
The second step is the refinement stage that distinguishes our scheme from traditional approaches. Here, a second smoothing operation is performed on the same electron density using a Gaussian kernel with a smaller radius, set to Å in this work, which produces a locally refined density map that preserves fine-scale density variations. All subsequent operations in this step are confined to the interior of the coarse envelope established in the first step. Within this domain, the values of at all grid points are ranked by magnitude, and grid points in the lowest approximately 5% (of the asymmetric unit) are identified as transition region , most likely to represent trapped solvent rather than protein density. This 5% threshold was determined empirically from our benchmark set and proved robust across diverse protein structures. For truly novel structures without homologous or predicted references, this value can be adjusted within a range of 3–7% through systematic trials, as the method shows tolerance to moderate variations in this parameter. The identified low-density regions are then removed from the protein candidate volume to yield refined spatial assignments. We define the refined protein core region as , representing the volume that remains after removing the identified solvent-like regions from the coarse envelope. The refined transition region is defined as
The bulk solvent region comprises both the external volume lying outside and the internal stripped region , reclassified based on low local density. The transition region identifies trapped solvent molecules within surface cavities and internal channels, enabling the application of solvent-like constraints during iteration. This enhances convergence for crystals with limited solvent content by maximizing utilization of available constraint information. However, although physically consisting of trapped solvent, the transition region exhibits slightly non-uniform density due to proximity to the protein surface. Negative feedback applied during iteration drives density toward uniform values, deviating from actual surface solvent density. This slightly non-uniform density is incompatible with the strict uniform-density requirements of comprehensive solvent flattening [37] in final-stage refinement. Therefore, in the last 2000 iterations, the transition region is linearly reduced from 5% (of the asymmetric unit) to zero over 1500 iterations, followed by 500 iterations of solvent flattening applied uniformly to all trials. Removing the transition region eliminates density constraint errors from surface-proximal solvent and enables optimal uniform-density enforcement in true bulk solvent regions.
This two-step scheme offers three advantages over traditional single-step methods. First, the small-radius filter is sensitive to local low-density features corresponding to surface cavities and internal channels, enabling accurate identification of trapped solvent. Second, the scheme maximizes constraint utilization by identifying trapped solvent within the coarse envelope. Although the transition region is spatially adjacent to protein, it exhibits low-density characteristics of solvent. During iteration, this region is treated with solvent constraints, such as negative feedback in continuous algorithms, increasing the effective constraint volume. As discussed above, the transition region should be removed in the final refinement stage to enable strict solvent flattening. Third, for continuous algorithms, the transition region delineated through refined local density analysis is less likely to contain protein side-chain density compared with regions defined by traditional single-step approaches, improving the purity and effectiveness of applied constraints.
Section 3.3 provides a direct visual comparison using protein structures 3rd5 [47] and 2fg0 [48] as examples, demonstrating that envelopes generated by our two-step method reconstruct finer and more accurate molecular surface details compared with those produced by the traditional one-step approach. Although multi-step refinement with progressively decreasing kernel sizes was tested, we found that the two-step approach represents an optimal balance between envelope accuracy and computational efficiency. Additional refinement steps beyond two did not improve phasing success rates, as both one-step and two-step envelopes represent approximations to the true molecular boundary, and the two-step scheme already provides sufficient accuracy for iterative projection algorithms to converge to correct solutions.
2.4. Phase Retrieval Strategies: From Full-Resolution to Genetic Co-Evolution
We implemented and compared three phasing strategies to search the solution space for correct phases. These strategies optimize data utilization and search organization to enhance success rate and computational efficiency.
2.4.1. Full-Resolution Phasing Strategy
The full-resolution approach represents the fundamental strategy. All available experimental diffraction data, excluding only a reserved free set for cross-validation, participate equally in the reciprocal-space amplitude constraint at every iteration. This strategy applies no preprocessing or weighting to diffraction data, requiring the algorithm to simultaneously satisfy constraints spanning the entire resolution range. The advantages are simplicity and straightforward implementation. However, for high-dimensional optimization problems, requiring immediate fitting of structural information across all spatial scales can create convergence difficulties and entrapment in local minima for structures with limited solvent content.
2.4.2. Resolution-Weighted Progressive Phasing Strategy
The resolution-weighted progressive strategy implements a hierarchical data utilization scheme prioritizing low-resolution diffraction information during initial iterations, then introducing high-resolution data as convergence progresses. This is implemented through a time-dependent weighting function applied to observed structure factor amplitudes:
where represents the reciprocal of resolution spacing, and is a time-varying parameter controlling filter bandwidth. Initially, is set to 0.8∼1.0 Å, attenuating high-resolution contributions. As iterations progress, is reduced toward zero following a predetermined annealing schedule [31]. This progressive expansion from low to high spatial frequencies follows the logic of first establishing global protein fold before resolving atomic positions, facilitating the establishment of correct low-resolution phase relationships that provide a stable foundation for convergence.
2.4.3. Genetic Algorithm-Enhanced Co-Evolution Phasing Strategy
To overcome limitations of independent random-start searches, we implemented a genetic algorithm co-evolution strategy using population-based intelligence. This strategy treats multiple parallel phase retrieval processes, typically 100 individuals, as an evolutionary population, with each individual’s electron density map representing an organism. By simulating biological evolution through selection, crossover, and mutation, this approach establishes information exchange mechanisms enabling high-quality density features to propagate throughout the population, enhancing global search capability.
The workflow proceeds through interconnected stages. Population initialization generates N individuals, each with random phases ensuring diversity. During independent evolution, all individuals execute a predetermined number of iterations using standard iterative projection algorithms in parallel, employing envelope reconstruction schemes and resolution-weighted strategies. This phase is a local search within each individual’s solution space neighborhood. Periodically, typically every 100 iterations, the algorithm performs evaluation and selection. Individual fitness is quantified through
where and represent the best and the average within the population, and . Higher fitness individuals have greater probability of genetic operations.
Genetic operations implement two mechanisms. During crossover, two parents are selected according to fitness-weighted probabilities. The asymmetric unit is divided into spatial blocks, and a chosen subset undergoes density value exchange between parents to generate offspring. Prior to crossover, all density maps undergo rotational and translational alignment to eliminate crystallographic origin and enantiomorphic ambiguities. Mutation operations introduce stochasticity by randomly selecting approximately 1% of grid points in offspring density maps and assigning new random values, introducing variation, and preventing premature convergence. The resulting offspring typically exhibit improved fitness relative to the parent population. Population update then replaces low-fitness individuals with these offspring, forming the next generation. The algorithm then returns to independent evolution for another round of local search, followed by global information exchange. More details, such as elite inheritance and similarity punishment preventing prematurity, can be found in our previous paper [32].
The effectiveness of this strategy derives from synergistic effects between local refinement and global information sharing. Once any individual approaches the correct solution, manifested as a sharp R-factor decrease, high-quality density features disseminate throughout the population via crossover. This collective guidance enables population convergence toward the global optimum at rates exceeding independent random searches. The GA strategy upgrades traditional multi-start independent searching to intelligent multi-start cooperative searching with active information sharing, representing an advancement for enhancing the reliability of direct-method phase retrieval in challenging applications.
2.5. Error Metrics, Missing Reflections, and Model Building
We employed a validation pipeline with quality metrics to monitor convergence, strategies to handle incomplete data, and automated model building from recovered density maps.
2.5.1. Error Metrics
Assessment metrics are organized into two categories: reference-dependent metrics for validation during method development and internal consistency metrics applicable to de novo structure determination. Reference-dependent metrics provide validation when PDB coordinates are available. The mean phase error quantifies the angular deviation between retrieved and true phases:
where denotes averaging over unique reflections in the working set. The envelope intersection-over-union metric assesses spatial accuracy of the reconstructed molecular boundary:
quantifying the overlap ratio between reconstructed envelope domain and true envelope domain .
Internal consistency metrics serve as criteria for evaluating convergence when reference coordinates are unavailable. The working R-factor , defined in Equation (24), and free R-factor measure agreement between calculated and experimental amplitudes:
where is computed using a randomly selected subset (1% of reflections in this study), excluded from all iterative processes as an independent validation dataset. Successful convergence is characterized by reduction of both and to stable low values. Density convergence indicators monitor the magnitude of real-space density modifications. For HIO-type algorithms, we compute for grid points . For MRAAR-type algorithms, we compute
where denotes averaging over grid points in each region. As convergence proceeds, these update magnitudes diminish toward zero, indicating stable density satisfying constraints.
2.5.2. Handling of Missing and Weak Diffraction Data
Diffraction datasets contain missing reflections due to various factors: reflections in the free set, low-resolution reflections (below 15 Å), and weak reflections failing signal-to-noise criteria ( ). To handle such incomplete data, we implemented a dynamic mixing strategy. For missing or weak reflections, calculated amplitude is blended with experimental values using weight factor :
where is a global scale factor. For missing data, ; for weak data, . This strategy ensures reciprocal-space data completeness, preventing information loss and projection artifacts from data gaps.
2.5.3. Electron Density Map Post-Processing and Automated Model Building
After convergence, we applied post-processing to enhance density map quality. When multiple solutions converged successfully through GA strategy, we performed spatial alignment and density averaging to reduce noise. The resulting electron density map was used for automated model building, which identifies secondary structure elements, constructs polypeptide backbones, and positions side-chain atoms based on density features and stereochemical constraints. The initial model exhibits over 80% residue completeness and is refined through several iterations using phenix.refine [49], optimizing atomic coordinates and displacement parameters against experimental data to yield the final structural model.
2.6. Test Datasets, Computational Implementation, and Parameter Settings
To ensure robustness, we assembled a benchmark dataset of 28 protein crystal structures with diversity in space group symmetry, solvent content, and diffraction resolution. All diffraction data ( ) and reference coordinates were obtained from the Protein Data Bank. The structures comprise proteins containing 826 to 7475 non-hydrogen atoms with 0 to 635 bound water molecules. The dataset includes resolutions from 1.46 Å to 3.2 Å, solvent contents from 55% to 78%, PDB-reported R-work values below 0.22 (ranging from 0.133 to 0.215), and diverse space groups including , , , , , , , , , , , , and , ensuring evaluation under varied crystallographic conditions. Diffraction datasets contain 4918 to 76,249 observed reflections, with missing low-resolution reflections ranging from 3 to 236. Detailed descriptions of all test structures are provided in Appendix A, Table A1. Solvent content for each structure was estimated using the matthews_coef program from the CCP4 suite [41], based on the Matthews coefficient [36]. The calculations incorporated unit cell parameters, space group symmetry, and protein molecular weight derived from the amino acid sequence. This estimated solvent content, along with space group information and unit cell parameters, constitutes essential prior knowledge used as constraint information during iterative phasing.
The computational framework was implemented using the Clipper C++ libraries for crystallographic computing [50] combined with MPI (Message Passing Interface) for parallel processing. Key parameters were standardized: electron density maps were discretized on 3D grids with approximately 1.0 Å spacing; Gaussian filter radii were Å (large-radius) and Å (small-radius); the refined scheme stripped 5% of grid points in the asymmetric unit from the coarse envelope; the solvent region negative feedback factor was ; continuous algorithm parameters were for CHIO, for HPR, and for MCHIO; GA population size was 100 individuals with genetic operations every 100 iterations, crossover probability 0.5, and mutation probability 0.01.
Real-space histogram matching requires a reference electron density distribution. When available, a homologous structure or AlphaFold-predicted model can serve as a reference; structure factors are computed via phenix.fmodel with bulk solvent correction and resolution-appropriate temperature factors. In computing structure factors from model coordinates, the constant term F(0,0,0) (F000) represents the mean electron density of the unit cell. We adjusted F000 such that bulk solvent regions exhibit zero mean electron density. During density modification iterations, solvent flattening constraints drive bulk solvent densities toward zero through negative feedback, and pre-setting the reference histogram with zero-centered solvent regions ensures consistency between the target distribution and algorithmic behavior.
For optimal histogram generation from simulated diffraction data, temperature factors (B-factors) must be assigned to the atomic model. However, there is no exact theoretical method to determine optimal B-factors for histogram matching applications. Several empirical approaches exist: atomic B-factors can be estimated from the Wilson B-factor [51] calculated from the experimental diffraction data of the unknown structure; alternatively, statistical analysis of deposited PDB structures can provide estimates for both atomic and bulk solvent B-factors [52]. In this work, we employ empirical relationships based on the high-resolution limit (Å). For bulk solvent regions, the B-factor is approximated as . For homologous or AI-predicted protein models, isotropic atomic B-factors are set to . These empirical relationships are applicable for typical macromolecular crystallography resolution ranges ( Å).
With appropriately configured temperature factors and F000 value, electron density histograms generated from homologous or AI-predicted models closely resemble those from experimentally determined structures. Our tests demonstrate that at identical resolution and with appropriately set parameters, electron density distribution histograms remain remarkably similar even across different protein structures, validating their use as reliable constraint conditions for protein region density in unknown structures. This observation is consistent with the conclusions of Zhang and Main’s seminal work on histogram matching [33]. For benchmarking purposes in this study, we generated reference histograms from deposited PDB coordinates using this temperature factor protocol, which does not affect our conclusions, as the histograms provide statistically representative protein density distributions.
For each test structure, envelope scheme (one-step coarse or two-step refined), phasing strategy (full-resolution, resolution-weighted, or GA-enhanced), and iterative algorithm (10 total), we executed 100 independent phase retrieval attempts from different random phase seeds. Maximum iterations per attempt were 10,000, with early termination upon convergence detection (reduction of , , and to stable plateaus). Statistics collected from all 100 attempts included success rate, minimum iteration count for first successful convergence, and median iteration count across successful cases. Calculations were performed on a Dell R740 server with 52 cores (104 threads) at 2.1 GHz. A complete evaluation for a single structure with 100 independent trials required approximately 3 h.
3. Results
3.1. Constraint Framework for Ab Initio Phasing
All direct phasing tests in this study operate within a defined constraint framework that combines experimental measurements with general physical and statistical priors, without requiring knowledge of a specific homologous atomic structure. The reciprocal-space constraint is provided by the experimentally measured structure factor amplitudes , while crystallographic symmetry (space group and unit cell parameters) is obtained directly from the diffraction experiment. The solvent content , estimated via the Matthews coefficient from the unit cell volume and protein molecular weight (derived from sequence information), serves as standard prior knowledge for envelope reconstruction. In real space, two physical constraints are enforced: solvent flattening imposes the near-constant electron density characteristic of bulk solvent regions, and histogram matching guides the protein-region density toward a statistically representative distribution. For genuinely novel structures, this reference histogram can be obtained without model bias either from known proteins [33] or from high-confidence AlphaFold predictions, which provide plausible density distributions without precise atomic coordinates. (In this benchmarking study, reference histograms were generated from deposited PDB coordinates using an empirical temperature factor protocol described in Section 2.6). The molecular envelope itself is reconstructed iteratively from the evolving density, guided by the estimated and local density continuity. This constraint set leverages measurable experimental data and fundamental physical principles, distinguishing our approach from model-dependent methods like molecular replacement that require specific atomic templates. All subsequent results are presented within this framework.
3.2. Validation Case Study: Direct Phasing of Structure 3rd5 with Continuous Algorithms
To validate the effectiveness of continuous iterative projection algorithms in addressing discontinuous density modification at molecular boundaries, we conducted a case study using protein crystal structure 3rd5, a putative uncharacterized protein from Mycobacterium paratuberculesis [47]. With estimated solvent content of 59.52%, this structure represents a challenging case at the lower end of the high-solvent-content range, falling below the 65% threshold above which traditional iterative projection algorithms typically converge, making it suitable for evaluating continuous algorithm capabilities under challenging conditions.
From the continuous iterative projection algorithms, we took HPR as an example and applied it with a full-resolution phasing strategy and two-step refined envelope reconstruction. The experiment consisted of 100 independent phase retrieval attempts, each initialized with random phases and allowed to iterate for up to 10,000 iterations. After 8000 iterations, the transition region was linearly shrunk to zero over 1500 iterations, followed by 500 iterations of solvent flattening to finalize convergence. Figure 3 presents monitoring results across multiple convergence metrics. Each trajectory in Figure 3a–e represents the temporal evolution of key quality indicators for a single independent attempt, including mean phase error, envelope intersection-over-union ratio, working R-factor, free R-factor, and solvent region density deviation. Across 96 of 100 attempts, these metrics fluctuated within high-error regimes without sustained improvement, reflecting the dimensionality and rugged topography of the solution space.
Figure 3 reveals successful convergence events, demonstrating the capability of continuous algorithms via HPR. In four attempts, distinguished from others by bold lines, all five monitored metrics underwent a dramatic transition. Mean phase error plummeted from near 90° to values indicating correct phase determination (Figure 3a). Simultaneously, envelope intersection-over-union jumped from around 0.8 to exceeding 0.9 (Figure 3b). Both working and free R-factors dropped from approximately 0.55 to lower stable values (Figure 3c,d), while solvent region density deviation decreased sharply (Figure 3e). This coordinated improvement across independent quality metrics is the signature of successful convergence to the global optimum. The scatter plot in Figure 3f provides visualization by displaying final R-factor pairs for all 100 attempts. The 96 unsuccessful attempts cluster in the high R-factor region, while the four successful cases appear as outliers in the low R-factor region.
The electron density map from these successful convergences exhibited quality for automated atomic model building. After refinement, the backbone root-mean-square deviation between the resulting model and PDB reference coordinates was only 0.10 Å, confirming high phase accuracy. This case study demonstrates that continuous iterative projection algorithms can recover atomic-resolution structural information for challenging protein crystals with limited solvent content from random phase initialization, despite the 4% success rate. The mechanistic basis lies in the transition region concept, which provides smoother and more physically realistic density modification pathways at the protein-solvent interface. By avoiding abrupt discontinuities of classical algorithms, continuous methods enable more effective navigation through multidimensional solution landscapes.
3.3. Enhancement of Phase Retrieval Performance by the Refined Envelope Scheme
The case study in Section 3.2 demonstrates that continuous iterative projection algorithms can successfully recover phases for challenging structures with limited solvent content when combined with refined envelope reconstruction. To systematically evaluate the impact of envelope quality on phase retrieval performance, we compared the traditional one-step coarse envelope versus our proposed two-step refined envelope reconstruction scheme. We conducted this evaluation using structures 3rd5 [47] and 2fg0 [48] (63.86% solvent), employing CHIO under a full-resolution phasing strategy.
Figure 4a,b provides a visual comparison of transition regions and molecular boundaries generated by the two envelope-reconstruction approaches at equivalent iteration stages for 3rd5 and 2fg0. The traditional one-step method (left panels) produces boundaries that appear smooth but are crude and oversimplified. The transition region resembles a uniform thick shell enveloping the molecular core without discrimination, overlaying portions of protein surface side chains and causing inappropriate application of solvent constraints to protein density features. In contrast, molecular envelopes from the two-step method (right panels) exhibit richer surface detail, conforming closely to true molecular geometry and capturing surface crevices, pockets, and protrusions. The transition region delineated by the refined scheme no longer manifests as a uniform shell but concentrates in spatial regions exhibiting low electron density situated on or near the molecular surface. This selective localization demonstrates identification of physical solvent spaces erroneously included within traditional coarse envelopes.
The envelope precision improvement translates into enhanced convergence behavior. Figure 4c illustrates transition region evolution for structure 3rd5 using CHIO with refined envelope scheme, comparing pre-convergence and post-convergence states. While the pre-convergence transition region remains diffuse, upon successful convergence, the transition region becomes aligned with the final molecular boundary, reflecting confidence in boundary determination. Figure 4d illustrates the progressive refinement of the protein envelope, showing the transformation from an incomplete envelope that fails to fully encompass the protein structure (with numerous atoms exposed outside the boundary) to a well-converged envelope that accurately conforms to the protein surface.
The advantage of enhanced envelope accuracy is the maximization of available constraint information within the unit cell volume. By re-identifying approximately 5% of the asymmetric unit within traditional coarse envelopes as solvent or transition region based on local density characteristics, the refined scheme provides iterative algorithms with additional reliable prior knowledge guiding convergence. For challenging crystals like 3rd5, where total bulk solvent volume is limited, this reclaimed solvent represents valuable information, strengthening the constraint framework. These enhancements demonstrate that refined envelope reconstruction, by optimizing molecular boundary determination, elevates phase retrieval capability. This represents progress toward solving challenges of phasing structures with limited solvent content, historically resistant to direct methods. As demonstrated through benchmarking in subsequent sections, performance gains prove universal, enhancing both continuous and classical iterative projection algorithms across diverse structural types and crystallographic conditions.
3.4. Systematic Benchmarking: Synergistic Effects of Algorithms, Envelopes, and Strategies
We conducted systematic benchmarking across twenty-eight diverse protein structures, comparing ten iterative projection algorithms (six classical or improved: ASR, RAAR, DM, MDM, MRAAR, HIO; four continuous: CHIO, HPR, MCHIO, THIO) under two envelope schemes (one-step coarse and two-step refined) and three phasing strategies (conventional full-resolution, resolution-weighted, genetic algorithm-enhanced). For each configuration, 100 independent phase retrieval attempts from random phase seeds were executed, recording success rate, minimum iteration count for the first convergence, and median iteration count for all successful runs. Detailed numerical results are provided in Appendix A, Table A2, Table A3, Table A4, Table A5, Table A6 and Table A7.
3.4.1. Universal Performance Enhancement from Refined Envelope Reconstruction
To contextualize the averaged performance comparisons, we first examine the success rate of two representative algorithms, the classical HIO and the continuous HPR, under the conventional full-resolution phasing strategy with one-step coarse and two-step refined envelope reconstruction schemes across all 28 test structures (Figure 5a). The results, ordered by solvent content, reveal a general upward trend but with significant variation. Structures like 1xhd [53] achieve near-perfect success, while others, such as 4qb6 [54], 3rd5 [47], 4q82 [55], 2fg0 [48], 2bke [56], and 4tpl [57], show very low or zero success rates. This preliminary view highlights the inherent difficulty of certain structures and motivates the need for both improved algorithms and enhanced strategies, the effects of which are analyzed in the following averaged results.
We conducted systematic benchmarking across the 28 diverse protein structures, comparing 10 iterative projection algorithms. Figure 5b–d demonstrates that refined envelope reconstruction delivers universal enhancement. As shown in Figure 5c, continuous algorithms improved from a 14.3% to a 20.9% success rate (45.7% relative gain), with reduced minimum and median iteration counts. For classical algorithms (excluding ASR and RAAR), the improvement was more striking: from 9.9% to 15.9% (60.5% relative gain). In Figure 5b, the proposed improved algorithms MDM, MRAAR, and MCHIO achieved 17.1%, 20.4%, and 20.5% success rates under refined envelopes, exceeding their original versions (DM, RAAR, and CHIO) by significant margins. These results establish refined envelope reconstruction as a universal performance enhancer for algorithms relying on solvent flatness constraints, while demonstrating that the proposed improved algorithms, when combined with a refined envelope, expand the available toolkit for direct phasing.
3.4.2. Algorithm Performance Differences and Selection Strategy
Figure 5b shows that continuous algorithms (CHIO, HPR, MCHIO, THIO) and improved classical variants (MDM, MRAAR) perform comparably to the established HIO algorithm when used with the two-step refined envelope scheme. Together, these algorithms form a high-performing group that surpasses classical DM, while ASR and RAAR consistently yield lower success rates.
However, examining success rates per structure in Appendix A Table A2 and Table A5 reveals that no single algorithm excels in all cases. For a given structure, different algorithms can produce widely varying success rates. If only HIO is used and its success rate is near zero for a particular structure (e.g., 3rd5), thousands of random trials may be needed to obtain a solution—a computationally demanding process. With multiple algorithms available, a more efficient strategy emerges: by testing several algorithms with a limited number of trials each (e.g., 100 random seeds per algorithm), one may identify an algorithm that performs substantially better than HIO for that structure, enabling phase determination within hundreds rather than thousands of attempts.
For practical applications to unknown structures, we recommend first testing multiple algorithms from the top-performing group—including continuous algorithms (CHIO, HPR, MCHIO, THIO), improved classical variants (MDM, MRAAR), and HIO—in combination with the two-step refined envelope scheme. If computational resources allow, the genetic algorithm co-evolution strategy should also be employed, as it enhances search efficiency through population-wide information sharing. This integrated approach exploits the benefits of improved envelope accuracy, algorithmic diversity, and collaborative optimization. If any combination succeeds, the structure is considered solved. If all tested combinations show very low success rates, increasing the number of random trials per algorithm may be attempted, though such cases likely belong to a challenging category where additional constraints or experimental data may be needed.
3.4.3. Structural Validation: From Electron Density to Atomic Models
To validate practical utility, we selected seven representative phased structures (3rd5 [47], 2fg0 [48], 1hp4 [58], 1nh6 [59], 2bke [56], 2boz [60], 4gtf [61]) for automated model building. As shown in Figure 6, rebuilt models exhibit excellent spatial alignment with PDB structures, with backbone RMSD below 0.5 Å. Local electron density maps for biologically significant regions, including bound ligands, coordinated ions, and secondary structure motifs, display clarity and spatial continuity sufficient to determine side-chain orientations and resolve structural features. This density quality validates that our direct phasing methods achieve sufficient accuracy to support atomic model construction and enable biological interpretation.
3.5. Evolution of Phasing Strategies: From Full-Resolution to Genetic Co-Evolution
We evaluated three phasing strategies: full-resolution baseline, resolution-weighted progressive, and genetic algorithm-enhanced co-evolution. Figure 7 illustrates the GA-enhanced strategy (incorporating resolution weighting) using structure 3rd5, with results compared against Figure 3. Six panels track the temporal evolution of five quality metrics (mean phase error, envelope intersection-over-union, working R-factor, free R-factor, and solvent density deviation) across 100 individuals, each starting from an independent random seed. The figure shows successful convergence unfolding in two phases: after approximately 5100 iterations, a single individual achieved convergence manifested through abrupt coordinated improvements across all metrics. Subsequently, high-quality structural information from this pioneer propagated throughout the population via genetic operations, guiding the majority of other individuals to converge within a few hundred additional iterations. Upon achieving near 100% population convergence at iteration 5500, early stopping was triggered. The transition region was then linearly reduced to zero over 1500 iterations, followed by 500 iterations of solvent flattening, with the process terminating at iteration 7500.
3.5.1. Resolution-Weighted Strategy: Modest Gains with Computational Trade-Offs
The resolution-weighted strategy prioritizes low-resolution data initially, then progressively introduces high-resolution data. Analysis of Figure 8g shows modest success rate increases of approximately 2 percentage points compared with the full-resolution baseline, regardless of envelope scheme. However, Figure 8h,i reveals increased minimum and median iteration counts. This slowdown occurs because suppressing high-resolution data during early stages, while beneficial for establishing stable low-resolution phases and accurate envelope determination, delays structural detail reconstruction.
3.5.2. Genetic Algorithm Strategy: Breakthrough Through Population Intelligence
The GA co-evolution strategy, introduced in our previous work [32], delivered breakthrough performance by transforming phase retrieval from multi-start independent search into population-based collaborative learning. While our previous study demonstrated GA effectiveness using only the HIO algorithm with traditional one-step envelope reconstruction, the current work extends this evaluation to 10 iterative algorithms, including four continuous IPAs, and examines the synergistic effects when GA is combined with the two-step refined envelope scheme. Statistical analysis (Figure 8a,g) demonstrates that the GA-enhanced strategy significantly improved average success rates compared with the conventional strategy. For coarse envelope design, success rates increased from 12.1% to 43.3%, while for refined envelope design, rates improved from 18.4% to 63.4%, representing an approximately 2.5-fold increase in success rates across both envelope designs. Figure 8d reveals that the GA-enhanced strategy universally improves success rates across all 10 algorithms. Notably, Figure 8l shows that when combining refined envelope reconstruction with the GA phasing strategy, continuous IPAs (CHIO, HPR, MCHIO, THIO) and improved classical algorithms (MDM, MRAAR) achieve success rates comparable to HIO and substantially exceeding DM. In contrast, ASR and RAAR consistently yield the lowest success rates.
Regarding convergence speed, the GA-enhanced strategy shows distinct effects on different convergence metrics (Figure 8e,f,h,i, and Appendix A, Figure A3). Minimum iteration counts remain comparable across all three strategies for both coarse and refined envelope schemes. However, median iteration counts decrease substantially under the GA-enhanced strategy, demonstrating improved overall convergence consistency across both envelope designs. Specifically, with the refined envelope design, median iterations dropped from 3681 (conventional) and 4398 (resolution-weighted) to 2112 (GA-enhanced), representing approximately a 43–52% reduction. This improvement in median convergence reflects GA’s collective optimization mechanism illustrated in Figure 7: successful pioneer individuals emerge and rapidly propagate their high-quality density features throughout the population via genetic operations. The synergy between refined envelope reconstruction and GA strategy enhances convergence reliability—refined envelopes provide accurate spatial constraints that increase pioneer emergence probability, while GA efficiently disseminates validated solutions through population-wide information sharing, thereby elevating the performance of the entire population rather than just isolated optimal cases.
3.6. Correlation Analysis: Solvent Content Dependence and Extended Applicability
Solvent content is a parameter governing phase retrieval difficulty [63,64]. We conducted correlation analysis examining relationships between solvent content and success rate across 28 structures. Figure 9 presents average success rates (across eight algorithms, excluding ASR and RAAR) versus solvent content under three phasing strategies and two envelope schemes. The 28 structures are arranged by increasing solvent content, with each represented by paired bars for coarse (blue) and refined (orange) envelopes. Results reveal that the success rate exhibits an overall increasing trend with rising solvent content, albeit with considerable variation. Within every solvent content interval, the refined envelope (orange bars) exceeds the coarse envelope (blue bars).
Figure 9 shows that several structures (4qb6 [54], 3rd5 [47], 4iqk [65], 4q82 [55], 2fg0 [48], 1nh6 [59], 2bke [56], 4tpl [57]) consistently exhibit below-average success rates regardless of strategy or envelope scheme. Detailed analysis identifies three contributing factors as illustrated in Appendix A, Figure A2 for representative cases. First, extensive surface-bound water molecules, 3rd5 (413 waters, 16% of non-hydrogen atoms), 4q82 (472, 11%), 2fg0 (428, 11%), 1nh6 (635, 13%), reduce bulk solvent available for constraints by requiring loose envelopes for encapsulation. Second, space group symmetry combined with molecular packing produces multiple equivalent origin choices with nearly identical envelopes: two proximate choices in 3rd5, 2fg0, and 4tpl; four converging choices in 4qb6, making it difficult for iterative reconstruction to select a definitive convergence pathway among these equivalent configurations. Third, severe protein-solvent interdigitation creates highly irregular envelope topologies with fragmented bulk solvent in 4iqk, 2bke, and 4tpl, which hinder accurate ab initio envelope reconstruction. Individual structures may exhibit a single factor or combinations; for instance, 3rd5 combines surface waters with origin ambiguity, 4qb6 combines four-choice origin ambiguity with surface waters (approximately 100 water molecules, 8% of atoms), while 4tpl exhibits both interdigitation and origin ambiguity.
Linear regression analysis (Figure 10) of success rates versus solvent content across six methodological combinations (three phasing strategies, two envelope designs) reveals positive correlations (coefficients ∼0.5), confirming the role of solvent constraints. Success rates decline sharply as solvent content decreases, with extrapolated trends suggesting near-zero success below 55% solvent content. Refined envelope regression lines consistently lie above coarse envelope lines across all strategies. These results demonstrate that refined envelope reconstruction combined with a GA strategy extends the applicable range to 55% solvent content by more efficiently exploiting structural constraints rather than overcoming the fundamental physical dependence. Below 55% solvent content, direct phasing remains challenging and requires integration with more constraints or complementary techniques. Analysis of convergence speed (Appendix A Figure A4 and Figure A5) reveals weaker correlations with solvent content compared with success rates. All three strategies show a slight trend toward faster convergence at higher solvent content, with refined envelopes requiring fewer iterations than coarse envelopes at equivalent solvent content.
3.7. Multi-Solution Averaging: Enhanced Precision Through Population Convergence
High GA success rates approaching 100% enable phase accuracy enhancement through multi-solution averaging. When multiple independent runs converge through distinct stochastic pathways to fixed points clustered around the true solution, their residual random errors are uncorrelated. Spatial alignment and averaging of multiple converged solutions suppress random noise, yielding a consensus solution.
Figure 11a compares mean phase error before (red bars) and after (green bars) averaging for 27 successfully solved structures (excluding 4qb6). Green bars are lower, demonstrating a reduction across diverse structures. Figure 11b quantifies improvements: averaging reduced phase error by 6.83° on average, lowering the mean from 39.06° to 32.23°. This enhancement yields electron density maps with improved signal-to-noise ratios, facilitating automated model building and identification of structural details.
Figure 11c reveals no correlation between phase error reduction magnitude and solvent content, indicating universal benefit regardless of structural difficulty. This occurs because averaging suppresses random noise from stochastic optimization aspects rather than systematic biases, providing enhancement for challenging crystals with limited solvent content. Analysis shows diminishing returns beyond approximately 20 averaged solutions (Figure 7f), indicating an optimal cost-benefit balance. Multi-solution averaging imposes minimal computational overhead while delivering a 6.83° phase error reduction, constituting a byproduct of the GA strategy.
4. Discussion
4.1. Theoretical and Practical Value of Continuous Density Modulation
Continuous iterative projection algorithms provide a more physically motivated mathematical description. Traditional algorithms impose binary update strategies at protein-solvent interfaces, creating mathematical discontinuities that contradict continuous electron density and propagate errors, hindering convergence. By introducing transition regions, continuous algorithms achieve smooth density modification, resulting in smoother optimization paths that reduce local minima entrapment. Results across 28 structures demonstrate that continuous algorithms (CHIO, HPR, MCHIO, THIO) achieve performance levels comparable to the well-established HIO algorithm under both envelope schemes (Figure 5b). The smooth density transitions at protein-solvent interfaces can provide stable convergence pathways in certain cases. Continuous iterative algorithms represent an optimization of the physical model through physically motivated mathematical descriptions that eliminate artificial discontinuities. It is important to note that algorithm performance varies across individual structures, and no single algorithm universally excels for all crystallographic scenarios.
4.2. Mechanism and Universal Significance of Refined Envelope Reconstruction
The two-step refined envelope reconstruction (Section 2.3) improved average success rates by approximately 50% (Figure 8a), demonstrating that optimizing foundational constraints can be as impactful as algorithmic innovations. Traditional one-step methods must balance complete protein enclosure against solvent misclassification. Our two-step approach uses coarse-scale smoothing to capture overall molecular shape, then fine-scale analysis to identify trapped solvent. This recovered solvent provides additional constraints for iterative refinement and benefits all algorithm types (Figure 5b). For crystals with limited solvent content where constraint information is already scarce, the ability to reclaim even 5% additional solvent volume provides crucial marginal gains that can determine success or failure.
4.3. Genetic Algorithm Strategy: Mechanism and Boundaries
The GA strategy was introduced in our previous work [32], using the HIO algorithm with traditional one-step envelope reconstruction. In the current study, we demonstrate that this enhancement extends to all 10 tested algorithms, with the combination of GA and a two-step refined envelope achieving approximately a 2.5-fold improvement (Figure 8g). While the GA mechanism itself remains unchanged, the current work reveals important synergistic effects: the refined envelope scheme increases the probability of pioneer emergence by providing more accurate constraints, enabling GA to achieve higher absolute success rates and faster convergence than reported previously. However, important boundaries exist. Regression analysis (Figure 10c,f,i) reveals that when solvent content falls below approximately 55%, even GA success rates approach zero, confirming that GA uncovers existing success probability within solution space but does not create new constraints. For constraint-insufficient problems, any search strategy fails. The primary trade-off of the GA strategy is increased computational cost. However, when GA succeeds, it typically achieves convergence faster (about a 40% reduction in median iterations, Figure 8i) compared with a conventional phasing strategy, partially offsetting the computational overhead. GA should thus be viewed as the optimal tool for excavating maximum success potential under given constraints, particularly valuable when data quality permits but traditional methods show marginal success rates.
4.4. Factors Influencing Phase Recovery Success
Our previous testing experience and the systematic analysis of 28 diverse protein crystals in this work reveal that the success rate is influenced by several primary factors including data quality and structural distribution characteristics. Building on the algorithmic selection guidelines provided in Section 3.4.2, we now analyze these underlying factors and discuss comprehensive strategies for challenging cases.
4.4.1. Data Quality
The quality of diffraction data, encompassing both measurement accuracy and completeness, fundamentally influences phase recovery success. Measurement accuracy, reflected in the PDB-reported R-work values, appears more critical than completeness for the methods presented here. Structures in our test set with R-work below 0.22 generally yielded favorable phasing outcomes, whereas those with substantially higher R-values presented greater challenges due to reduced signal-to-noise ratio, making ab initio envelope reconstruction and phase solution increasingly difficult; addressing such high-error cases remains a direction for future development.
Regarding data completeness, particularly at low resolution, missing reflections (detailed in Appendix A, Table A1) can affect the initial envelope reconstruction. While extensive missing low-resolution data (e.g., 236 reflections for 2boz) could hinder this initial stage, our dynamic data filling strategy (Section 2.5.2) enables the iterative algorithm to compensate progressively using the observed medium- and high-resolution data. Crucially, the algorithm’s ability to reconstruct accurate phases does not solely depend on low-resolution completeness, as evidenced by structures like 4qb6, which has very few missing low-resolution reflections (only 4) yet remains challenging to phase. This indicates that for such cases, the primary obstacles arise from solvent content and structural distribution factors rather than data incompleteness.
4.4.2. Structural Distribution Within the Unit Cell
Beyond molecular shape, the distribution pattern of molecules within the crystallographic unit cell directly determines envelope reconstruction difficulty and consequently affects success rates. When protein surfaces exhibit complex interdigitation with bulk solvent regions (Appendix A Figure A2a, structure 2bke), ab initio envelope reconstruction struggles to delineate clear boundaries between protein and solvent domains. The irregular topology creates ambiguity in envelope determination, hindering convergence.
Crystallographic space groups permit multiple equivalent origin choices related by allowed origin translations. Non-centrosymmetric space groups introduce additional ambiguity through enantiomorphic structures, spatially inverted configurations that produce identical diffraction amplitudes. For instance, space group presents 8 equivalent origin choices from crystallographic symmetry plus 8 from spatial inversion, totaling 16 equivalent origin selections. Iterative algorithms and physical constraints cannot distinguish among these mathematically equivalent origins. Reconstructed envelopes randomly select origins with equal probability. Whether different origin choices yield similar envelopes depends not only on space group symmetry but also on the specific molecular packing within the unit cell. When this occurs, such as when envelopes corresponding to two different origin choices exhibit spatial proximity (Appendix A Figure A2b–d, structure 3rd5) or when four origin choices converge to quite similar envelopes (Appendix A Figure A2f–j, structure 4qb6), the reconstruction process oscillates among equivalent configurations, failing to converge to a definitive solution.
Structures containing numerous fixed water molecules integrated into the molecular surface (Appendix A Figure A2e, structure 3rd5) require loose envelopes to encompass these structured solvent shells. This necessity encroaches upon limited bulk solvent volumes, reducing the effectiveness of solvent flatness constraints. Moreover, a subtle pathology occasionally emerges where envelope reconstruction succeeds, but protein region electron density adopts inverted signs, phases differ by 180°, and diffraction amplitudes remain unchanged. Although histogram matching constraints with positive skewness typically correct this inversion, convergence sometimes stalls at intermediate error states between correct and inverted solutions. The genetic algorithm strategy, with its population-based search and information sharing, is particularly effective at escaping such metastable states and driving the system toward the correct global minimum. The most challenging structures typically combine limited solvent content, measurement errors (high R-factors), and one or more geometric complications, creating compounding difficulties that substantially reduce success rates.
In summary, phase recovery success is governed by the interplay of data quality (measurement accuracy and completeness) and structural characteristics (solvent content, molecular packing complexity, and origin ambiguity). The two-step refined envelope scheme helps mitigate boundary-related issues, while the GA strategy is particularly effective at escaping metastable states arising from these structural complications.
4.5. Practical Considerations and Future Directions
The performance of improved algorithms (MCHIO, MDM, MRAAR) highlights importance of domain-specific adaptation. MCHIO corrects excessive CHIO transition region feedback; MDM and MRAAR explicitly incorporate operators addressing insufficient protein region constraints, elevating performance to levels comparable with HIO. Different algorithms perform optimally under different conditions in Figure 8j–l, indicating that no single universal algorithm exists. Therefore, our multi-level synergistic framework becomes necessary: refined envelope reconstruction provides high-quality constraints; a diverse algorithm toolbox enriches algorithm selection; resolution weighting and GA offer strategic optimization. This transforms direct methods from experience-dependent art into a systematic, reproducible methodological pipeline.
For particularly challenging cases, integrated strategies may be needed. When available, homologous structures or AlphaFold predictions can guide envelope definition before applying direct methods for unbiased phase retrieval. For crystals with medium and low solvent content or complex packing, experimental phasing via anomalous diffraction can provide complementary phase information. When success rates are low but non-zero, increasing computational resources through more trials or larger genetic algorithm populations may eventually yield solutions.
Looking forward, integration of AI-predicted structural models offers promising avenues for further constraint enhancement. Beyond reference histograms, high-confidence prediction regions (pLDDT scores > 90) can serve as partial structural constraints, reducing the phase problem to determining only the uncertain fragments. This hybrid approach may reduce solvent content requirements, extending applicability into the medium- and low-solvent-content range. Combined with the multi-algorithm framework established in this work, such AI-augmented strategies represent a natural evolution toward widely applicable structure determination.
5. Conclusions
This study addresses two challenges in direct-method phase retrieval for protein crystals, discontinuous density modification and crude molecular envelope reconstruction, through a systematic approach. We introduced continuous iterative projection algorithms into this domain, validating their value in enhancing convergence stability by implementing smooth density transitions at the molecular interface. We developed improved algorithm variants, including MCHIO, MDM, and MRAAR, that optimize constraint enforcement, achieving performance levels comparable to or exceeding established methods (Figure 5b). Moreover, the proposed two-step refined envelope reconstruction scheme serves as a universal enhancement, elevating average phase retrieval success rates (Figure 8a). This demonstrates that optimizing foundational constraints can be as impactful as algorithmic innovations. While continuous algorithms demonstrate competitive performance comparable to HIO, algorithm effectiveness varies across individual structures, and no single method universally excels for all cases. For practical structure determination, we recommend testing multiple algorithms from both continuous (CHIO, HPR, MCHIO, THIO) and classical (HIO, MDM, MRAAR) categories. At the strategy level, the resolution-weighted strategy provided limited improvement, while the genetic algorithm co-evolution strategy delivered breakthrough performance that enabled multi-solution averaging, reducing mean phase error by approximately 6.83° (Figure 11a,b). This precision gain was achieved universally across different solvent content levels, consistently improving model quality.
Our methods shift the success rate versus solvent content curve upward, extending the applicability of direct methods within the high-solvent-content range to its lower boundary around 55% (Figure 10). Although method efficacy remains constrained by physical limits, including low solvent contents (approaching or below 55%), insufficient data quality (large R-work values due to measurement errors or poor crystal quality), and complex structural arrangements within the unit cell, continued development is needed. Nevertheless, the framework constructed in this study provides a powerful and systematic solution for the unbiased structure determination of challenging systems in structural biology. This framework synergistically combines refined envelope reconstruction, continuous and improved iterative projection algorithms whose performance is comparable to or exceeds established methods, and genetic algorithm strategies. The compiled algorithms developed in this work are accessible on GitHub [66].
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Abramson J. Adler J. Dunger J. Evans R. Green T. Pritzel A. Ronneberger O. Willmore L. Ballard A.J. Bambrick J. Accurate structure prediction of biomolecular interactions with Alpha Fold 3Nature 202463049350010.1038/s 41586-024-07487-w 38718835 PMC 11168924 · doi ↗ · pubmed ↗
- 2Terwilliger T.C. Afonine P.V. Liebschner D. Croll T.I. Mc Coy A.J. Oeffner R.D. Williams C.J. Poon B.K. Richardson J.S. Richardson J.S. Accelerating crystal structure determination with iterative Alpha Fold prediction Acta Cryst. D 20237923424410.1107/S 205979832300102 XPMC 998680136876433 · doi ↗ · pubmed ↗
- 3Li Z. Fan H. Ding W. Solving protein structures by combining structure prediction, molecular replacement and direct-methods-aided model completion IU Cr J 20241115216710.1107/S 205225252301029138214490 PMC 10916285 · doi ↗ · pubmed ↗
- 4Sayre D. The squaring method: A new method for phase determination Acta Cryst.19525606510.1107/S 0365110 X 52000137 · doi ↗
- 5Cochran W.T. Relations between the phases of structure factors Acta Cryst.1955847347810.1107/S 0365110 X 55001485 · doi ↗
- 6Karle J. Hauptman H. A theory of phase determination for the four types of non-centrosymmetric space groups 1P 222, 2P 22, 3P 12, 3P 22Acta Cryst.1956963565110.1107/S 0365110 X 56001741 · doi ↗
- 7Schenk H. An Introduction to Direct Methods: The Most Important Phase Relationships and Their Application in Solving the Phase Problem University College Cardiff Press Cardiff, UK 1984
- 8Miller R. De Titta G.T. Jones R. Langs D.A. Weeks C.M. Hauptman H.A. On the application of the minimal principle to solve unknown structures Science 19932591430143310.1126/science.84516398451639 · doi ↗ · pubmed ↗
