GTpick: A deep neural network for Cryo-EM particle detection

Shenhuan Ni; Chenghui Yang; Yutao Liu; Yuncong Zhang; Yiyan Shi; Anthony Qian; Ren Kong; Shan Chang

PMC · DOI:10.1016/j.csbj.2025.10.029·October 16, 2025

GTpick: A deep neural network for Cryo-EM particle detection

Shenhuan Ni, Chenghui Yang, Yutao Liu, Yuncong Zhang, Yiyan Shi, Anthony Qian, Ren Kong, Shan Chang

PDF

Open Access

TL;DR

GTpick is a new deep learning model for detecting particles in Cryo-EM images, improving accuracy and 3D reconstruction quality.

Contribution

GTpick introduces cross-attention and Focal Loss to enhance particle detection in Cryo-EM images.

Findings

01

GTpick outperforms existing methods in Recall and F1 scores for Cryo-EM particle detection.

02

GTpick improves the resolution of 3D density maps reconstructed from detected particles.

03

GTpick shows limitations in performance on very large datasets.

Abstract

Accurate identification of protein particles in cryo-electron microscopy (Cryo-EM) images is crucial for achieving high-resolution three-dimensional (3D) structural reconstruction. However, this task faces multiple challenges, including low signal-to-noise ratios, densely distributed particles, and class imbalance. To address these issues, this study proposes a target detection algorithm named GTpick, built upon the DETR framework. GTpick introduces a cross-attention mechanism to enhance the interaction between target queries and specific image features. In addition, a grouped one-to-many label assignment strategy is employed to improve recall in densely populated regions, and a Focal Loss function is incorporated to mitigate the adverse effects of background noise and class imbalance on detection accuracy. Experiments on large-scale Cryo-EM datasets demonstrate that GTpick outperforms…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Chemicals1

Cryo

Figures7

Click any figure to enlarge with its caption.

Overall architecture of the GTpick model. This model first uses ResNet101 to extract image features, then applies context modeling through a transformer encoder. The decoder uses self-attention and cross-attention to output class predictions and bounding box coordinates.Fig. 1

Structure of Cross-Attention Model. Here attention scores are computed as dot product operations between Q and K. They are then normalized and applied to V to produce the weighted output representation.Fig. 2

Bounding Box Loss and GIoU Loss curves of the model during training. In both the bounding box and GIoU panels, the loss values decrease continuously and converge after approximately 150 training epochs, indicating that the model has entered a stable optimization phase.Fig. 3

Visual comparison of particle detection results obtained by CryoTransformer, crYOLO, GTpick (ours), and Topaz on representative micrographs from EMPIAR-10081, EMPIAR-10345, and EMPIAR-10532.Fig. 4

Boxplots of 3D reconstruction resolutions obtained from three independent experiments on four Cryo-EM datasets using different particle detection methods: (a) CryoTransformer, (b) Topaz, (c) crYOLO, and (d) GTpick. Lower resolution values (Å) indicate better reconstruction quality. GTpick exhibits smaller variance and lower median values, demonstrating higher stability and superior reconstruction performance.Fig. 5

Local-resolution visualization of 3D reconstructions obtained from representative datasets using CryoTransformer, crYOLO, GTpick, and Topaz. Color bars indicate the local-resolution range from 2 Å (red, higher resolution) to 16 Å (blue, lower resolution). Compared with other particle-picking methods, GTpick yields more extensive high-resolution (red) regions and smoother density continuity, indicating improved structural fidelity and reconstruction consistency.Fig. 6

Keywords

Cryo-EMParticle detectionCryoSPARC3D structure reconstruction

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdvanced Electron Microscopy Techniques and Applications · Electron and X-Ray Spectroscopy Techniques · Computational Physics and Python Applications

Full text

Introduction

1

Cryo-electron microscopy is an important technique for studying the three-dimensional structures of biological macromolecules such as proteins. In recent years, it has made significant progress in resolving difficult-to-crystallize samples, including membrane proteins and complex protein assemblies. Cryo-EM overcomes the stringent requirements of crystal preparation required for conventional X-ray crystallography by rapidly cryo-fixing biological samples in a close-to-physiological state. Its standard analytical processes include detection, alignment, classification, averaging, inverse projection of target particles, and ultimately 3D reconstruction of the protein structure [1]. Given this workflow, the quality of particle detection is a major factor that directly determines the accuracy and resolution of the final, reconstructed structure.

However, current particle detection methods still face significant challenges, especially for images with low signal-to-noise ratios, dense particle distributions, and/or severe levels of background interference. Traditional computer vision methods struggle to extract high-level features effectively and have limited detection capability [2], [3], [4]. Existing approaches can be categorized into two types: template matching and machine learning. Template matching relies on predefined reference templates to compare with candidate image regions, but this approach is limited if particle morphology varies greatly or when the background is complex; moreover, this procedure is highly susceptible to ice contamination and background noise.

Machine learning-based methods have substantially improved the automation of particle detection, although important limitations persist. For example, the APPLE picker [5] trains classifiers using positive and negative samples, but it has only been validated on a limited dataset, and its generalization performance remains uncertain. DeepPicker [6] adopts the VGG-Net [7] structure to frame the detection task as image classification; however, its use of sliding-window cropping limits efficiency. FastParticlePicker [8] uses a Fast R-CNN framework to enhance detection speed but remains sensitive to background noise. DeepEM [9], which uses the AlexNet architecture [10] for iterative training, suffers from instability when data are limited. crYOLO [11] leverages the YOLO [12] target detection framework to achieve high-precision automated detection but lacks generalization when faced with significant variation in particle size and shape. DeepConsensus [13] aggregates the results of multiple methods to build a high-quality dataset, albeit with high computational complexity. Topaz [14] uses a positive-unlabeled learning framework requiring only a small number of labeled samples, but under default settings, it tends to select too many overlapping particles and false positives, often requiring manual threshold adjustment.

In recent years, the Transformer framework has been introduced into Cryo-EM particle detection. CryoTransformer [15] applies self-attention mechanisms to capture long-range dependencies between particles, compensating for the limitations of traditional convolutional networks in global feature extraction. However, its robustness is insufficient in noisy images, and its one-to-one label matching strategy limits performance in densely populated regions.

To address these challenges, this paper proposes an improved Transformer-based detection algorithm named GTpick. First, a cross-attention mechanism is added to the Transformer decoder to enhance spatial interaction between target queries and image features. Second, a grouped one-to-many label assignment strategy [16] is introduced to improve recall in densely populated regions. Third, Focal Loss [17] is used for classification to reduce the negative effects of class imbalance between background and target categories, helping the model focus more on difficult-to-recognize particles. Experimental results show that, compared with existing methods, GTpick significantly improves particle detection performance while maintaining a lightweight architecture. Moreover, 3D reconstruction using CryoSPARC software verifies that the particle coordinates generated by GTpick lead to higher resolution reconstructed structures.

Materials and methods

2

This study develops an end-to-end object detection network based on the DETR framework [18]. This approach consists of three primary modules: (1) a feature extraction module, (2) a transformer-based feature modeling module, and (3) a label matching and loss calculation module. The overall network architecture is illustrated in Fig. 1 To address the challenges posed by densely distributed targets and the difficulty of distinguishing targets from background in Cryo-EM particle detection tasks, this study introduces a series of customized improvements to the traditional Transformer decoder. Specifically, we incorporated a dynamic reference point mechanism to enhance positional awareness, as well as adaptive positional encoding [19] and cross-attention mechanisms to strengthen spatial information representation. In addition, a group-wise one-to-many assignment method is implemented in the label matching process, enabling more accurate detection of densely packed targets by improving recall. In the loss function design, the classification branch adopts Focal Loss to mitigate the impact of class imbalance, while the bounding box regression branch combined L1 Loss and GIoU Loss [20] to enhance localization accuracy.Fig. 1. Overall architecture of the GTpick model. This model first uses ResNet101 to extract image features, then applies context modeling through a transformer encoder. The decoder uses self-attention and cross-attention to output class predictions and bounding box coordinates.Fig. 1

Backbone network

2.1

In cryo-electron microscopy (Cryo-EM) images, target structures often exhibit blurred boundaries, if the background contains complex noise. Therefore, the backbone network must possess strong representational capabilities and robust feature extraction. To meet these requirements, this study adopts ResNet-101 as the primary feature extraction module and incorporates several structural optimization strategies to better integrate with the subsequent Transformer encoder.

We constructed the backbone network using a standard ResNet-101 architecture [21], we initialize the backbone network with pretrained weights, which accelerates model convergence under limited training samples and prevents the loss of details in Cryo-EM images. In addition, to address the instability of feature distribution caused by BatchNorm under small-batch training conditions [22], this study replaces standard BatchNorm with FrozenBatchNorm2d to ensure consistent feature distribution across various training settings.

Transformer module

2.2

When used to detect protein particles, Cryo-EM images typically suffer from low signal-to-noise ratios, blurred target boundaries, and issues related to the dense packing of individual proteins. Traditional convolutional neural networks (CNNs) are limited by their local receptive fields and can therefore struggle to model long-range context. These limitations manifest as false positives and missed detection opportunities. To address this problem, we introduce an improved Transformer architecture that performs global context modeling and target feature interactions. Our overall design follows the DETR framework and consists of an encoder and a decoder.

The original feature map output from the backbone network is denoted as $[eqn]$ . To input this map into the Transformer encoder, it must first be flattened into a sequence format as follows: $[eqn]$ where $[eqn]$ . To enhance the model’s perception of spatial positional information, we introduce a position encoding $[eqn]$ based on trigonometric functions. This positional encoding is then added to the input feature map, which then forms the final input of the Transformer encoder: $[eqn]$ .

The primary function of the encoder module is to perform high-level contextual modeling on the feature set extracted by the backbone network, thereby enhancing the contextual understanding of the model. The encoder module itself consists of multi-head attention and feed-forward networks, accompanied by residual connections and layer normalization.

The decoder is composed of N stacked layers, each consisting of three sub-modules: self-attention, cross-attention, and a feed-forward network. The input to the decoder is the query vector q generated by the previous layer. Initially, the query vector is linearly projected to generate the query and key vectors, which are used to capture the relationships between different target instances. This is as follows: $[eqn]$ , $[eqn]$ and its output is processed using residual connections and layer normalization:

[eqn]

The core innovation of the decoder is to compute cross-attention between the query vectors and the encoder output features as $[eqn]$ , while simultaneously introducing an explicit reference point positional encoding vector $[eqn]$ to enhance spatial localization. The resulting cross-attention is calculated as follows:

[eqn]

This output is then further processed through residual connections and layer normalization by:

[eqn]

A conceptual diagram of the cross-attention module is shown in Fig. 2 By integrating content features with positional features, the cross-attention mechanism effectively enhances the model’s ability to distinguish between background and densely packed targets.Fig. 2. Structure of Cross-Attention Model. Here attention scores are computed as dot product operations between Q and K. They are then normalized and applied to V to produce the weighted output representation.Fig. 2

In standard DETR, the decoder employs a fixed number of query vectors, and each target instance can be matched to at most one query (i.e., “one-to-one matching”). In densely populated scenes, however, this may be insufficient and may lead to false negatives. To overcome this limitation, we introduce a group-wise query mechanism (group-wise matching), which explicitly partitions the query vectors into G sub-groups, each containing $[eqn]$ sub-queries, where: $[eqn]$ .

During training, each group query independently participates in cross-attention calculations and interacts with target labels. For example, the cross-attention calculation process for the $[eqn]$ group query is defined by: $[eqn]$ .

[eqn]

The output of the group query is then concatenated across calculation stages and combined for label matching and supervision optimization using the Hungarian Matching algorithm [23]. In general, this group-wise mechanism is designed to effectively enhance the detection of small particles and densely populated regions, and is useful for increasing local target coverage and enhancing the detection of protein particles captured in Cryo-EM images.

Loss function design and optimization

2.3

To facilitate end-to-end particle detection training using Cryo-EM images, this study designed a composite objective function that comprehensively considers target classification, bounding box regression, and overlap quality as three separate error components.

Next, we addressed the severe class imbalance problem in Cryo-EM images. To do so, we adopted Focal Loss [17] as the primary classification loss function; this was formalized as follows:

[eqn]

Here, $[eqn]$ represents the predicted probability of the i target belonging to the correct class, $[eqn]$ is the class balancing factor. This loss effectively reduces the impact of easily classified negative samples, and thereby enhances the model’s focus on challenging classes.

For the bounding box regression, an L1 loss is employed to measure the positional difference between the predicted and ground truth coordinates:

[eqn]

Here, $[eqn]$ represents the center coordinates and scale (normalized) of the predicted bounding box, while $[eqn]$ denotes the corresponding ground truth box. Here we adopt GIoU Loss as a measure of the alignment between the predicted and ground truth boxes. Minimizing this value can effectively mitigate potential optimization difficulties caused by incomplete overlap. The GIoU Loss is defined as follows:

[eqn]

Here, GIoU measures the overlap between the predicted and ground truth boxes which accounting for the minimum bounding region that encompasses both; the result is a more stable optimization target than can be obtained by calculating the traditional IoU (Intersection over Union) [12].

Finally, the total loss function is defined as a weighted sum of these three components:

[eqn]

This design achieves a unified optimization of target classification and localization quality, ensuring the model’s stable and high-precision performance in detecting densely packed small particles.

Post-processing and three-dimensional reconstruction

2.4

A.star file [24] is generated after completing training of the GTpick model. This is a standard format for storing metadata in Cryo-EM. In general, this file records key parameters for each particle found in the micrographs, including coordinates, rotation angles, and automated scoring. These data can be directly imported into CryoSPARC [25] to facilitate subsequent analyses of protein structure.

To do so we first import all Cryo-EM micrographs into CryoSPARC. Next, Contrast Transfer Function (CTF) estimation is performed to assess image quality and apply aberration correction. For example, CTF correction compensates for physical distortions that occur during imaging by enhancing the signal-to-noise ratio. This provides more accurate inputs for subsequent particle extraction and structure reconstruction steps.

Next, CryoSPARC extracts particles based on coordinate information found in the.star file, records the total number of particles, and automatically selects high-quality particles for further analyses. Following particle extraction, individual particle images are cropped from the micrographs, forming an initial particle dataset. This dataset then undergoes two-dimensional (2D) classification to improve data quality; this step removes abnormal or blurry particles and reduces noise interference in structure analysis. After 2D classification, high-quality particles with consistent morphology are selected for three-dimensional (3D) initial reconstruction; this process generates a low-resolution density map. The resolution of the reconstructed structure is then progressively enhanced via homogeneous refinement, heterogeneous refinement, and further local refinement steps.

Next, we calculated the Gold Standard Fourier Shell Correlation (GSFSC) to evaluate the reliability of the final reconstructed structure. This provides a quantitative measure of the resolution of the density map. GSFSC effectively assesses the structural signal within the reconstructed map, offering a confidence evaluation of the results.

Finally, we imported the optimized 3D density map into visualization software (e.g., UCSF Chimera [26]) to generate a clear and accurate three-dimensional protein structure model. This model can then be used to support subsequent structural analyses and studies of biological function.

Dataset and training setup and performance evaluation

3

Dataset

3.1

In this study, the CryoPPP [27], [28] public dataset is used for model training and evaluation. It is characterized by its large sample size, diverse particle types, and high-quality annotations. To construct the training process, the data are divided into training, validation, and test sets in a ratio of 8:1:1, and the ground truth particle coordinates provided in the dataset are used as supervisory signals to ensure annotation accuracy and consistency. All labeled information is uniformly organized in COCO format, with target category, bounding box location, and image attributes recorded in structured JSON files. This approach standardizes the data processing workflow, improves compatibility with mainstream detection frameworks, and provides a uniform and reliable foundation for model training and evaluation [15].

Experimental configuration and model training

3.2

The experimental environment in this study is based on a CENTOS system, equipped with an NVIDIA RTX 4090 graphics card (24 GB of video memory). Model construction and training are implemented using the PyTorch 1.7.1 framework and CUDA 10.2. The training parameters are set as follows: the input size is 640 × 640; the AdamW optimizer [29] is used; the initial learning rate is set to 10⁻⁴; the learning rate of the backbone network is set to 10⁻⁵; and the weight decay parameter is set to 10⁻⁴. All network weights are randomly initialized using the Xavier initialization method [30]. In the attention mechanism and feed-forward network, a dropout of 0.1 is added before LayerNorm to enhance the model’s generalization ability. During training, the batch size is set to 2. Each training round takes approximately 10 min, and the total training time is 63 h. Fig. 3 shows the change curve of Bounding Box Loss and GIoU Loss during training.Fig. 3. Bounding Box Loss and GIoU Loss curves of the model during training. In both the bounding box and GIoU panels, the loss values decrease continuously and converge after approximately 150 training epochs, indicating that the model has entered a stable optimization phase.Fig. 3

As shown in Fig. 3, the overall trend of the two loss values is steadily decreasing, indicating that the model continues to achieve effective optimization in the target localization task. With the introduction of a regularization mechanism, the loss values decrease significantly and converge after approximately 150 rounds, indicating that the model has entered a stable training phase and is adequately learning target box positions.

Evaluation metrics

3.3

In evaluating particle selection algorithms in cryo-electron microscopy (Cryo-EM) images, two methods are commonly used.

The first method uses manually selected particles as a reference benchmark and employs precision, recall, and F1-score as evaluation metrics.

Another method involves importing particles selected by different algorithms into CryoSPARC software for 3D reconstruction. It then involves comparing the final resolution of the resulting 3D density maps. Using this approach, higher resolutions indicate better particle quality, more accurate localization, and finer structural reconstruction, thus indirectly reflecting the overall quality of the detection algorithm.

Using these two evaluation approaches, we next systematically compared and analyzed the performance of various Cryo-EM particle detection algorithms to verify the effectiveness and advantages of the proposed GTpick method regarding particle detection and 3D reconstruction.

Method validation and performance evaluation

4

Ablation experiments

4.1

To evaluate the contribution of each improvement proposed in this study to the performance of Cryo-EM particle detection, ablation experiments were performed on the EMPIAR-10017 dataset. This dataset, which contains β-galactosidase micrographs with moderate image size, a sufficient number of micrographs, and well-defined particle features, provides an ideal balance between computational tractability and biological complexity. These properties make EMPIAR-10017 a representative and efficient benchmark for evaluating model behavior under realistic conditions.

Four configurations were compared under identical training settings to ensure fairness and comparability. As shown in Table 1, Improvement 1, which integrates Cross-Attention and One-to-Many label assignment, achieved a Precision of 0.621, Recall of 0.434, and F1 Score of 0.511. Improvement 2, which replaces the loss function with Focal Loss while retaining Cross-Attention, alleviated class imbalance and background noise, yielding 0.618 Precision, 0.443 Recall, and 0.516 F1 Score. Improvement 3 combined One-to-Many assignment with Focal Loss, resulting in 0.596 Precision, 0.448 Recall, and 0.511 F1 Score. Improvement 4, which integrates all three strategies (Cross-Attention, One-to-Many, and Focal Loss), achieved the best performance (Precision = 0.643, Recall = 0.450, F1 = 0.529). These results demonstrate that the combined use of these modules synergistically enhances the accuracy and robustness of true particle detection.Table 1. Results of Ablation study.Table 1. ModelCross-AttentionOne-to-ManyFocal LossPrecisionRecallF1 ScoreImprovement 1√√×0.6210.4340.511Improvement 2√×√0.6180.4430.516Improvement 3×√√0.5960.4480.511Improvement 4√√√0.6430.4500.529

Comparison experimental

4.2

Comparative evaluation across multiple Cryo-EM datasets

4.2.1

In this study, we selected six representative cryo-electron microscopy (Cryo-EM) datasets, namely EMPIAR-10081, EMPIAR-11056, EMPIAR-10345, EMPIAR-10532, EMPIAR-10093, and EMPIAR-10947, all of which were obtained from the CryoPPP database. These datasets were used to systematically evaluate the performance of GTpick and other mainstream particle detection algorithms in terms of precision, recall, and F1-score. During the evaluation, the particle coordinates detected by each algorithm were matched against the manually annotated ground truth provided in the CryoPPP dataset to assess the coverage capability of each method in identifying protein particles in Cryo-EM images. Detailed information on the selected datasets is provided in Table 2.Table 2. Detailed information of the five representative Cryo-EM datasets used for algorithm evaluation.Table 2EMPAIR IDProtein TypeSize (TB)Image sizeParticle Diameter (px)Total Structure Weight (kDa)10081Transport Protein0.052(3710, 3838)154298.5710093Membrane Protein0.097(3838, 3710)172779.410532Viral Protein0.196(4096, 4096)174191.7611056Transport Protein0.164(5760, 4092)16488.9410345Signaling Protein0.085(3838, 3710)149244.6810947Viral Protein0.048(4096, 4096)240443.92

To ensure a fair comparison, the CryoTransformer model was retrained on the same CryoPPP dataset using identical training parameters, preprocessing procedures, and hyperparameter settings as GTpick. In addition, both Topaz and crYOLO were fine-tuned (instead of directly using the official pretrained models) to better adapt to the CryoPPP dataset. The particle coordinates detected by each algorithm were subsequently used for downstream 2D classification and 3D reconstruction. Table 3 presents a systematic comparison of the detection performance of GTpick, CryoTransformer, Topaz, and crYOLO across the six EMPIAR test datasets.Table 3. Comparison of particle detection performance of different methods on five EMPIAR datasets.Table 3EMPIAR IDParticles In ground TruthRecallPrecisionF1 ScoreCryoTrans-formerTopazcrYOLOGTpickCryoTrans-formerTopazcrYOLOGTpickCryoTrans-formerTopazcrYOLOGTpick10081393520.6730.6730.5770.6810.3860.4070.5060.4490.4910.5070.5390.54110093563940.6070.6030.4300.6090.3060.4860.5270.3810.4070.5380.4730.46810532879330.5290.5070.3560.5350.4340.5520.6730.5200.4770.5260.4650.527110561259080.4270.3640.3930.4400.4600.5510.5760.5600.4430.4390.4670.49310345158940.8420.8110.5440.8500.1760.3570.5250.2070.2910.4960.5350.332109471063930.4410.1860.1470.4600.2910.7050.7500.2880.3500.2940.2460.354Average0.5870.5240.4080.5960.3420.5100.5930.4010.4100.4670.4540.453Note: GTpick parameters: 62.34 M; CryoTransformer parameters: 75.94 M.

Across the six test datasets, GTpick consistently demonstrates superior recall performance, particularly achieving the highest values on challenging datasets such as EMPIAR-10081, EMPIAR-10532, EMPIAR-11056, and EMPIAR-10345, which feature complex backgrounds or densely distributed particles. Even on the large-scale EMPIAR-10947 dataset with a greater number of micrographs, GTpick maintains stable performance, highlighting its strong robustness and cross-dataset generalization capability.

As shown in Table 3, GTpick outperforms CryoTransformer, crYOLO, and Topaz in terms of recall, while also achieving competitive precision, resulting in overall higher F1-scores. In addition, GTpick has a model size of 62.34 million parameters, approximately 18 % fewer than CryoTransformer’s 75.94 million, demonstrating significantly reduced computational resource consumption without compromising detection accuracy.

The lower precision observed on the EMPIAR-10345 and EMPIAR-10947 datasets is mainly attributed to the intrinsic characteristics of these datasets.

Specifically, EMPIAR-10345 corresponds to a signaling protein complex, which features relatively large particle diameters (approximately 450 px) and a high particle density per micrograph. Such dense distributions often cause particle overlap or partial merging, increasing the likelihood of misclassification in boundary regions. This leads to a higher number of false positives and consequently reduces precision. In contrast, the EMPIAR-10947 dataset contains viral protein particles with smaller molecular weight (approximately 443 kDa) and lower image contrast. The low signal-to-noise ratio (SNR) and irregular particle morphology result in blurred particle boundaries, which tend to cause over-detection and thus lower precision. Nevertheless, GTpick maintains a relatively high recall on both datasets, indicating that the model remains highly sensitive to true particle regions.

To qualitatively assess the performance differences among particle detection algorithms, Fig. 4 presents the particle picking results of GTpick, CryoTransformer, crYOLO, and Topaz on representative micrographs from EMPIAR-10081, EMPIAR-10345, and EMPIAR-10532. Each colored overlay indicates the particle locations detected by the corresponding method: CryoTransformer (yellow), crYOLO (red), GTpick (yellow), and Topaz (green).Fig. 4. Visual comparison of particle detection results obtained by CryoTransformer, crYOLO, GTpick (ours), and Topaz on representative micrographs from EMPIAR-10081, EMPIAR-10345, and EMPIAR-10532.Fig. 4

As shown in Fig. 4, CryoTransformer performs well on sparsely distributed particles but tends to merge adjacent particles or miss smaller ones in high-density regions. crYOLO exhibits high sensitivity but produces a large number of false positives, often misidentifying background noise as particles, particularly in EMPIAR-10532 with low contrast and variable particle size. Topaz, conversely, demonstrates a conservative detection behavior, frequently missing closely packed particles. In contrast, GTpick achieves a more accurate estimation of the true number of particles in each micrograph. This is particularly important for downstream 3D reconstruction, where the number and quality of picked particles directly determine the achievable resolution.

Comparisons of 3D density map reconstruction and resolution

4.2.2

Next, we comprehensively evaluated GTpick’s performance in Cryo-EM particle detection from a 3D reconstruction perspective. To achieve this, five representative datasets—EMPIAR-10081, EMPIAR-10345, EMPIAR-10532, EMPIAR-10093, and EMPIAR-10947—were selected for systematic evaluation. The particles detected by each method were imported into the CryoSPARC platform for 3D density reconstruction. Reconstruction proceeded after 2D classification using the Select 2D module, which removed low-quality or abnormal particles to enhance the consistency of the input particle set and improve the accuracy of the final structure.

To ensure statistical reliability and experimental stability, three independent reconstruction experiments were conducted for each dataset and each particle detection method. All experiments followed identical preprocessing parameters and reconstruction workflows to maintain fairness and reproducibility.

The resolution distributions obtained from the three repeated experiments are illustrated in Fig. 5 as boxplots, enabling an intuitive comparison of reconstruction stability and overall performance among different methods. The smaller spread and lower median values observed in GTpick indicate its higher stability and accuracy in 3D reconstruction.Fig. 5. Boxplots of 3D reconstruction resolutions obtained from three independent experiments on four Cryo-EM datasets using different particle detection methods: (a) CryoTransformer, (b) Topaz, (c) crYOLO, and (d) GTpick. Lower resolution values (Å) indicate better reconstruction quality. GTpick exhibits smaller variance and lower median values, demonstrating higher stability and superior reconstruction performance.Fig. 5

In addition to the boxplots, Table 4 summarizes the best reconstruction resolutions achieved across all datasets. Furthermore, the Fourier Shell Correlation (FSC) curves corresponding to these best reconstructions are provided in Fig. S1, further confirming the reliability and consistency of GTpick’s 3D reconstruction performance across different datasets. Meanwhile, Table S1 presents the complete resolution records from all three independent reconstruction experiments, offering a comprehensive quantitative reference. Overall, GTpick demonstrates higher reconstruction consistency and superior resolution performance across all datasets, showing a clear advantage over CryoTransformer, Topaz, and crYOLO.Table 4. Comparison of 3D reconstruction resolution (Å) of different methods on the EMPIAR dataset.Table 4EMPIAR IDNumber of Micrographs3D Resolution (Å)CryoTransformerTopazcrYOLOGTpick100813007.307.766.566.29100933005.356.366.285.26103453005.8210.266.665.17105323055.645.995.665.41109474005.297.497.314.93

The 3D reconstruction resolution was quantified using the Gold Standard Fourier Shell Correlation (GSFSC) criterion, which measures reconstruction accuracy and confidence. Lower GSFSC values correspond to higher fidelity and improved particle detection quality. As shown in Table 4, GTpick consistently achieved the best reconstruction resolution across all datasets. Specifically, GTpick reached 6.29 Å, 5.26 Å, 5.17 Å, 5.41 Å, and 4.93 Å on the EMPIAR-10081, EMPIAR-10093, EMPIAR-10345, EMPIAR-10532, and EMPIAR-10947 datasets, respectively—surpassing CryoTransformer, Topaz, and crYOLO. These results demonstrate that GTpick not only identifies a greater number of high-quality particles but also contributes to more accurate and higher-resolution 3D reconstructions.

Furthermore, the reconstructed 3D density maps corresponding to the best resolutions were analyzed using UCSF ChimeraX to visualize the local resolution distribution of each map, as shown in Fig. 6. This visualization intuitively reveals the spatial variation of resolution within the reconstructed volumes and further demonstrates that GTpick yields more uniform and higher-quality 3D reconstructions compared with other particle picking methods.Fig. 6. Local-resolution visualization of 3D reconstructions obtained from representative datasets using CryoTransformer, crYOLO, GTpick, and Topaz. Color bars indicate the local-resolution range from 2 Å (red, higher resolution) to 16 Å (blue, lower resolution). Compared with other particle-picking methods, GTpick yields more extensive high-resolution (red) regions and smoother density continuity, indicating improved structural fidelity and reconstruction consistency.Fig. 6

To further evaluate the structural consistency between the reconstructed density maps and experimentally determined atomic models, the corresponding PDB structures were fitted into the density maps for visual comparison. Specifically, the fitted atomic models corresponding to EMPIAR-10081 and EMPIAR-10532 are 5U6O [31] and 6WXB [32], respectively. The overlay results, shown in Fig. S2, demonstrate clear structural alignment and spatial correspondence between the atomic models and the reconstructed maps. These visualizations further confirm that the reconstructions derived from GTpick-detected particles exhibit higher fidelity and better agreement with the known atomic structures.

Finally, Fig. S3 presents the complete local resolution maps for all five datasets, further validating GTpick’s capability and consistency in 3D reconstruction across different datasets.

Cross-dataset generalization evaluation and model generalization analysis

4.3

To further evaluate the cross-dataset generalization capability of GTpick, an additional experiment was conducted where the model was trained solely on the EMPIAR-10081 dataset and tested on five independent CryoPPP datasets, including EMPIAR-10093, EMPIAR-10532, EMPIAR-11056, EMPIAR-10345, and EMPIAR-10947. Unlike the main experiment that adopted an 8:1:1 train–validation–test split across multiple datasets, this test aimed to simulate a low-data and cross-domain scenario more representative of practical Cryo-EM applications, where annotated particle data are often scarce. In this setting, EMPIAR-10081 contained only 270 training and 30 validation micrographs. The limited data volume and lack of diversity reduced the model’s exposure to varying defocus levels, contrast patterns, and molecular orientations, leading to a clear performance decline when applied to unseen datasets.

As summarized in Table 5, GTpick exhibited varying levels of performance across the five test datasets, with Precision ranging from 0.108 to 0.335, Recall from 0.365 to 0.693, and F1-scores between 0.187 and 0.383.Table 5. Performance metrics of the GTpick model trained on the EMPIAR-10081 dataset and evaluated on five independent CryoPPP datasets (EMPIAR-11056, EMPIAR-10093, EMPIAR-10345, EMPIAR-10532, and EMPIAR-10947) for cross-dataset generalization assessment.Table 5EMPIAR IDPrecisionRecallF1 Score100930.2090.4830.292105320.3160.4850.383110560.3350.3650.349103450.1080.6930.187109470.2430.3680.293

These fluctuations highlight the inherent sensitivity of deep learning–based particle pickers to dataset characteristics such as noise distribution, particle density, and imaging contrast.

Nevertheless, GTpick retained reasonable recall performance (e.g., 0.483 on EMPIAR-10093 and 0.693 on EMPIAR-10345), demonstrating its ability to identify potential particle regions even under domain shifts.

Discussion

5

In this study, we present GTpick, a novel detection method based on the DETR framework, designed to address the core challenges of low signal-to-noise ratios (SNRs) and densely distributed targets in cryo-electron microscopy (Cryo-EM) particle detection. This method significantly enhances target feature modeling and contextual representation while maintaining a compact network architecture and low training overhead, thereby improving detection accuracy and robustness. When integrated with the CryoSPARC 3D reconstruction workflow, GTpick produces higher-resolution density maps, leading to more accurate and reliable protein structure determination.

Evaluation on multiple EMPIAR datasets demonstrates that GTpick outperforms existing methods such as CryoTransformer and Topaz, achieving superior particle recognition and improved 3D reconstruction quality across multiple test sets. These findings further emphasize the critical role of high-quality particle detection in enabling accurate structural reconstruction.

Despite GTpick’s strong performance in particle count, recall rate, and reconstruction quality, the broader target coverage is accompanied by a certain level of false detections. Future research could focus on optimizing target matching mechanisms and post-processing strategies to further reduce false positives and improve precision across diverse datasets.

Moreover, an interesting and open question remains: human experts still outperform machine-learning-based methods in particle selection accuracy. This phenomenon highlights the sophisticated cognitive integration of spatial, contextual, and contrast cues that current algorithms cannot yet fully emulate. Nevertheless, we also emphasize the promising potential of emerging deep-learning paradigms, such as self-supervised learning, multimodal data integration, and domain adaptation, which may enable future methods to reach or even surpass human-level performance.

In such a scenario, reliance on human-labeled datasets as the sole ground truth will become increasingly limited, motivating the development of new evaluation frameworks based on structural consistency and density-map fidelity, thereby providing a more objective and comprehensive assessment of automated particle detection and structural reconstruction quality.

Code and data availability

To support reproducibility and further research, the full GTpick training and evaluation pipeline is available at: https://github.com/nshstart/GTpickcom3.git. The datasets used in this study are publicly available.

CRediT authorship contribution statement

Shenhuan Ni: Writing – review & editing, Methodology, Data curation, Conceptualization. Anthony Qian: Writing – review & editing. Ren Kong: Supervision, Conceptualization. Shan Chang: Writing – review & editing, Supervision, Methodology, Conceptualization. Chenghui Yang: Writing – review & editing, Conceptualization. Yutao Liu: Writing – review & editing, Conceptualization. Yuncong Zhang: Writing – review & editing, Conceptualization. Yiyan Shi: Writing – review & editing.

Funding

This work was financially supported by the funding received from the 10.13039/501100001809National Natural Science Foundation of China (Nos. 62373172).

Declaration of Competing Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Bibliography32

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Cheng Y., Grigorieff N., Penczek P.A., Walz T.J.C. A primer to single-particle cryo-electron microscopy. 2015;161(3):438-449.10.1016/j.cell.2015.03.050PMC 440965925910204 · doi ↗ · pubmed ↗
2Roseman A.M.J.U. Particle finding in electron micrographs using a fast local correlation algorithm. 2003;94(3-4):225-236.10.1016/s 0304-3991(02)00333-912524193 · doi ↗ · pubmed ↗
3Huang Z., Penczek P.A.J.Jo.S.B.. Application of template matching technique to particle detection in electron micrographs. 2004;145(1-2):29-40.10.1016/j.jsb.2003.11.00415065671 · doi ↗ · pubmed ↗
4Voss N., Yoshioka C., Radermacher M., Potter C., Carragher B.J.Josb. Do G Picker and Tilt Picker: software tools to facilitate particle selection in single particle electron microscopy. 2009;166(2):205-213.10.1016/j.jsb.2009.01.004PMC 276839619374019 · doi ↗ · pubmed ↗
5Heimowitz A.Andén J.Singer A.J.Josb APPLE picker: automatic particle picking, a low-effort cryo-EM framework J Struct Biol 204220182152273013415310.1016/j.jsb.2018.08.012PMC 6183064 · doi ↗ · pubmed ↗
6Wang F.Gong H.Liu G.Li M.Yan C.Xia T.Deep Picker: a deep learning approach for fully automated particle picking in cryo-EMJ Struct Biol 195320163253362742426810.1016/j.jsb.2016.07.006 · doi ↗ · pubmed ↗
7Simonyan K.A Japa Zisserman Very Deep convolutional Netw Large Scale Image Recognit 2014
8Xiao Y.Yang G.A fast method for particle picking in cryo-electron micrographs based on fast R-CNN. Aip conference proceedings 2017 AIP Publishing