CAPSUL: A Comprehensive Human Protein Benchmark for Subcellular Localization
Yicheng Hu, Xinyu Lin, Shulin Li, Wenjie Wang, Fengbin Zhu, Fuli Feng

TL;DR
CAPSUL introduces a comprehensive dataset combining 3D structural data and detailed subcellular localization annotations, enabling improved structure-based models for protein localization tasks and biological interpretability.
Contribution
This work provides the first dataset integrating 3D structural information with localization annotations and evaluates models, highlighting the importance of structural features in localization prediction.
Findings
Structural features improve localization prediction accuracy.
Reweighting and single-label strategies facilitate model training.
Interpretability analysis reveals biologically meaningful patterns.
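This summary does not specify the reweighting scheme. As a minimal sketch, assuming inverse-frequency positive weights applied to a multi-label binary cross-entropy loss (a common heuristic, not necessarily the authors' method), the idea looks like this. The function names and the toy labels are hypothetical.

```python
import math

def positive_class_weights(labels):
    """Per-class weight = #negatives / #positives (a common heuristic, assumed here)."""
    n = len(labels)
    n_classes = len(labels[0])
    weights = []
    for c in range(n_classes):
        pos = sum(row[c] for row in labels)
        neg = n - pos
        weights.append(neg / max(pos, 1))  # guard against classes with no positives
    return weights

def weighted_bce(probs, labels, pos_weight, eps=1e-7):
    """Mean binary cross-entropy with the positive term scaled per class."""
    total, count = 0.0, 0
    for p_row, y_row in zip(probs, labels):
        for p, y, w in zip(p_row, y_row, pos_weight):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total += -(w * y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count

# Toy example: 4 proteins, 2 hypothetical compartments (the second one is rare).
toy_labels = [[1, 0], [1, 0], [1, 1], [0, 0]]
toy_weights = positive_class_weights(toy_labels)
```

With these weights, a missed positive in the rare class costs more than one in the common class, which is the intended effect of reweighting under class imbalance.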
Abstract
Subcellular localization is a crucial biological task for drug target identification and function annotation. Although it has been biologically realized that subcellular localization is closely associated with protein structure, no existing dataset offers comprehensive 3D structural information with detailed subcellular localization annotations, thus severely hindering the application of promising structure-based models on this task. To address this gap, we introduce a new benchmark called CAPSUL, a Comprehensive humAn Protein benchmark for SUbcellular Localization. It features a dataset that integrates diverse 3D structural representations with fine-grained subcellular localization annotations carefully curated by domain experts. We evaluate this benchmark using a variety of state-of-the-art sequence-based and…
Peer Reviews
Decision: ICLR 2026 Poster
1. CAPSUL combines 3D structural data with detailed subcellular localization annotations, enabling the development of structure-aware models.
2. The dataset includes 20 subcellular compartments, verified by domain experts, ensuring biological accuracy and interpretability.
3. The authors benchmark good structure-based models and propose reweighting and single-label classification strategies to mitigate class imbalance, showing improvements in underrepresented classes.
1. The current evaluation is limited to supervised multi-label classification. There is no attempt to leverage structural self-supervised learning or contrastive learning, which are promising directions for structure-aware protein modeling. 2. The structure encoders used are standard graph-based models. More advanced geometric deep learning methods (e.g., SE(3)-equivariant networks, structural diffusion models) are not explored, potentially limiting the upper bound of structural understanding. 3
**Originality and significance**
* CAPSUL fills a clear and impactful gap in current bio-ML resources by providing the first benchmark where **structural and sequence modalities can be directly compared** for subcellular localization.
* The study yields a **quantitative insight into the trade-off between pre-training and structure**, showing that a modestly sized structure-aware model can match the performance of billion-parameter sequence-only models trained on hundreds of millions of proteins
* The **structure–sequence trade-off** is not yet quantified in a controlled architectural setting. While the results suggest that explicit structure compensates for the absence of large-scale pre-training, a **single unified model trained in both modalities** (e.g., ESM backbone + structural tokens) would enable direct estimation of their relative contributions.
* The paper stops short of leveraging the **hierarchical organization of the labels**. Implementing or evaluating hierarchical loss fu
1. Unified access to structure (Cα, 3Di tokens) plus fine-grained labels and evidence levels; a clear advance beyond existing sequence-only/coarse-label datasets.
2. Broad coverage of representative sequence and structure baselines; reasonable class-imbalance mitigations (reweighting, focal, single-label) and a “randomized structure” ablation that is logically sound.
1. Evidence-level integration may introduce bias: treating non-experimental annotations as positives could inflate text biases.
2. Graph-construction details are missing, i.e., edge criteria (kNN/sequence adjacencies), edge features (relative orientation/distance encodings), normalization, and length truncation.
3. Consider providing failure-case analyses (e.g., low pLDDT regions, disordered segments).
4. Fix minor typos and keep notation consistent.
5. It is better to add fusion basel
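To make the graph-construction question concrete, here is a minimal sketch of one of the edge criteria the reviewer names: a k-nearest-neighbor graph over Cα coordinates with the distance kept as an edge feature. It is purely illustrative. The function name `knn_edges`, the value of `k`, and the toy coordinates are assumptions, and nothing here is taken from the CAPSUL pipeline.

```python
import math

def knn_edges(coords, k=2):
    """Return directed edges (i, j, distance) from each residue to its k nearest residues.

    coords: list of (x, y, z) C-alpha positions, one per residue.
    Ties in distance are broken by the lower residue index.
    """
    edges = []
    for i, ci in enumerate(coords):
        # Distance from residue i to every other residue, sorted ascending.
        dists = sorted(
            (math.dist(ci, cj), j) for j, cj in enumerate(coords) if j != i
        )
        edges.extend((i, j, d) for d, j in dists[:k])
    return edges

# Toy backbone: three residues close together and one far away (hypothetical values).
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
toy_edges = knn_edges(toy_coords, k=1)
```

A full specification would also state whether sequence-adjacent residues are always connected, how distances are encoded or normalized, and how long chains are truncated, which are the details the reviewer asks for.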
Taxonomy
Topics: Machine Learning in Bioinformatics · Cell Image Analysis Techniques · Bioinformatics and Genomic Networks
