CAPSUL: A Comprehensive Human Protein Benchmark for Subcellular Localization

Yicheng Hu; Xinyu Lin; Shulin Li; Wenjie Wang; Fengbin Zhu; Fuli Feng

arXiv:2603.18571·cs.AI·March 20, 2026

TL;DR

CAPSUL introduces a comprehensive dataset combining 3D structural data and detailed subcellular localization annotations, enabling improved structure-based models for protein localization tasks and biological interpretability.

Contribution

This work provides the first dataset integrating 3D structural information with subcellular localization annotations and evaluates sequence-based and structure-based models on it, highlighting the importance of structural features in localization prediction.

Findings

01

Structural features improve localization prediction accuracy.

02

Reweighting and single-label classification strategies mitigate class imbalance during model training.

03

Interpretability analysis reveals biologically meaningful patterns.
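
As a rough illustration of the reweighting idea in finding 02, the sketch below computes a per-class positive weight (negatives divided by positives) from multi-hot labels and applies it inside a binary cross-entropy loss. The function names, the negative-to-positive weighting rule, and the toy labels are all assumptions made for illustration; the page does not state which reweighting scheme the authors use.

```python
import math

def positive_class_weights(labels):
    """Per-class weight = (#negatives / #positives), from multi-hot label rows.

    Hypothetical helper: one common reweighting rule, not necessarily the paper's.
    """
    n_samples = len(labels)
    n_classes = len(labels[0])
    weights = []
    for c in range(n_classes):
        pos = sum(row[c] for row in labels)
        neg = n_samples - pos
        # Fall back to a neutral weight when a class has no positive example.
        weights.append(neg / pos if pos > 0 else 1.0)
    return weights

def weighted_bce(probs, targets, pos_weights, eps=1e-7):
    """Mean binary cross-entropy over classes, with positives scaled by pos_weights."""
    total = 0.0
    for p, t, w in zip(probs, targets, pos_weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(w * t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

# Toy example: class 0 is common, class 1 is rare.
toy_labels = [[1, 0], [1, 0], [1, 1], [0, 0]]
toy_weights = positive_class_weights(toy_labels)
toy_loss = weighted_bce([0.5, 0.5], [1, 1], toy_weights)
```

With these toy labels the rare class receives the larger weight, so a missed positive for it contributes more to the loss than a missed positive for the common class.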

Abstract

Subcellular localization is a crucial biological task for drug target identification and function annotation. Although it has been biologically realized that subcellular localization is closely associated with protein structure, no existing dataset offers comprehensive 3D structural information with detailed subcellular localization annotations, thus severely hindering the application of promising structure-based models on this task. To address this gap, we introduce a new benchmark called CAPSUL, a Comprehensive humAn Protein benchmark for SUbcellular Localization. It features a dataset that integrates diverse 3D structural representations with fine-grained subcellular localization annotations carefully curated by domain experts. We evaluate this benchmark using a variety of state-of-the-art sequence-based and…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. CAPSUL combines 3D structural data with detailed subcellular localization annotations, enabling the development of structure-aware models.
2. The dataset includes 20 subcellular compartments, verified by domain experts, ensuring biological accuracy and interpretability.
3. The authors benchmark good structure-based models and propose reweighting and single-label classification strategies to mitigate class imbalance, showing improvements in underrepresented classes.

Weaknesses

1. The current evaluation is limited to supervised multi-label classification. There is no attempt to leverage structural self-supervised learning or contrastive learning, which are promising directions for structure-aware protein modeling.
2. The structure encoders used are standard graph-based models. More advanced geometric deep learning methods (e.g., SE(3)-equivariant networks, structural diffusion models) are not explored, potentially limiting the upper bound of structural understanding.
3

Reviewer 02 · Rating 8 · Confidence 4

Strengths

**Originality and significance**

* CAPSUL fills a clear and impactful gap in current bio-ML resources by providing the first benchmark where **structural and sequence modalities can be directly compared** for subcellular localization.
* The study yields a **quantitative insight into the trade-off between pre-training and structure**, showing that a modestly sized structure-aware model can match the performance of billion-parameter sequence-only models trained on hundreds of millions of proteins

Weaknesses

* The **structure–sequence trade-off** is not yet quantified in a controlled architectural setting. While the results suggest that explicit structure compensates for the absence of large-scale pre-training, a **single unified model trained in both modalities** (e.g., ESM backbone + structural tokens) would enable direct estimation of their relative contributions.
* The paper stops short of leveraging the **hierarchical organization of the labels**. Implementing or evaluating hierarchical loss fu

Reviewer 03 · Rating 4 · Confidence 1

Strengths

1. Unified access to structure (Cα, 3Di tokens) plus fine-grained labels and evidence levels; a clear advance beyond existing sequence-only/coarse-label datasets.
2. Broad coverage of representative sequence and structure baselines; reasonable class-imbalance mitigations (reweighting, focal, single-label) and a “randomized structure” ablation that is logically sound.
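
The "focal" mitigation mentioned above presumably refers to focal loss, which down-weights easy examples so that rare, hard classes dominate the gradient. The following is a minimal per-label sketch of the standard binary focal loss; the function name is hypothetical, and `gamma=2.0` and `alpha=0.25` are commonly used defaults, not values reported on this page.

```python
import math

def binary_focal_loss(p, t, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one predicted probability p and one binary target t.

    Assumed form: -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the
    probability assigned to the true label. Hyperparameters are illustrative.
    """
    p = min(max(p, eps), 1.0 - eps)          # clamp to avoid log(0)
    p_t = p if t == 1 else 1.0 - p           # probability of the true label
    alpha_t = alpha if t == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct prediction is penalized far less than a confidently wrong one.
easy_positive = binary_focal_loss(0.9, 1)
hard_positive = binary_focal_loss(0.1, 1)
```

Setting `gamma=0` removes the modulating factor and leaves an alpha-weighted cross-entropy, which makes the relationship to plain reweighting easy to see.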

Weaknesses

1. Evidence-level integration may introduce bias: treating non-experimental annotations as positives could inflate text biases.
2. Graph-construction details are missing, i.e., edge criteria (kNN/sequence adjacencies), edge features (relative orientation/distance encodings), normalization, and length truncation.
3. The authors are encouraged to provide failure-case analyses (e.g., low pLDDT regions, disordered segments).
4. Minor typos should be fixed and notation kept consistent.
5. It is better to add fusion basel
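
To make the edge criteria named in point 2 concrete, here is a small sketch of two common ways to build residue-graph edges from Cα coordinates: k-nearest-neighbor edges in 3D space and sequence-adjacency edges. This is only an illustration of the kind of detail the reviewer asks for; the function names, `k`, and the toy coordinates are assumptions, and the page does not say how the authors construct their graphs.

```python
import math

def knn_edges(coords, k=2):
    """Directed edges (i, j) from each residue i to its k nearest residues by Euclidean distance."""
    n = len(coords)
    edges = set()
    for i in range(n):
        # Sort candidate neighbors by (distance, index) so ties break deterministically.
        neighbors = sorted(
            (math.dist(coords[i], coords[j]), j) for j in range(n) if j != i
        )
        for _, j in neighbors[:k]:
            edges.add((i, j))
    return edges

def sequential_edges(n):
    """Bidirectional edges between residues that are adjacent in the sequence."""
    forward = {(i, i + 1) for i in range(n - 1)}
    backward = {(i + 1, i) for i in range(n - 1)}
    return forward | backward

# Toy Calpha trace: three nearby residues and one distant residue.
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
toy_graph = knn_edges(toy_coords, k=1) | sequential_edges(len(toy_coords))
```

Taking the union of both edge sets is one possible design; it keeps the chain connected even when spatial neighbors are sparse, as for the distant residue in the toy trace.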

Code & Models

Datasets

getbetterhyccc/CAPSUL
dataset· 21 dl

Taxonomy

Topics: Machine Learning in Bioinformatics · Cell Image Analysis Techniques · Bioinformatics and Genomic Networks