SafeBench-Seq: A Homology-Clustered, CPU-Only Baseline for Protein Hazard Screening with Physicochemical/Composition Features and Cluster-Aware Confidence Intervals
Muhammad Haris Khan

TL;DR
SafeBench-Seq provides a reproducible, CPU-only baseline for protein hazard screening using interpretable features and homology-aware evaluation, addressing biosecurity risks with rigorous, cluster-controlled benchmarking.
Contribution
The paper introduces SafeBench-Seq, a homology-clustered, metadata-only benchmark and baseline classifier for protein hazard screening, built from public data and interpretable features.
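To make "interpretable features" concrete, here is a minimal sketch of an amino-acid composition vector plus one global physicochemical descriptor. The function name `composition_features` is hypothetical, and the choice of mean Kyte-Doolittle hydropathy (GRAVY) as the descriptor is an assumption made for illustration; the paper's exact descriptor set is not listed on this page.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale. Using it here is an illustrative assumption,
# not a statement of which descriptors SafeBench-Seq actually computes.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition_features(seq):
    """Return 20 amino-acid fractions followed by mean hydropathy (GRAVY).

    Non-standard residue codes (e.g. X, B, Z) are ignored.
    """
    counts = Counter(ch for ch in seq.upper() if ch in KD)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("sequence contains no standard residues")
    fractions = [counts[aa] / n for aa in AMINO_ACIDS]
    gravy = sum(KD[aa] * counts[aa] for aa in AMINO_ACIDS) / n
    return fractions + [gravy]
```

Each sequence maps to a fixed-length vector of 21 numbers regardless of its length, which is what lets a simple linear or tree model run on a CPU without any learned embedding.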
Findings
Homology-aware evaluation reduces overestimation of robustness.
Calibrated linear models show good probability calibration.
Tree ensembles have slightly higher Brier scores and expected calibration error (ECE), i.e., slightly worse calibration.
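For readers unfamiliar with the two calibration metrics named above, the following sketch computes both from labels and predicted probabilities. The function names and the use of 10 equal-width bins for ECE are assumptions; the page does not state the binning the authors used.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-width probability bins (the bin count is an assumption).

    For each non-empty bin, compare the observed positive rate with the mean
    predicted probability, then average the gaps weighted by bin size.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += (len(members) / n) * abs(observed - predicted)
    return ece
```

Lower is better for both metrics: a Brier score of 0 means every probability matched its label exactly, and an ECE of 0 means predicted probabilities match observed frequencies in every bin.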
Abstract
Foundation models for protein design raise concrete biosecurity risks, yet the community lacks a simple, reproducible baseline for sequence-level hazard screening that is explicitly evaluated under homology control and runs on commodity CPUs. We introduce SafeBench-Seq, a metadata-only, reproducible benchmark and baseline classifier built entirely from public data (SafeProtein hazards and UniProt benigns) and interpretable features (global physicochemical descriptors and amino-acid composition). To approximate "never-before-seen" threats, we homology-cluster the combined dataset at ≤40% identity and perform cluster-level holdouts (no cluster overlap between train and test). We report discrimination (AUROC/AUPRC) and screening operating points (TPR@1% FPR; FPR@95% TPR) with 95% bootstrap confidence intervals (n=200), and we provide calibrated probabilities via CalibratedClassifierCV…
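The evaluation protocol in the abstract can be sketched in a few stdlib-only functions. This is not the authors' code: the names `cluster_holdout`, `tpr_at_fpr`, and `cluster_bootstrap_ci` are hypothetical, the cluster IDs are assumed to come precomputed from an external homology-clustering tool, and resampling whole clusters with a percentile interval is one plausible reading of "cluster-aware confidence intervals" rather than a confirmed detail.

```python
import random


def cluster_holdout(cluster_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that no cluster appears in both train and test."""
    clusters = sorted(set(cluster_ids))
    random.Random(seed).shuffle(clusters)
    n_test = max(1, round(test_fraction * len(clusters)))
    test_clusters = set(clusters[:n_test])
    train = [i for i, c in enumerate(cluster_ids) if c not in test_clusters]
    test = [i for i, c in enumerate(cluster_ids) if c in test_clusters]
    return train, test


def tpr_at_fpr(y_true, scores, max_fpr=0.01):
    """Highest TPR reachable by a score threshold whose FPR stays <= max_fpr."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        return float("nan")
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Consume tied scores together: a threshold cannot separate them.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / neg > max_fpr:
            break
        best = tp / pos
    return best


def cluster_bootstrap_ci(y_true, scores, cluster_ids, metric,
                         n_boot=200, alpha=0.05, seed=0):
    """Percentile CI obtained by resampling whole clusters with replacement."""
    rng = random.Random(seed)
    by_cluster = {}
    for i, c in enumerate(cluster_ids):
        by_cluster.setdefault(c, []).append(i)
    clusters = sorted(by_cluster)
    stats = []
    for _ in range(n_boot):
        drawn = rng.choices(clusters, k=len(clusters))
        idx = [i for c in drawn for i in by_cluster[c]]
        value = metric([y_true[i] for i in idx], [scores[i] for i in idx])
        if value == value:  # drop NaN from resamples that lack a class
            stats.append(value)
    if not stats:
        raise ValueError("no valid bootstrap resamples")
    stats.sort()
    lower = stats[int((alpha / 2) * (len(stats) - 1))]
    upper = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lower, upper
```

Resampling clusters instead of individual sequences keeps near-duplicate homologs together, so the interval reflects variation across protein families and is not artificially narrowed by redundant sequences. The default `n_boot=200` mirrors the bootstrap count quoted in the abstract.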
Taxonomy
Topics: Bioinformatics and Genomic Networks · Machine Learning in Bioinformatics · Vaccines and Immunoinformatics Approaches
