SafeBench-Seq: A Homology-Clustered, CPU-Only Baseline for Protein Hazard Screening with Physicochemical/Composition Features and Cluster-Aware Confidence Intervals

Muhammad Haris Khan

arXiv:2512.17527 · cs.LG · December 22, 2025

TL;DR

SafeBench-Seq provides a reproducible, CPU-only baseline for protein hazard screening using interpretable features and homology-aware evaluation, addressing biosecurity risks with rigorous, cluster-controlled benchmarking.

Contribution

The paper introduces SafeBench-Seq, a homology-clustered, metadata-only benchmark and baseline classifier for protein hazard screening, built from public data and interpretable features.
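As an illustration of what an interpretable composition feature can look like, the sketch below computes the fraction of each of the 20 standard amino acids in a sequence. It is a minimal example and not the paper's feature code: the function name `aa_composition`, the residue ordering, the handling of non-standard characters, and the toy sequence are all assumptions made here.

```python
from collections import Counter

# The 20 standard residues, in an arbitrary but fixed order (assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Return the fraction of each standard amino acid in `seq` (20 values).

    Characters outside the 20 standard residues are ignored; this is an
    illustrative choice, not something stated in the paper.
    """
    seq = seq.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


# Toy sequence, chosen only to exercise the function.
features = aa_composition("MKTAYIAKQR")
```

Each sequence maps to a fixed-length vector whose entries sum to 1, so a linear model's coefficients can be read directly as per-residue weights.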

Findings

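01. Evaluating on held-out homology clusters reduces the overestimation of robustness that arises when related sequences appear in both the training and test sets.

02. Calibrated linear models show good probability calibration.

03. Tree ensembles have slightly higher Brier scores and expected calibration error (ECE) than the calibrated linear models, i.e. slightly worse calibration.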
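For readers unfamiliar with the two calibration metrics named in the findings, here is a minimal sketch of how they are commonly computed for binary labels. The binning scheme (10 equal-width bins) and the function names are assumptions made for illustration; the listing does not say which variant of ECE the paper uses.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-width probability bins (an assumed, common variant).

    For each bin, compare the mean predicted probability with the observed
    fraction of positives, then average the gaps weighted by bin size.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for _, p in members) / len(members)
        frac_pos = sum(y for y, _ in members) / len(members)
        ece += (len(members) / n) * abs(frac_pos - mean_prob)
    return ece


# Toy labels and probabilities, for illustration only.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.8, 0.9]
bs = brier_score(y_true, y_prob)
ece = expected_calibration_error(y_true, y_prob)
```

Lower is better for both metrics. The Brier score mixes calibration with sharpness, while ECE isolates the gap between predicted and observed frequencies.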

Abstract

Foundation models for protein design raise concrete biosecurity risks, yet the community lacks a simple, reproducible baseline for sequence-level hazard screening that is explicitly evaluated under homology control and runs on commodity CPUs. We introduce SafeBench-Seq, a metadata-only, reproducible benchmark and baseline classifier built entirely from public data (SafeProtein hazards and UniProt benigns) and interpretable features (global physicochemical descriptors and amino-acid composition). To approximate "never-before-seen" threats, we homology-cluster the combined dataset at <=40% identity and perform cluster-level holdouts (no cluster overlap between train/test). We report discrimination (AUROC/AUPRC) and screening-operating points (TPR@1% FPR; FPR@95% TPR) with 95% bootstrap confidence intervals (n=200), and we provide calibrated probabilities via CalibratedClassifierCV…
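The evaluation protocol described in the abstract (cluster-level holdouts, a screening operating point, bootstrap intervals) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the cluster labels are hypothetical and assumed to come from an external homology-clustering step, the helper names are invented here, and resampling whole clusters is one plausible reading of "cluster-aware" intervals rather than a confirmed detail.

```python
import random


def cluster_holdout(cluster_ids, test_frac=0.2, seed=0):
    """Split sample indices so that no cluster appears in both train and test."""
    rng = random.Random(seed)
    clusters = sorted(set(cluster_ids))
    rng.shuffle(clusters)
    n_test = max(1, round(test_frac * len(clusters)))
    test_clusters = set(clusters[:n_test])
    train = [i for i, c in enumerate(cluster_ids) if c not in test_clusters]
    test = [i for i, c in enumerate(cluster_ids) if c in test_clusters]
    return train, test


def tpr_at_fpr(y_true, scores, max_fpr=0.01):
    """Highest TPR reachable by a score threshold whose FPR is <= max_fpr."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        return float("nan")
    best = 0.0  # a threshold above every score gives TPR = FPR = 0
    for t in sorted(set(scores)):
        fpr = sum(s >= t for s in neg) / len(neg)
        if fpr <= max_fpr:
            best = max(best, sum(s >= t for s in pos) / len(pos))
    return best


def cluster_bootstrap_ci(y_true, scores, cluster_ids, metric, n_boot=200, seed=0):
    """Approximate 95% percentile interval, resampling clusters with replacement."""
    rng = random.Random(seed)
    by_cluster = {}
    for i, c in enumerate(cluster_ids):
        by_cluster.setdefault(c, []).append(i)
    clusters = sorted(by_cluster)
    stats = []
    for _ in range(n_boot):
        drawn = [rng.choice(clusters) for _ in clusters]
        idx = [i for c in drawn for i in by_cluster[c]]
        value = metric([y_true[i] for i in idx], [scores[i] for i in idx])
        if value == value:  # drop NaN replicates (a class was missing)
            stats.append(value)
    if not stats:
        raise ValueError("no valid bootstrap replicates")
    stats.sort()
    return stats[int(0.025 * (len(stats) - 1))], stats[int(0.975 * (len(stats) - 1))]


# Toy data: hypothetical cluster labels, binary hazard labels, model scores.
cluster_ids = ["c1", "c1", "c2", "c2", "c3", "c3", "c4", "c4", "c5", "c5"]
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.95, 0.05]

train_idx, test_idx = cluster_holdout(cluster_ids, test_frac=0.4)
point = tpr_at_fpr(y_true, scores)
ci_low, ci_high = cluster_bootstrap_ci(y_true, scores, cluster_ids, tpr_at_fpr)
```

Resampling clusters rather than individual sequences keeps near-duplicate homologs together, so the interval reflects variation across sequence families instead of across correlated samples.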

Taxonomy

Topics: Bioinformatics and Genomic Networks · Machine Learning in Bioinformatics · Vaccines and Immunoinformatics Approaches