SCOPE: Siamese Contrastive Operon Pair Embeddings for Functional Sequence Representation and Classification
Akarsh Gupta, Kenneth Rodrigues, Sagnik Chatterjee

TL;DR
This paper evaluates protein language model embeddings for operon pair classification, showing that they outperform traditional physicochemical features in ROC-AUC and can support scalable genome annotation.
Contribution
It introduces a Siamese MLP approach that leverages pre-trained embeddings, providing a competitive and scalable method for operon classification.
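As a rough illustration of what a Siamese MLP over pre-trained embeddings can look like, the sketch below passes both embeddings of a gene pair through one shared encoder and scores the pair by cosine similarity in the learned space. The layer sizes, the ReLU activation, the margin-based contrastive loss, and all function names are assumptions made for illustration. The summary does not specify the architecture or training objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden_dim, out_dim):
    # Hypothetical two-layer encoder; the sizes are illustrative, not from the paper.
    return {
        "W1": rng.normal(0.0, 0.1, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.1, (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def encode(params, x):
    # The same parameters are applied to both members of a pair (the "Siamese" part).
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])  # ReLU hidden layer
    return h @ params["W2"] + params["b2"]

def siamese_score(params, emb_a, emb_b):
    # Cosine similarity between the two projected embeddings, one score per pair.
    za, zb = encode(params, emb_a), encode(params, emb_b)
    za = za / (np.linalg.norm(za, axis=-1, keepdims=True) + 1e-12)
    zb = zb / (np.linalg.norm(zb, axis=-1, keepdims=True) + 1e-12)
    return np.sum(za * zb, axis=-1)

def contrastive_loss(params, emb_a, emb_b, labels, margin=1.0):
    # Assumed margin-based contrastive loss: pull operonic pairs (label 1) together,
    # push non-operonic pairs (label 0) at least `margin` apart.
    d = np.linalg.norm(encode(params, emb_a) - encode(params, emb_b), axis=-1)
    pos = labels * d ** 2
    neg = (1 - labels) * np.maximum(0.0, margin - d) ** 2
    return float(np.mean(pos + neg))

# Toy usage with random stand-ins for protein language model embeddings.
params = init_mlp(in_dim=32, hidden_dim=16, out_dim=8)
emb_a = rng.normal(size=(5, 32))
emb_b = rng.normal(size=(5, 32))
labels = np.array([1, 0, 1, 0, 1])
scores = siamese_score(params, emb_a, emb_b)
loss = contrastive_loss(params, emb_a, emb_b, labels)
```

Because the encoder weights are shared, the score is symmetric in the two genes, which suits an unordered pair classification task.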
Findings
Protein language model embeddings outperform physicochemical features in ROC-AUC.
The Siamese MLP achieves a ROC-AUC of 0.71, competitive with the state of the art.
Embedding space geometry captures functional relationships effectively.
Abstract
Identifying operons is a fundamental step in understanding prokaryotic gene regulation, as classifying genes into operons supports the reconstruction of regulatory networks, functional annotation of unannotated genes, and drug candidate development. Experimental approaches such as RT-PCR and RNA-seq provide precise evidence of operon structure, but are laborious and largely limited to well-studied model organisms, making scalable computational methods essential for genome-wide operon identification. Prior computational approaches have employed traditional classifiers such as logistic regression and decision trees, motivating our use of these as physicochemical baselines. The DGEB benchmark evaluates operonic pair classification by embedding each sequence independently with a pre-trained protein language model and computing pairwise cosine similarity. In contrast, our Siamese MLP learns…
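The cosine-similarity baseline described in the abstract can be sketched in a few lines: each gene in a pair is embedded independently, the pair is scored by the cosine similarity of the two embeddings, and the scores are evaluated with ROC-AUC. The function names and toy vectors below are hypothetical, and the ROC-AUC is computed with a plain rank-based definition rather than any specific benchmark code.

```python
import numpy as np

def cosine_similarity(a, b):
    # Row-wise cosine similarity between two (n_pairs, dim) embedding matrices.
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    return np.sum(a_n * b_n, axis=1)

def roc_auc(labels, scores):
    # Probability that a random positive pair outscores a random negative pair;
    # ties count as one half.
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative pair")
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Toy embeddings standing in for protein language model outputs (made-up values).
emb_a = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
emb_b = np.array([[1.0, 0.0], [1.0, 0.9], [1.0, 0.0], [-1.0, 0.0]])
labels = np.array([1, 1, 0, 0])  # 1 = same operon, 0 = different operons

pair_scores = cosine_similarity(emb_a, emb_b)
auc = roc_auc(labels, pair_scores)
```

This baseline has no trainable parameters, so its ROC-AUC depends entirely on how well the pre-trained embedding geometry already separates operonic from non-operonic pairs.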
