Automated Protein Motif Localization using Concept Activation Vectors in Protein Language Model Embedding Space
Ahmad Shamail, Claire D. McWhite

TL;DR
This paper introduces a scalable and interpretable method for localizing protein motifs in sequences by leveraging pretrained protein language models and concept activation vectors, achieving over 85% F1 while requiring only pretrained models and a small CAV dictionary.
Contribution
It adapts concept activation vectors from computer vision to interpret and localize motifs in protein sequences using pretrained language models, providing a lightweight and effective annotation tool.
Findings
Achieves over 85% F1 score in motif localization
Effectively detects multiple motif instances within proteins
Requires only pretrained models and a small CAV dictionary
Abstract
We present an automated approach for identifying and annotating motifs and domains in protein sequences, using pretrained Protein Language Models (PLMs) and Concept Activation Vectors (CAVs), adapted from interpretability research in computer vision. We treat motifs as conceptual entities and represent them through learned CAVs in PLM embedding space by training simple linear classifiers to distinguish motif-containing from non-motif sequences. To identify motif occurrences, we extract embeddings for overlapping sequence windows and compute their inner products with motif CAVs. This scoring mechanism quantifies how strongly each sequence region expresses the motif concept and naturally detects multiple instances of the same motif within the same protein. Using a dataset of sixty-nine well-characterized motifs with curated positive and negative examples, our method achieves over 85% F1…
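The pipeline described in the abstract (fit a linear classifier, take its weight vector as the CAV, then score overlapping windows by inner product) can be sketched roughly as follows. This is a minimal illustration under loud assumptions, not the authors' implementation: `toy_embed` is a hypothetical stand-in that uses amino-acid composition in place of a real PLM embedding, the "CCHH"-style motif and all sequences are invented toy data, and the 0.9 threshold, window width, and training settings are arbitrary choices made only for this example.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_embed(seq):
    """Hypothetical stand-in for a PLM embedding: amino-acid composition."""
    vec = np.zeros(len(AMINO_ACIDS))
    for aa in seq:
        vec[AMINO_ACIDS.index(aa)] += 1.0
    return vec / max(len(seq), 1)


def train_cav(pos, neg, steps=500, lr=0.5):
    """Fit a logistic-regression classifier by gradient descent and
    return its unit-norm weight vector, used here as the motif CAV."""
    X = np.stack([toy_embed(s) for s in pos + neg])
    y = np.array([1.0] * len(pos) + [0.0] * len(neg))
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad = p - y                            # gradient of the log loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w / np.linalg.norm(w)


def score_windows(seq, cav, width, stride=1):
    """Return (start, score) pairs: inner product of each window
    embedding with the CAV."""
    starts = range(0, len(seq) - width + 1, stride)
    return [(s, float(toy_embed(seq[s:s + width]) @ cav)) for s in starts]


# Toy data (assumed for illustration only).
positives = ["CCHH", "CHCH", "HCCH", "HHCC"]
negatives = ["AAGG", "LLKK", "STST", "VVEE"]
cav = train_cav(positives, negatives)

# A toy protein with two planted motif-like windows, at offsets 4 and 14.
protein = "AAGGCCHHLLKKSTCHCHVV"
scores = score_windows(protein, cav, width=4)
best = max(score for _, score in scores)
hits = [start for start, score in scores if score >= 0.9 * best]
print(hits)
```

Because every window is scored independently, both planted windows pass the threshold, which mirrors the abstract's point that multiple instances of one motif in the same protein are detected naturally. With a real PLM, `toy_embed` would be replaced by the model's embedding of each window, and the threshold would need to be calibrated per motif.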
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Machine Learning in Bioinformatics · Bioinformatics and Genomic Networks
