A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models

Hirofumi Tsuruta; Hiroyuki Yamazaki; Ryota Maeda; Ryotaro Tamura; Akihiro Imura

arXiv:2405.18749·cs.LG·October 17, 2024·1 cites

TL;DR

This paper introduces AVIDa-SARS-CoV-2, a dataset of antibody-virus interactions, and VHHCorpus-2M, a large collection of VHH sequences, to evaluate and improve antibody language models for SARS-CoV-2 binding prediction.

Contribution

The paper provides the first SARS-CoV-2 VHH interaction dataset and a large antibody sequence corpus to benchmark and enhance antibody language models.

Findings

01. Pre-trained models like VHHBERT improve binding prediction accuracy.

02. The datasets enable better evaluation of antibody language models.

03. Benchmark results highlight the potential of AI in antibody discovery.

Abstract

Antibodies are crucial proteins produced by the immune system to eliminate harmful foreign substances and have become pivotal therapeutic agents for treating human diseases. To accelerate the discovery of antibody therapeutics, there is growing interest in constructing language models using antibody sequences. However, the applicability of pre-trained language models for antibody discovery has not been thoroughly evaluated due to the scarcity of labeled datasets. To overcome these limitations, we introduce AVIDa-SARS-CoV-2, a dataset featuring the antigen-variable domain of heavy chain of heavy chain antibody (VHH) interactions obtained from two alpacas immunized with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike proteins. AVIDa-SARS-CoV-2 includes binary labels indicating the binding or non-binding of diverse VHH sequences to 12 SARS-CoV-2 mutants, such as the…

Code & Models

Repositories

cognano/AVIDa-SARS-CoV-2
Official

Models

COGNANO/VHHBERT (Hugging Face model) · 143 downloads · 1 like

Datasets

Videos

A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models · SlidesLive

Taxonomy

Topics: Influenza Virus Research Studies · Machine Learning in Bioinformatics · Vaccines and Immunoinformatics Approaches