A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models
Hirofumi Tsuruta, Hiroyuki Yamazaki, Ryota Maeda, Ryotaro Tamura, Akihiro Imura

TL;DR
This paper introduces AVIDa-SARS-CoV-2, a dataset of antibody-virus interactions, and VHHCorpus-2M, a large collection of VHH sequences, to evaluate and improve antibody language models for SARS-CoV-2 binding prediction.
Contribution
The paper provides the first SARS-CoV-2 VHH interaction dataset and a large antibody sequence corpus to benchmark and enhance antibody language models.
Findings
Pre-trained models like VHHBERT improve binding prediction accuracy.
The datasets enable better evaluation of antibody language models.
Benchmark results highlight the potential of AI in antibody discovery.
Abstract
Antibodies are crucial proteins produced by the immune system to eliminate harmful foreign substances and have become pivotal therapeutic agents for treating human diseases. To accelerate the discovery of antibody therapeutics, there is growing interest in constructing language models using antibody sequences. However, the applicability of pre-trained language models for antibody discovery has not been thoroughly evaluated due to the scarcity of labeled datasets. To overcome these limitations, we introduce AVIDa-SARS-CoV-2, a dataset featuring the antigen-variable domain of heavy chain of heavy chain antibody (VHH) interactions obtained from two alpacas immunized with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike proteins. AVIDa-SARS-CoV-2 includes binary labels indicating the binding or non-binding of diverse VHH sequences to 12 SARS-CoV-2 mutants, such as the…
Taxonomy
Topics: Influenza Virus Research Studies · Machine Learning in Bioinformatics · vaccines and immunoinformatics approaches
