AbBiBench: A Benchmark for Antibody Binding Affinity Maturation and Design
Xinyan Zhao, Yi-Ching Tang, Akshita Singh, Victor J Cantu, KwanHo An, Junseok Lee, Adam E Stogsdill, Ibraheem M Hamdi, Ashwin Kumar Ramesh, Zhiqiang An, Xiaoqian Jiang, Yejin Kim

TL;DR
AbBiBench is a benchmarking framework that evaluates how well protein models score full antibody-antigen complexes for affinity maturation and design, using more than 184,500 experimental measurements across 14 antibodies and 9 antigens.
Contribution
It introduces an evaluation framework that treats the antibody-antigen complex, rather than the antibody in isolation, as the unit of evaluation, and systematically compares 15 protein models on that basis.
Findings
Structure-conditioned inverse folding models outperform others in affinity correlation.
Top models effectively generate antibody variants with improved binding.
Framework facilitates development of more effective antibody design models.
Abstract
We introduce AbBiBench (Antibody Binding Benchmarking), a benchmarking framework for antibody binding affinity maturation and design. Unlike previous strategies that evaluate antibodies in isolation, typically by comparing them to natural sequences with metrics such as amino acid recovery rate or structural RMSD, AbBiBench treats the antibody-antigen (Ab-Ag) complex as the fundamental unit. It evaluates an antibody design's binding potential by measuring how well a protein model scores the full Ab-Ag complex. We first curate, standardize, and share more than 184,500 experimental measurements of antibody mutants across 14 antibodies and 9 antigens (including influenza, lysozyme, HER2, VEGF, integrin, Ang2, and SARS-CoV-2), covering both heavy-chain and light-chain mutations. Using these datasets, we systematically compare 15 protein models including masked language models,…
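To make the idea of "affinity correlation" concrete, here is a minimal sketch of how a model's scores for a set of Ab-Ag complex variants could be compared against measured binding. It assumes Spearman rank correlation as the metric, which the summary above does not name. The function names and the example numbers are hypothetical illustrations, not data or code from the benchmark.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(model_scores, measured_affinity):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(model_scores) != len(measured_affinity) or len(model_scores) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    return pearson(average_ranks(model_scores), average_ranks(measured_affinity))


# Hypothetical example: one model score per mutant complex (higher = more
# favorable) and one measured binding value per mutant (higher = stronger).
example_scores = [-12.1, -10.4, -11.3, -9.8, -13.0]
example_binding = [7.2, 8.6, 7.9, 8.1, 6.5]
example_rho = spearman(example_scores, example_binding)
```

A rank-based metric is a reasonable choice for this kind of comparison because model scores and experimental readouts are on different scales, so only the ordering of variants is directly comparable.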
Peer Reviews
Decision·Submitted to ICLR 2026
- Assembling, curating, and standardizing over 186,580 experimental measurements is a very appreciated contribution to the field. Making this dataset public will accelerate future model development.
- The authors performed in vitro ELISA assays on 21 designed variants and showed a clear gain-of-function (H1N1 binding) that the wild-type antibody lacked. This validates that the benchmark's top models can be used in a practical, successful design campaign.
- The benchmark, and its in vitro validation, focuses on binding affinity (Kd or ELISA OD signals). In a therapeutic context, the ultimate goal is function (e.g., neutralization, measured by IC50). But affinity is a common proxy used in most computational studies.
- While the benchmark's focus on binding affinity is important, it doesn't capture the full picture. Antibody design is a multi-parameter optimization problem, and the authors acknowledge their work would be more beneficial if it inclu
1. The benchmark consists of a wide range of antibodies, antigens, and model architectures, which allows comprehensive and biologically informed evaluation for antibody design models.
2. The study includes in vitro validation, providing strong experimental evidence that supports the findings of the computational benchmark.
1. While AbBiBench provides a biologically meaningful benchmarking pipeline, it primarily integrates existing protein and antibody machine learning models without introducing novel machine learning methodologies.
2. The antibody generation by sampling from the models focuses on a single antigen, influenza H1N1. This limits the generalizability of the generation results across diverse antigen targets.
The paper is clear and easy to follow, with only a few minor unclear parts (especially section 2.1). The authors' idea for the benchmarking framework, and the benchmark itself, are valuable and could be useful in the field. The paper also includes experimental validation, which is a big strength in my opinion.
The main weakness of the paper, in my opinion, is the scarcity and lack of novelty of the aggregated datasets that are subsequently used in the benchmarks. The authors present 15 individual datasets (coming from around 8 individual research articles) featuring roughly 200k measurements. Although this number may seem impressive, virtually every one of these datasets was already used as a benchmark in several ML-related publications, and the majority of them are available publicly in easy to par
Taxonomy
Methods: Sparse Evolutionary Training
