AFD-INSTRUCTION: A Comprehensive Antibody Instruction Dataset with Functional Annotations for LLM-Based Understanding and Design

Ling Luo; Wenbin Jiang; Hongyuan Chang; Xinkang Wang; Xushi Zhang; Yueting Xiong; Mengsha Tong; Rongshan Yu

arXiv:2602.04916·q-bio.QM·May 21, 2026

TL;DR

AFD-Instruction is a large-scale dataset with functional annotations that enhances LLMs' ability to interpret and design antibodies through natural language, advancing therapeutic discovery.

Contribution

It introduces the first comprehensive antibody instruction dataset with functional annotations, enabling improved understanding and de novo design of antibodies using LLMs.

Findings

01 · Instruction tuning with AFD-Instruction improves LLM performance on antibody tasks.

02 · The dataset links antibody sequences with functional descriptions.

03 · Experiments show enhanced antibody understanding and design capabilities.

Abstract

Large language models (LLMs) have significantly advanced protein representation learning. However, their capacity to interpret and design antibodies through natural language remains limited. To address this challenge, we present AFD-Instruction, the first large-scale instruction dataset with functional annotations tailored to antibodies. This dataset encompasses two key components: antibody understanding, which infers functional attributes directly from sequences, and antibody design, which enables de novo sequence generation under functional constraints. These components provide explicit sequence-function alignment and support antibody design guided by natural language instructions. Extensive instruction-tuning experiments on general-purpose LLMs demonstrate that AFD-Instruction consistently improves performance across diverse antibody-related tasks. By linking antibody sequences with…
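The two components named in the abstract can be pictured as two directions over the same sequence-function pair. The following is a minimal sketch, not the dataset's actual schema: the class `InstructionRecord`, its field names, the prompt wording, and the toy sequence fragment are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionRecord:
    """Hypothetical record layout; the real AFD-Instruction schema may differ."""
    task: str         # "understanding" or "design", the two components in the abstract
    instruction: str  # natural-language prompt (wording here is an assumption)
    input: str        # sequence for understanding; functional constraint for design
    output: str       # functional description for understanding; sequence for design


def make_understanding(sequence: str, function_text: str) -> InstructionRecord:
    # Understanding: infer functional attributes from a sequence.
    return InstructionRecord(
        task="understanding",
        instruction="Describe the function of this antibody sequence.",
        input=sequence,
        output=function_text,
    )


def make_design(function_text: str, sequence: str) -> InstructionRecord:
    # Design: generate a sequence under a functional constraint.
    return InstructionRecord(
        task="design",
        instruction="Design an antibody sequence with the following function.",
        input=function_text,
        output=sequence,
    )


# Toy values only; neither string is taken from the dataset.
toy_sequence = "EVQLVESGG"
toy_function = "binds a hypothetical target antigen"

understanding = make_understanding(toy_sequence, toy_function)
design = make_design(toy_function, toy_sequence)
```

In this sketch the same pair yields one record per direction, which is one simple way to read the abstract's "explicit sequence-function alignment".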

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. While instruction tuning for general proteins exists (e.g., InstructProtein, Mol-Instructions), this paper correctly identifies antibodies as a unique and challenging class that existing resources do not adequately cover.

Weaknesses

1. The models (QwenAB, LLaMAB) show massive performance gains on classification tasks (e.g., in Table 1, 87.81% on Binding for QwenAB vs. 46.59% for GPT-4o). This dramatic improvement suggests the model might be learning spurious correlations or "template-fitting" rather than generalizable biological reasoning. The Understanding evaluation needs a more challenging, held-out test set to prove generalization.

2. The dataset is built by sampling 4,305 antibody entries from PDB and SAbDab. These dat

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The paper introduces AFD-Instruction, the first large-scale instruction dataset for antibodies that pairs antibody sequences with structured natural language functional descriptions. This resource fills a key gap left by previous sequence-only datasets lacking functional or semantic supervision, enabling models to learn the mapping between sequence and function.

2. The authors propose a multi-agent pipeline (comprising Mr. Extractor, Dr. Mechanism, and Prof. Function) and combine self-questionin

Weaknesses

1. Although the AFD-Instruction dataset is large and comprehensive, its functional annotations mainly rely on literature-derived and database-extracted descriptions. These primarily cover common mechanisms such as neutralization, blocking, and binding-site recognition, but lack more fine-grained or dynamic functional information such as epitope escape, affinity modulation, or immune regulation.

2. The paper only validates its approach using two model families, Qwen and LLaMA, which limits the gen

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. High-quality domain focus. Antibody-specific dataset aligning sequences, functional text, and design tasks, and it is the first of its kind.

2. Comprehensive pipeline. Multi-agent extraction and self-questioning expansion show thoughtful integration of automation and expert verification.

3. Rich evaluation. Covers both understanding and design tasks, with structural and energetic validation using tFold + Rosetta.

4. Substantial empirical gains. Instruction-tuned QwenAB/LLaMAB outperforms gene

Weaknesses

### W1 Limited methodological novelty.

The dataset-generation procedure largely extends patterns from Mol-Instructions and Evola (instruction synthesis via LLM prompting + self-questioning). The multi-agent framing repackages established extract–verify–summarize steps; conceptual contribution is modest beyond domain adaptation.

### W2 Small biological diversity, large linguistic inflation.

The dataset derives 430 K instructions from only 4.3 K antibodies, meaning linguistic diversity far exceeds
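The ratio behind W2 can be made explicit with a short calculation. This is a sketch using only the rounded counts quoted in the reviews (430 K instructions, 4.3 K antibodies, and the 4,305 entries cited by Reviewer 01); the helper name `instructions_per_antibody` is hypothetical, and because 430 K is itself rounded, the results are approximate.

```python
def instructions_per_antibody(n_instructions: int, n_antibodies: int) -> float:
    """Average number of instructions derived from each antibody entry."""
    if n_antibodies <= 0:
        raise ValueError("n_antibodies must be positive")
    return n_instructions / n_antibodies


# Rounded figures from Reviewer 03 (W2).
ratio_rounded = instructions_per_antibody(430_000, 4_300)

# Same instruction count against the 4,305 entries quoted by Reviewer 01.
ratio_entries = instructions_per_antibody(430_000, 4_305)

print(f"{ratio_rounded:.1f} instructions per antibody (rounded counts)")
print(f"{ratio_entries:.1f} instructions per antibody (4,305 entries)")
```

Under these figures, each antibody contributes on the order of 100 instructions on average.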

Taxonomy

Topics: Monoclonal and Polyclonal Antibodies Research · Vaccines and Immunoinformatics Approaches · Biochemical and Structural Characterization