ProtTeX-CC: Activating In-Context Learning in Protein LLM via Two-Stage Instruction Compression

Chuanliu Fan; Zicheng Ma; Jun Gao; Nan Yu; Jun Zhang; Ziqiang Cao; Yi Qin Gao; Guohong Fu

arXiv:2508.12212 · cs.LG · August 19, 2025

Open Access

TL;DR

ProtTeX-CC introduces a two-stage compression framework that significantly reduces input length and enhances in-context learning capabilities of protein language models without altering the core model.

Contribution

It proposes a novel two-stage compression method that fuses sequence and structure data and aggregates demonstrations, improving ProtTeX's efficiency and generalization in few-shot settings.

Findings

1. Reduces protein input length by 50% with no performance loss.
2. Achieves over 93% compression of demonstration prompts.
3. Improves in-domain accuracy by 2% and out-of-domain by 11%.
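
The 50% figure follows from simple length arithmetic: concatenating per-residue sequence and structure tokens gives about 2L positions for an L-residue protein, while fusing the two modalities residue by residue gives L. The sketch below illustrates only that arithmetic. The toy embedding tables, the structure token IDs, and the element-wise averaging used as the fusion operator are all assumptions made for illustration, not the paper's actual mechanism.

```python
import random

DIM = 4  # toy embedding width (assumption)
rng = random.Random(0)


def make_table(vocab):
    """Build a toy embedding table: one random vector per vocabulary item."""
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(DIM)] for tok in vocab}


def embed(tokens, table):
    """Look up the embedding vector for each token."""
    return [table[tok] for tok in tokens]


def concat_input(seq_emb, struct_emb):
    """Baseline layout: sequence positions followed by structure positions (2L)."""
    return seq_emb + struct_emb


def fused_input(seq_emb, struct_emb):
    """Residue-aligned fusion: one vector per residue (L positions).

    Element-wise averaging is a stand-in chosen for this sketch only.
    """
    if len(seq_emb) != len(struct_emb):
        raise ValueError("sequence and structure must have one token per residue")
    return [
        [(a + b) / 2 for a, b in zip(seq_vec, struct_vec)]
        for seq_vec, struct_vec in zip(seq_emb, struct_emb)
    ]


# Hypothetical 8-residue protein: amino-acid letters plus made-up structure token IDs.
seq_tokens = list("MKTAYIAK")
struct_tokens = [3, 17, 17, 42, 8, 8, 3, 99]

seq_emb = embed(seq_tokens, make_table(set(seq_tokens)))
struct_emb = embed(struct_tokens, make_table(set(struct_tokens)))

baseline = concat_input(seq_emb, struct_emb)
fused = fused_input(seq_emb, struct_emb)
reduction = 1 - len(fused) / len(baseline)

print(len(baseline), len(fused), reduction)
```

Because the fused layout keeps one position per residue, position i carries both the sequence and the structure information for residue i, which is the residue-level alignment that plain concatenation loses.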

Abstract

Recent advances in protein large language models, such as ProtTeX, represent both side-chain amino acids and backbone structure as discrete token sequences of residue length. While this design enables unified modeling of multimodal protein information, it suffers from two major limitations: (1) The concatenation of sequence and structure tokens approximately doubles the protein length and breaks the intrinsic residue-level alignment between modalities. (2) Constrained by the training corpus and limited context window, ProtTeX is typically trained on single-protein inputs, rendering it incompatible with in-context learning (ICL) and thus limiting its generalization capability. To address these issues, we propose ProtTeX-CC, a lightweight two-stage compression framework designed to enhance ProtTeX under few-shot settings. We first design a joint embedding compression mechanism that fuses…

Taxonomy

Topics: Genomics and Phylogenetic Studies