ProtTeX-CC: Activating In-Context Learning in Protein LLM via Two-Stage Instruction Compression
Chuanliu Fan, Zicheng Ma, Jun Gao, Nan Yu, Jun Zhang, Ziqiang Cao, Yi Qin Gao, Guohong Fu

TL;DR
ProtTeX-CC introduces a two-stage compression framework that halves protein input length, compresses demonstration prompts by over 93%, and enhances the in-context learning capabilities of the ProtTeX protein language model without altering the core model.
Contribution
The paper proposes a two-stage compression method: the first stage fuses sequence and structure data, and the second aggregates demonstrations. Together they improve ProtTeX's efficiency and its generalization in few-shot settings.
Findings
Reduces protein input length by 50% with no performance loss.
Achieves over 93% compression of demonstration prompts.
Improves in-domain accuracy by 2% and out-of-domain accuracy by 11%.
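As a rough illustration of what these ratios mean for a few-shot prompt, the sketch below estimates token counts before and after compression. Only the 50% and 93% figures come from the findings above. The residue count, the number of demonstrations, the function names, and the simplification that a demonstration consists of protein tokens only are all hypothetical, and 93% is treated as a flat rate even though the summary states "over 93%".

```python
def protein_tokens(num_residues: int, fused: bool) -> int:
    """Tokens for one protein.

    Unfused: sequence and structure tokens are concatenated (2 per residue).
    Fused: one position per residue, i.e. the 50% reduction.
    """
    return num_residues if fused else 2 * num_residues


def compressed_tokens(raw_tokens: int, percent_removed: int = 93) -> int:
    """Tokens left after removing `percent_removed` percent, rounded up.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    remaining_hundredths = raw_tokens * (100 - percent_removed)
    return -(-remaining_hundredths // 100)  # ceiling division


def prompt_tokens(num_demos: int, demo_tokens: int, query_tokens: int) -> int:
    """Total prompt length: all demonstrations plus the query protein."""
    return num_demos * demo_tokens + query_tokens


# Hypothetical setting: 300-residue proteins, 4 demonstrations.
NUM_RESIDUES = 300
NUM_DEMOS = 4

raw_protein = protein_tokens(NUM_RESIDUES, fused=False)
fused_protein = protein_tokens(NUM_RESIDUES, fused=True)

# Assumption: each demonstration is counted as protein tokens only.
baseline = prompt_tokens(NUM_DEMOS, raw_protein, raw_protein)
compressed = prompt_tokens(NUM_DEMOS, compressed_tokens(raw_protein), fused_protein)

print(baseline, compressed)
```

Under these assumptions the demonstrations shrink to a small fraction of the prompt, so the query protein dominates the remaining length. Real prompts also carry question and answer text, which this estimate ignores.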
Abstract
Recent advances in protein large language models, such as ProtTeX, represent both side-chain amino acids and backbone structure as discrete token sequences of residue length. While this design enables unified modeling of multimodal protein information, it suffers from two major limitations: (1) The concatenation of sequence and structure tokens approximately doubles the protein length and breaks the intrinsic residue-level alignment between modalities. (2) Constrained by the training corpus and limited context window, ProtTeX is typically trained on single-protein inputs, rendering it incompatible with in-context learning (ICL) and thus limiting its generalization capability. To address these issues, we propose ProtTeX-CC, a lightweight two-stage compression framework designed to enhance ProtTeX under few-shot settings. We first design a joint embedding compression mechanism that fuses…
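The first limitation can be made concrete with a small sketch that contrasts concatenating the two modalities with combining them position by position. The abstract excerpt is cut off before it describes the actual fusion mechanism, so the element-wise average used here is purely a placeholder and not the authors' method. The function names and toy embeddings are likewise hypothetical.

```python
from typing import List

Vector = List[float]


def concatenate(seq_emb: List[Vector], struct_emb: List[Vector]) -> List[Vector]:
    """Baseline layout: all sequence positions, then all structure positions.

    Residue i's sequence embedding sits at index i, while its structure
    embedding sits at index len(seq_emb) + i, so the two are far apart.
    """
    return seq_emb + struct_emb


def fuse_per_residue(seq_emb: List[Vector], struct_emb: List[Vector]) -> List[Vector]:
    """Placeholder fusion: average the two embeddings of each residue.

    This keeps one position per residue and preserves residue-level alignment.
    The averaging itself is an assumption made for illustration.
    """
    if len(seq_emb) != len(struct_emb):
        raise ValueError("both modalities must have one embedding per residue")
    return [
        [(a + b) / 2 for a, b in zip(s, t)]
        for s, t in zip(seq_emb, struct_emb)
    ]


# Toy protein with 3 residues and 2-dimensional embeddings.
seq_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
struct_emb = [[0.0, 2.0], [2.0, 0.0], [1.0, 3.0]]

concatenated = concatenate(seq_emb, struct_emb)
fused = fuse_per_residue(seq_emb, struct_emb)

print(len(concatenated), len(fused))
```

The concatenated layout has twice as many positions as residues, while the fused layout has exactly one per residue, which is the length relationship the abstract describes.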
Taxonomy
Topics: Genomics and Phylogenetic Studies
