Reflection Pretraining Enables Token-Level Self-Correction in Biological Sequence Models
Xiang Zhang, Jiaqi Wei, Yuejin Yang, Zijie Qiu, Yuhan Chen, Zhiqiang Gao, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Wanli Ouyang, Chenyu You, Siqi Sun

TL;DR
This paper introduces reflection pretraining with auxiliary thinking tokens to enhance reasoning and self-correction in biological sequence models, overcoming expressiveness limitations of protein language models.
Contribution
It proposes a novel reflection pretraining method that expands token expressiveness, enabling reasoning and self-correction in protein and RNA language models.
Findings
Auxiliary thinking tokens enhance the models' reasoning capacity.
Reflection pretraining yields significant performance improvements over standard pretraining.
A theoretical analysis shows that the added tokens increase language expressiveness.
Abstract
Chain-of-Thought (CoT) prompting has significantly advanced task-solving capabilities in natural language processing with large language models. Unlike standard prompting, CoT encourages the model to generate intermediate reasoning steps (non-answer tokens) that help guide it toward more accurate final outputs. These intermediate steps enable more complex reasoning processes such as error correction, memory management, future planning, and self-reflection. However, applying CoT to non-natural-language domains, such as protein and RNA language models, is not yet possible, primarily due to the limited expressiveness of their token spaces (e.g., amino acid tokens). In this work, we propose and define the concept of language expressiveness: the ability of a given language, using its tokens and grammar, to encode information. We show that the limited expressiveness of protein language…
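To make the idea of auxiliary, non-answer tokens concrete, the following is a minimal sketch of how an amino acid vocabulary might be extended with one extra token and how training sequences containing an error followed by a correction could be built. The token name `<reflect>`, the corruption scheme, and all function names are hypothetical illustrations; the abstract shown here does not specify the authors' actual procedure.

```python
import random

# The 20 standard amino acids, one token each.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

# Hypothetical auxiliary token (an assumption, not taken from the paper).
# Here it means "the previous token was wrong; the next token replaces it".
REFLECT = "<reflect>"


def extended_vocab():
    """Amino acid tokens plus the single auxiliary token."""
    return AMINO_ACIDS + [REFLECT]


def add_reflection_examples(seq, error_rate, rng):
    """Build a training token list from a clean sequence.

    With probability `error_rate`, a position is preceded by a wrong
    residue and a REFLECT token, so the model sees error -> correction.
    """
    tokens = []
    for aa in seq:
        if rng.random() < error_rate:
            wrong = rng.choice([a for a in AMINO_ACIDS if a != aa])
            tokens.extend([wrong, REFLECT])
        tokens.append(aa)
    return tokens


def strip_reflections(tokens):
    """Recover the final sequence: each REFLECT retracts the token before it."""
    out = []
    for tok in tokens:
        if tok == REFLECT:
            if out:
                out.pop()
        else:
            out.append(tok)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    noisy = add_reflection_examples("MKTAYIAK", 0.3, rng)
    print(noisy)
    print(strip_reflections(noisy))
```

In this sketch the auxiliary token never appears in the final answer: `strip_reflections` removes it together with the residue it retracts, so the decoded output is still a plain amino acid sequence.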
Taxonomy
Topics: Topic Modeling · Machine Learning in Bioinformatics · AI-based Problem Solving and Planning
