Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs

Kai Zhuang; Jiawei Zhang; Yumou Liu; Hanqun Cao; Chunbin Gu; Mengdi Liu; Zhangyang Gao; Zitong Jerry Wang; Xuanhe Zhou; Pheng-Ann Heng; Lijun Wu; Conghui He; Cheng Tan

arXiv:2510.23127·cs.AI·October 31, 2025

TL;DR

This paper demonstrates that providing scientific large language models with high-level structured bioinformatics context significantly improves biological reasoning, surpassing raw sequence input, and suggests a paradigm shift towards reasoning over expert knowledge.

Contribution

It introduces a novel approach of using structured bioinformatics context with Sci-LLMs, showing superior reasoning performance over sequence-based inputs.

Findings

01 · Context-only input outperforms sequence-only and combined modes.

02 · Raw sequences act as informational noise for models.

03 · Reframes Sci-LLMs as reasoning engines over structured knowledge.

Abstract

Scientific Large Language Models (Sci-LLMs) have emerged as a promising frontier for accelerating biological discovery. However, these models face a fundamental challenge when processing raw biomolecular sequences: the tokenization dilemma. Whether treating sequences as a specialized language, risking the loss of functional motif information, or as a separate modality, introducing formidable alignment challenges, current strategies fundamentally limit their reasoning capacity. We challenge this sequence-centric paradigm by positing that a more effective strategy is to provide Sci-LLMs with high-level structured context derived from established bioinformatics tools, thereby bypassing the need to interpret low-level noisy sequence data directly. Through a systematic comparison of leading Sci-LLMs on biological reasoning tasks, we tested three input modes: sequence-only, context-only, and…
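
The input modes being compared can be pictured as different ways of assembling one prompt. The following is a minimal sketch, not the authors' code: the `Example` fields, the `build_prompt` helper, the section labels, and the mode name `"combined"` (taken from the Findings summary on this page) are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One benchmark item. Field names are hypothetical."""
    question: str
    sequence: str  # raw amino-acid or nucleotide string
    context: str   # text summary produced by bioinformatics tools


def build_prompt(ex: Example, mode: str) -> str:
    """Assemble a prompt for one of three assumed input modes."""
    if mode not in ("sequence-only", "context-only", "combined"):
        raise ValueError(f"unknown mode: {mode}")
    parts = []
    # The raw sequence is included only when the mode asks for it.
    if mode in ("sequence-only", "combined"):
        parts.append(f"Sequence:\n{ex.sequence}")
    # The tool-derived context is included only when the mode asks for it.
    if mode in ("context-only", "combined"):
        parts.append(f"Context:\n{ex.context}")
    parts.append(f"Question:\n{ex.question}")
    return "\n\n".join(parts)
```

In this sketch the context-only prompt never contains the raw sequence, so the model sees only the tool-derived text and the question.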

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The paper's "tokenization dilemma" concept is a really clear and smart way to frame a major hurdle for Sci-LLMs. The main idea, that feeding LLMs text context from tools like BLAST is better than giving them the raw sequence, is surprising but backed up well by the experiments. The finding that raw sequences just add "noise" and make things worse is a big deal. The visualizations (like in Figure 3) showing how alignment fails are also very convincing. This work is important because it questions t

Weaknesses

The main drawback, which the authors rightly point out, is that this method can't handle mutation effect prediction. The bio-tools (BLAST, etc.) used to create the context just aren't sensitive to tiny, single-point changes, so the context for a normal protein and the context for its mutant look the same. This is a major limitation, as it rules out a big area of computational biology. Also, the claims about it working on DNA are mostly tucked away in the appendix, not fully explored in the main paper.
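
The mutation-insensitivity point can be illustrated with a toy example. This is a sketch under loud assumptions: `difflib.SequenceMatcher` stands in for a real similarity search (it is not BLAST), and the reference sequences and annotation strings are invented for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical reference database: sequence -> annotation text.
REFERENCE_DB = {
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ": "toy annotation A",
    "GSHMLEDPVAGKWTRRNNPLQYHCDDEFMMSSA": "toy annotation B",
}


def toy_context(seq: str) -> str:
    """Return the annotation of the most similar reference sequence."""
    best = max(
        REFERENCE_DB,
        key=lambda ref: SequenceMatcher(None, seq, ref).ratio(),
    )
    return REFERENCE_DB[best]


wild_type = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
# Single-point substitution at index 10 (Q -> A).
mutant = wild_type[:10] + "A" + wild_type[11:]

# Both sequences hit the same reference, so the derived context is identical.
same_context = toy_context(wild_type) == toy_context(mutant)
```

Because the best hit does not change after one substitution, a context-only model in this toy setup would receive the same text for both sequences and could not tell them apart.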

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Overall: The paper aims to address a novel challenge in the Sci-LLM space, making a case that Sci-LLMs are better served as “reasoning engines over expert knowledge”, rather than pure sequence decoders. While this is noted and there is some evidence that this is the case, it does raise some circular logic around the quality of the annotations derived from the bioinformatics knowledgebases (addressed below in the weaknesses).
2. Generalizability: The solution is generalizable, with applica

Weaknesses

1. Circular Logic: The approach works well when high-quality annotations exist, yet the solution also exists to propose annotations to fill in knowledge gaps. This counter-intuitively raises a bit of a “Catch-22” scenario.
2. Core Argument: The basis of the manuscript suggests that there is in fact valuable information encoded within the evolutionary language through sequence tokens, yet the results suggest the opposite: that human context exclusively drives the value.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

Pros:
- The authors proposed a new “context-only” method, which achieved significantly
- The context-driven approach achieves good performance.

Weaknesses

Cons:
- The context-only approach sounds interesting. However, compared with raw biomolecular sequence input, an inevitable drawback of this approach is significant information loss (from discarding too much detailed information).
- The capability of this approach is capped by the bioinformatics tools being used, e.g., InterProScan and BLAST.
- As the context-only model relies mostly on priors, it may not be a good tool for exploring “novel” findings (which may be somewhat out of distribution).
- Why

Code & Models

Datasets

OpenRaiser/CoKE
dataset· 68 dl
