Cognitive Chunking for Soft Prompts: Accelerating Compressor Learning via Block-wise Causal Masking

Guojie Liu; Yiqi Wang; Yanfeng Yang; Wenqi Fan; Songlei Jian; Jianfeng Zhang; Jie Yu

arXiv:2602.13980 · cs.AI · February 17, 2026

TL;DR

This paper introduces Parallelized Iterative Compression (PIC), a soft prompt compression method for large language models that restricts attention to local chunks of the context. The restriction improves downstream performance and speeds up compressor training, especially at high compression ratios.

Contribution

PIC modifies the attention mask in transformers to restrict receptive fields, enabling more effective local context compression and reducing training complexity compared to global compression methods.
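
To make the idea of a restricted receptive field concrete, here is a minimal sketch of a block-wise causal attention mask in NumPy. It is not the authors' implementation. The token layout (each chunk's tokens followed directly by that chunk's memory slots), the function name `block_causal_mask`, and the parameters `num_chunks`, `chunk_len`, and `mem_per_chunk` are all assumptions made for illustration; this summary does not specify how PIC arranges memory tokens.

```python
import numpy as np


def block_causal_mask(num_chunks: int, chunk_len: int, mem_per_chunk: int) -> np.ndarray:
    """Return a boolean mask M where M[q, k] is True if query q may attend to key k.

    Assumed (hypothetical) layout of the sequence:
        [chunk 0 tokens, chunk 0 memory slots, chunk 1 tokens, chunk 1 memory slots, ...]
    A position may attend only to positions in its own block, and only to
    itself or earlier positions (causal order).
    """
    block = chunk_len + mem_per_chunk      # length of one chunk plus its memory slots
    total = num_chunks * block
    pos = np.arange(total)
    block_id = pos // block                # which block each position belongs to

    same_block = block_id[:, None] == block_id[None, :]
    causal = pos[None, :] <= pos[:, None]  # key index <= query index
    return same_block & causal


# Two chunks of three tokens, each followed by one memory slot.
mask = block_causal_mask(num_chunks=2, chunk_len=3, mem_per_chunk=1)
full_causal = np.tril(np.ones_like(mask, dtype=bool))  # ordinary global causal mask
print(mask.astype(int))
print("allowed pairs, block-wise:", int(mask.sum()), "vs global causal:", int(full_causal.sum()))
```

Under this layout, the memory slot of a chunk sees only that chunk's tokens, so blocks do not depend on one another and could in principle be processed in parallel. A global compressor would instead use the full lower-triangular mask.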

Findings

01 PIC outperforms baselines on downstream tasks, especially at high compression ratios.

02 On QA tasks at 64x compression, PIC improves F1 by up to 29.8% and exact match (EM) by up to 40.7%.

03 Training the compressor takes approximately 40% less time.

Abstract

Providing extensive context via prompting is vital for leveraging the capabilities of Large Language Models (LLMs). However, lengthy contexts significantly increase inference latency, as the computational cost of self-attention grows quadratically with sequence length. To mitigate this issue, context compression, particularly soft prompt compression, has emerged as a widely studied solution: a trained compressor converts long contexts into shorter memory embeddings. Existing methods typically compress the entire context indiscriminately into a set of memory tokens, which requires the compressor to capture global dependencies and necessitates extensive pre-training data to learn effective patterns. Inspired by the chunking mechanism in human working memory and by empirical observations of the spatial specialization of memory embeddings relative to original tokens, we propose Parallelized…

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Machine Learning in Healthcare