CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling
Yu Bai, Xiyuan Zou, Heyan Huang, Sanxing Chen, Marc-Antoine Rondeau, Yang Gao, Jackie Chi Kit Cheung

TL;DR
CItruS improves long sequence modeling by evicting hidden states from the key-value cache according to their relevance to the downstream task, which improves task performance without sacrificing language modeling perplexity.
Contribution
It introduces Chunked Instruction-aware State Eviction (CItruS), a training-free technique that incorporates attention preferences into state eviction for better downstream task performance.
Findings
Outperforms strong baselines on long sequence comprehension tasks.
Maintains language modeling perplexity while improving downstream task results.
Chunked sequence processing further improves efficiency.
Abstract
Long sequence modeling has gained broad interest as large language models (LLMs) continue to advance. Recent research has identified that a large portion of hidden states within the key-value caches of Transformer models can be discarded (also termed evicted) without affecting the perplexity performance in generating long sequences. However, we show that these methods, despite preserving perplexity performance, often drop information that is important for solving downstream tasks, a problem which we call information neglect. To address this issue, we introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states. In addition, we design a method for chunked sequence processing to further improve efficiency. Our training-free method exhibits superior…
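The eviction idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single attention head, assumes cached states are scored by the attention they receive from the instruction's query vectors, and uses hypothetical names (`evict`, `process_in_chunks`, `budget`, `instruction_queries`) chosen for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(logits)


def evict(cache, queries, budget):
    """Keep the `budget` cached (position, key) pairs that receive the most
    total attention from `queries`; drop (evict) the rest."""
    if len(cache) <= budget:
        return list(cache)
    keys = [key for _, key in cache]
    scores = [0.0] * len(cache)
    for query in queries:
        for i, weight in enumerate(attention_weights(query, keys)):
            scores[i] += weight
    keep = sorted(range(len(cache)), key=lambda i: scores[i], reverse=True)[:budget]
    # Restore positional order among the survivors.
    return [cache[i] for i in sorted(keep)]


def process_in_chunks(keys, chunk_size, budget, instruction_queries):
    """Read the sequence chunk by chunk, evicting after every chunk so the
    cache never holds more than `budget` states between chunks.

    Assumption: eviction is scored with the instruction's queries rather
    than with the queries of the text currently being read."""
    cache = []
    for start in range(0, len(keys), chunk_size):
        end = min(start + chunk_size, len(keys))
        chunk = [(pos, keys[pos]) for pos in range(start, end)]
        cache = evict(cache + chunk, instruction_queries, budget)
    return cache
```

In this toy setup, a state that the instruction attends to strongly survives every eviction round, even if the surrounding text never attends to it. Scoring with the current chunk's own queries instead would correspond to the perplexity-oriented eviction that the abstract says can drop task-relevant information.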
Taxonomy
Topics: Neural Networks and Applications · Time Series Analysis and Forecasting · Parallel Computing and Optimization Techniques
Methods: Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
