TL;DR
This paper introduces Self-Injection, a multi-scale framework with stacked LLMs that compresses and retrieves long-context information efficiently, enabling models trained on 8K tokens to handle inputs over 128K tokens effectively.
Contribution
The paper proposes a novel self-injection architecture with multi-grained context compression and retrieval, reducing memory and computation while extending context window capabilities.
Findings
Self-Injection achieves superior or comparable performance on long-context benchmarks.
It reduces memory footprint and speeds up inference by 2-3 times.
It generalizes effectively to inputs exceeding 128K tokens despite being trained on 8K-token sequences.
Abstract
The limited context window of contemporary large language models (LLMs) remains a primary bottleneck for their broader application across diverse domains. Although continual pre-training on long-context data offers a straightforward solution, it incurs prohibitive data acquisition and computational costs. To address this challenge, we propose SharedLLM, a novel framework based on multi-grained context compression and query-aware information acquisition. SharedLLM comprises two stacked short-context LLMs: a lower model serving as a compressor and an upper model acting as a decoder. The lower model compresses long inputs into compact, multi-grained representations, which are then forwarded to the upper model for context-aware processing. To maximize efficiency, this information transfer occurs exclusively at the lowest layers, bypassing lengthy forward passes and redundant…
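As a rough illustration of what "multi-grained compression" combined with "query-aware" selection could look like, the toy sketch below splits a sequence of hidden-state vectors into chunks, scores each chunk against a query vector, and keeps a finer-grained summary for the most relevant chunks and a coarser one for the rest. Everything here is an assumption made for illustration: mean pooling stands in for the learned lower-model compressor, the dot-product score stands in for whatever retrieval mechanism the paper uses, and the names `split_into_chunks`, `mean_pool`, and `compress_context` are hypothetical. It is not the authors' implementation.

```python
from typing import List

Vector = List[float]


def split_into_chunks(states: List[Vector], chunk_size: int) -> List[List[Vector]]:
    """Split a long sequence of vectors into fixed-size chunks (last may be shorter)."""
    return [states[i:i + chunk_size] for i in range(0, len(states), chunk_size)]


def mean_pool(chunk: List[Vector], group: int) -> List[Vector]:
    """Compress a chunk by averaging every `group` consecutive vectors.

    Assumption: pooling is only a stand-in for a learned compressor.
    """
    pooled = []
    for i in range(0, len(chunk), group):
        block = chunk[i:i + group]
        dim = len(block[0])
        pooled.append([sum(v[d] for v in block) / len(block) for d in range(dim)])
    return pooled


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def compress_context(
    states: List[Vector],
    query: Vector,
    chunk_size: int = 4,
    fine_group: int = 2,    # small pooling window: more vectors kept
    coarse_group: int = 4,  # large pooling window: fewer vectors kept
    top_k: int = 1,
) -> List[Vector]:
    """Return a shortened sequence with finer detail for query-relevant chunks."""
    chunks = split_into_chunks(states, chunk_size)
    # Score each chunk by the similarity of its mean vector to the query
    # (hypothetical relevance measure).
    scores = [dot(mean_pool(c, len(c))[0], query) for c in chunks]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    relevant = set(ranked[:top_k])

    compressed: List[Vector] = []
    for i, chunk in enumerate(chunks):
        group = fine_group if i in relevant else coarse_group
        compressed.extend(mean_pool(chunk, group))
    return compressed


if __name__ == "__main__":
    # Eight 2-d "hidden states": the second half aligns with the query.
    states = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
    query = [0.0, 1.0]
    short = compress_context(states, query)
    print(len(states), "->", len(short), "vectors")
```

In this toy setup the chunk that matches the query is summarized by two vectors and the other chunk by one, so the decoder-side input shrinks while retaining more detail where the query needs it.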