Chunk-Distilled Language Modeling
Yanhong Li, Karen Livescu, Jiawei Zhou

TL;DR
Chunk-Distilled Language Modeling (CD-LM) enhances large language models by enabling multi-token chunk generation and flexible data adaptation, improving efficiency and control without extra training.
Contribution
The paper introduces CD-LM, a novel method combining retrieval with deep LLMs to generate multi-token chunks and adapt to new data efficiently.
Findings
Improves language model performance across various tasks.
Reduces generation time by producing multi-token chunks.
Allows flexible domain-specific data integration.
Abstract
We introduce Chunk-Distilled Language Modeling (CD-LM), an approach to text generation that addresses two challenges in current large language models (LLMs): the inefficiency of token-level generation, and the difficulty of adapting to new data and knowledge. Our method combines deep network-based LLMs with a straightforward retrieval module, which allows the generation of multi-token text chunks at a single decoding step. Our retrieval framework enables flexible construction of model- or domain-specific datastores, either leveraging the internal knowledge of existing models, or incorporating expert insights from human-annotated corpora. This adaptability allows for enhanced control over the language model's distribution without necessitating additional training. We present the CD-LM formulation along with performance metrics demonstrating its ability to improve language model…
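The decoding idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the datastore layout, the bag-of-words context features, the cosine threshold, and every name below (`build_datastore`, `generate`, `toy_next_token`) are assumptions made purely for illustration. The point is only to show how a retrieval hit can emit a multi-token chunk in one decoding step, while a miss falls back to ordinary token-by-token generation.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical datastore entry: (features of the stored context, chunk to emit).
Entry = Tuple[Counter, List[str]]


def features(context: List[str], window: int = 4) -> Counter:
    """Toy context representation: bag of the last `window` tokens (an assumption)."""
    return Counter(context[-window:])


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_datastore(
    examples: List[Tuple[List[str], List[str]]]
) -> Dict[str, List[Entry]]:
    """Index each (context, chunk) pair by the last token of its context."""
    store: Dict[str, List[Entry]] = {}
    for context, chunk in examples:
        store.setdefault(context[-1], []).append((features(context), chunk))
    return store


def generate(
    prompt: List[str],
    next_token: Callable[[List[str]], str],
    store: Dict[str, List[Entry]],
    threshold: float = 0.8,
    max_tokens: int = 12,
) -> Tuple[List[str], int]:
    """Return (tokens, decoding steps). A retrieval hit appends a whole chunk."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_tokens:
        steps += 1
        current = features(out)
        candidates = store.get(out[-1], [])
        best = max(candidates, key=lambda e: cosine(current, e[0]), default=None)
        if best is not None and cosine(current, best[0]) >= threshold:
            out.extend(best[1])  # multi-token chunk in a single step
        else:
            token = next_token(out)  # ordinary single-token step
            if token == "<eos>":
                break
            out.append(token)
    return out, steps


def toy_next_token(context: List[str]) -> str:
    """Stand-in for a base language model (purely hypothetical)."""
    return {"America": "spoke"}.get(context[-1], "<eos>")


store = build_datastore(
    [("the president of the United".split(), ["States", "of", "America"])]
)
tokens, steps = generate("the president of the United".split(), toy_next_token, store)
print(" ".join(tokens), "| steps:", steps)
```

In this toy run the retrieved chunk covers three of the four generated tokens in a single step. Raising `threshold` makes the sketch fall back to the base model more often, which mirrors the trade-off between speed and fidelity to the base model that the reviews below discuss.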
Peer Reviews
Decision·ICLR 2025 Poster
The proposed CD-LM is well-designed and technically sound. The paper is well-written and easy to follow. The authors also conducted extensive experiments in the appendix, lending additional robustness to their findings.
While the algorithm in Section 3.2 appears reasonable, it cannot ensure the same property as speculative decoding, namely that sampling $x$ from $p(x)$ is equivalent to sampling from $q(x)$. In other words, sampling from the chunk proposal model may introduce a distribution shift, potentially reducing performance. The self-distillation experiments deepen this concern: as shown in Figure 6, saving about 20% of mean token time results in a significant performance drop across many LLMs, as
Originality: The authors introduce a new decoding paradigm that is quite novel and can perform fine-grained, phrase-level grounding to an external source without adding context length as RAG does, all while potentially saving decoding time by skipping decoding steps. In essence, CD-LM introduces a phrase-level cache with fuzzy matching. This cache gives the model the ability to autocomplete from the cache above some confidence threshold. The flexibility of
1. The limitations of the experiments in this paper should be stated more clearly. Namely: the contexts used to construct the chunk datastore / cache align well with the evaluation setting in every experiment in the paper. It is not possible to tell what the pitfalls of the retrieval mechanism might be when it is pushed to its limit with very large collections of tries. For example, in real applications, if CD-LM was used to cache the top 20% of queries with traffic (covering most top
1) Compared with traditional token-level generation methods, CD-LM can generate multiple consecutive chunks in a single decoding step, thereby significantly reducing the number of decoding steps required and reducing inference overhead. This is particularly important when dealing with a large number of long text generation tasks in the current large language model scenario. 2) The authors effectively extend the chunk-based generation method to three very important application scenarios: (1) By
1) Novelty: CD-LM completely follows the Copy Generator (CoG) framework proposed in the paper "Copy is all you need", and its novelty is slightly insufficient. However, as I mentioned in the Strengths part, they successfully extend the application scenarios of CoG. Therefore, I think the contribution of this paper is still good. 2) Lack of important reference: "Nearest Neighbor Speculative Decoding for LLM Generation and Attribution". This paper also proposes applying chunk-level generation mechanisms
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Sparse Evolutionary Training
