DeCAL Tokenwise Compression
Sameer Panwar

TL;DR
DeCAL is a tokenwise compression method that uses an encoder-decoder model pretrained with denoising to produce high-quality compressed representations. At 2x compression it can match uncompressed models on several downstream tasks, with usually only minor drops up to 8x, across question-answering, summarization, and multi-vector retrieval tasks.
Contribution
DeCAL introduces a novel tokenwise compression technique with minimal encoder modifications, enabling efficient dense representations without significant performance loss.
Findings
DeCAL at 2x compression matches uncompressed performance on several tasks.
Up to 8x compression, usually only a minor drop in metrics is observed.
Significant savings where pre-computed dense representations can be used.
Abstract
This paper introduces DeCAL, a new method for tokenwise compression. DeCAL uses an encoder-decoder language model pretrained with denoising to learn to produce high-quality, general-purpose compressed representations from the encoder. DeCAL applies small modifications to the encoder, with the emphasis on maximizing compression quality, even at the expense of compute. We show that DeCAL at 2x compression can match uncompressed on several downstream tasks, with usually only a minor dropoff in metrics up to 8x compression, among question-answering, summarization, and multi-vector retrieval tasks. DeCAL offers significant savings where pre-computed dense representations can be utilized, and we believe the approach can be further developed to be more broadly applicable.
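The 2x to 8x ratios in the abstract translate directly into fewer stored vectors per document, which is where pre-computed dense representations save space. As a rough illustration only (this is not the paper's method), the Python sketch below estimates the number of vectors per document and the index size at several ratios. The helper names `compressed_length` and `index_bytes`, the round-up rule, the 768-dimensional float32 vectors, and the corpus size are all assumptions made for this example, not values taken from the paper.

```python
import math


def compressed_length(num_tokens: int, ratio: int) -> int:
    """Vectors left after tokenwise compression at the given ratio.

    Assumption: a partial final group still yields one vector (round up).
    """
    if num_tokens < 0 or ratio < 1:
        raise ValueError("num_tokens must be >= 0 and ratio must be >= 1")
    return math.ceil(num_tokens / ratio)


def index_bytes(num_docs: int, tokens_per_doc: int, ratio: int,
                dim: int = 768, bytes_per_value: int = 4) -> int:
    """Storage for a pre-computed multi-vector index.

    Assumptions: every document has the same length, and each vector holds
    `dim` float32 values (hypothetical defaults, not from the paper).
    """
    vectors_per_doc = compressed_length(tokens_per_doc, ratio)
    return num_docs * vectors_per_doc * dim * bytes_per_value


if __name__ == "__main__":
    # Hypothetical corpus: 1,000 documents of 512 tokens each.
    for ratio in (1, 2, 4, 8):
        n = compressed_length(512, ratio)
        size = index_bytes(1000, 512, ratio)
        print(f"{ratio}x: {n} vectors/doc, {size / 1e9:.2f} GB")
```

Under these assumptions the storage scales inversely with the compression ratio; the sketch says nothing about encoder compute, which the abstract notes DeCAL does not try to minimize.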
Peer Reviews
Decision · Submitted to ICLR 2026
- Performed experiments on several benchmark tasks.
- Showed variations across different compression ratios.
- Compared with other compression methods.
- The paper is well written and easy to follow.
- The comparison is limited to CCProxy and AttnPool 2x. The justification for the CCProxy hyper-parameters is not good enough (variations in the models should be used; an ablation is needed). Existing methods like NUGGET show much higher performance. Also, the experiments are shown on only two datasets. An ablation is needed.
- What if T5 as the base model is changed? Is this method generalizable to other architectures? An ablation is needed.
- For summarization tasks, more useful metrics should be used.
Overall, I believe that this paper is borderline regarding acceptance to ICLR. On the positive side, I think that the paper is well written and easy to follow. The proposed method is simple and sound, and the practical implementation is solid. I enjoyed the ablation experiments, and believe that their conclusions could impact the future development of learning compressed text representations. More precisely, the paper shows that:
- training an encoder from scratch leads to better results than fine-t
On the other hand, I also believe that the paper suffers from some limitations. First, as currently proposed in the paper, the DeCAL method is strictly more computationally expensive than the T5 baseline and the other methods used for comparison in the paper. This limits the practical usability of DeCAL, and I think the paper would be stronger if it discussed (and potentially explored) how the method could be useful in real-world settings. A second limitation is that, I believe that a different
- The paper summarizes well the different related works on compression of token sequences.
- The paper is easy to follow.
- The paper presents a significant number of experiments: from pre-training with different compression ratios, to experiments with other methods like AttnPool or the custom CCProxy approaches, to a number of finetuning experiments on several tasks including document-based QA, summarization, and retrieval.
- The paper presents a few ablations on modelling choices in section 4.4 -
The paper's method lacks novelty:
- Extracting compressed representations with self-attention has been widely used in other works (cited and presented by the authors), like COCOM with CTX tokens as well as ICAE with memory tokens.
- Span denoising tasks with encoder-decoder models have also been widely studied.
Novelty lies in the combination of these two, but a reader would expect a more thorough analysis of why this method could be superior to previous work, what component is critical, what pe
Taxonomy
Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Speech Recognition and Synthesis
