DeCAL Tokenwise Compression

Sameer Panwar

arXiv:2508.08514·cs.CL·October 23, 2025

Open Access · 3 Reviews

TL;DR

DeCAL is a tokenwise compression method that uses a denoising pretrained encoder-decoder model to produce high-quality compressed representations, achieving comparable performance to uncompressed models at 2x compression and minor drops up to 8x across various NLP tasks.

Contribution

DeCAL introduces a novel tokenwise compression technique with minimal encoder modifications, enabling efficient dense representations without significant performance loss.

Findings

01

DeCAL at 2x compression matches uncompressed performance on several tasks.

02

Up to 8x compression, usually only minor drops in metrics are observed.

03

Significant savings where pre-computed dense representations can be used.

Abstract

This paper introduces DeCAL, a new method for tokenwise compression. DeCAL uses an encoder-decoder language model pretrained with denoising to learn to produce high-quality, general-purpose compressed representations from the encoder. DeCAL applies small modifications to the encoder, with the emphasis on maximizing compression quality, even at the expense of compute. We show that DeCAL at 2x compression can match uncompressed on several downstream tasks, with usually only a minor dropoff in metrics up to 8x compression, among question-answering, summarization, and multi-vector retrieval tasks. DeCAL offers significant savings where pre-computed dense representations can be utilized, and we believe the approach can be further developed to be more broadly applicable.
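
To make the compression ratios concrete, the sketch below works out how many vectors would be stored per document at each ratio, and the resulting size of a pre-computed index. It is only an illustration of the arithmetic, not the paper's implementation: the function names, the rounding-up rule, the hidden size of 768, and float32 storage are all assumptions chosen for the example.

```python
import math


def compressed_length(num_tokens: int, ratio: int) -> int:
    """Vectors kept when `num_tokens` are compressed by `ratio`.

    Rounding up is an assumption made for this sketch.
    """
    if num_tokens < 0 or ratio < 1:
        raise ValueError("num_tokens must be >= 0 and ratio must be >= 1")
    return math.ceil(num_tokens / ratio)


def storage_bytes(num_tokens: int, ratio: int,
                  hidden_dim: int = 768, bytes_per_value: int = 4) -> int:
    """Bytes needed to store one document's pre-computed representations.

    hidden_dim=768 and 4-byte floats are illustrative defaults,
    not values taken from the paper.
    """
    return compressed_length(num_tokens, ratio) * hidden_dim * bytes_per_value


if __name__ == "__main__":
    # Compare the uncompressed case (1x) with the ratios named in the abstract.
    for ratio in (1, 2, 8):
        n = compressed_length(512, ratio)
        size = storage_bytes(512, ratio)
        print(f"{ratio}x: {n} vectors, {size} bytes")
```

Under these assumptions, storage for pre-computed representations shrinks in proportion to the compression ratio, which is the setting the abstract points to (for example, multi-vector retrieval).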

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 5

Strengths

- Performed experiments on several benchmark tasks.
- Showed variations across different compression ratios.
- Compared with other compression methods.
- The paper is well written and easy to follow.

Weaknesses

- The comparison is limited to CCProxy and AttnPool 2x. The justification for the CCProxy hyperparameters is not good enough (variations in the models should be used; an ablation is needed). Existing methods like NUGGET show much higher performance. Also, the experiments are shown on only two datasets. An ablation is needed.
- What if the base model is changed from T5? Is this method generalizable to other architectures? An ablation is needed.
- For summarization tasks, more useful metrics should be used.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

Overall, I believe that this paper is borderline regarding acceptance to ICLR. On the positive side, I think that the paper is well written and easy to follow. The proposed method is simple and sound, and the practical implementation is solid. I enjoyed the ablation experiments, and believe that their conclusions could impact the future development of learning compressed text representations. More precisely, the paper shows that:
- training an encoder from scratch leads to better results than fine-t

Weaknesses

On the other hand, I also believe that the paper suffers from some limitations. First, as currently proposed in the paper, the DeCAL method is strictly more computationally expensive than the T5 baseline and the other methods used for comparison in the paper. This limits the practical usability of DeCAL, and I think it would make the paper stronger to discuss (and potentially explore) how this could be useful in real-world settings. A second limitation is that, I believe that a different

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The paper summarizes well the different related works on compression of token sequences.
- The paper is easy to follow.
- The paper presents a significant number of experiments: from pre-training with different compression ratios, to experiments with other methods like AttnPool or the custom CCProxy approaches, to a number of finetuning experiments on several tasks including document-based QA, summarization, and retrieval.
- The paper presents a few ablations on modelling choices in section 4.4.
-

Weaknesses

The paper's method lacks novelty:
- Extracting compressed representations with self-attention has been widely used in other works (cited and presented by the authors), like COCOM with CTX tokens as well as ICAE with memory tokens.
- Span denoising tasks with encoder-decoder models have also been widely studied.

The novelty lies in the combination of these two, but a reader would expect a more thorough analysis of why this method could be superior to previous work, what component is critical, what pe

Taxonomy

Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Speech Recognition and Synthesis