On the Effect of Token Merging on Pre-trained Models for Code
Mootez Saad, Hao Li, Tushar Sharma, Ahmed E. Hassan

TL;DR
This paper explores token merging strategies to reduce computational overhead in code language models, analyzing their impact on efficiency and task performance across multiple models and tasks.
Contribution
The paper proposes two strategies for merging the hidden representations of subtokens in code models, one based on averaging and one learning-based, and evaluates their effect on efficiency and accuracy across six models and three tasks.
Findings
Token merging reduced floating-point operations by 1% to 19%.
Vulnerability detection showed a minor drop in performance.
Code translation performance improved by 2.47 points.
Abstract
Tokenization is a fundamental component of language models for code. It involves breaking down the input into units that are later passed to the language model stack to learn high-dimensional representations used in various contexts, from classification to generation. However, the output of these tokenizers is often longer than that traditionally used in compilers and interpreters. This could result in undesirable effects, such as increased computational overhead. In this work, we investigate the effect of merging the hidden representations of subtokens that belong to the same semantic unit, such as subtokens that form a single identifier. We propose two strategies: one based on averaging the representations and another that leverages a learning-based approach. Both methods can be seamlessly integrated with existing language models for code. We conduct experiments using six language…
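To make the averaging-based strategy concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes each subtoken already carries a group id marking the semantic unit it belongs to (for example, all pieces of one identifier share an id), and it replaces each run of same-group subtokens with the mean of their hidden vectors. The function name `merge_subtokens_by_mean`, the `group_ids` input, and the toy vectors are all hypothetical; the abstract does not say where in the model stack the merge is applied or how groups are computed.

```python
from typing import List, Sequence


def merge_subtokens_by_mean(
    hidden: Sequence[Sequence[float]],
    group_ids: Sequence[int],
) -> List[List[float]]:
    """Average the hidden vectors of consecutive subtokens that share a group id.

    Illustrative only: `group_ids` is an assumed input that maps each
    subtoken position to the semantic unit (e.g. identifier) it belongs to.
    """
    if len(hidden) != len(group_ids):
        raise ValueError("hidden and group_ids must have the same length")

    merged: List[List[float]] = []
    start = 0
    for end in range(1, len(hidden) + 1):
        # Close the current run when the group id changes or the input ends.
        if end == len(hidden) or group_ids[end] != group_ids[start]:
            run = hidden[start:end]
            dim = len(run[0])
            merged.append(
                [sum(vec[d] for vec in run) / len(run) for d in range(dim)]
            )
            start = end
    return merged


# Toy example (assumed data): three subtokens of one identifier such as
# "get", "User", "Name" share group 0; a following "(" is its own group 1.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 8.0]]
groups = [0, 0, 0, 1]
merged_states = merge_subtokens_by_mean(hidden_states, groups)
print(merged_states)  # [[3.0, 4.0], [0.0, 8.0]]
```

In this toy case the sequence shrinks from four positions to two, which is the mechanism behind the reduced computation: later layers process fewer positions. The learning-based strategy mentioned in the abstract would replace the plain mean with a trained aggregation and is not sketched here.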
