HashFormers: Towards Vocabulary-independent Pre-trained Transformers
Huiyin Xue, Nikolaos Aletras

TL;DR
HashFormers introduce a vocabulary-independent pre-trained transformer architecture that uses hashing to significantly reduce memory usage while maintaining competitive performance on text classification tasks.
Contribution
This work presents HashFormers, a novel pre-trained transformer model that employs hashing functions to eliminate the need for large embedding matrices, enabling unlimited vocabulary support.
Findings
HashFormers are more memory efficient than standard pre-trained transformers.
They achieve comparable performance when fine-tuned on text classification tasks.
The most efficient variant uses only 99.1K parameters to represent the embeddings, with minimal performance loss.
Abstract
Transformer-based pre-trained language models are vocabulary-dependent, mapping by default each token to its corresponding embedding. This one-to-one mapping results in embedding matrices that occupy a lot of memory (i.e. millions of parameters) and grow linearly with the size of the vocabulary. Previous work on on-device transformers dynamically generates token embeddings on-the-fly without embedding matrices, using locality-sensitive hashing over morphological information. These embeddings are subsequently fed into transformer layers for text classification. However, these methods are not pre-trained. Inspired by this line of work, we propose HashFormers, a new family of vocabulary-independent pre-trained transformers that support an unlimited vocabulary (i.e. all possible tokens in a corpus) given a substantially smaller fixed-sized embedding matrix. We achieve this by first…
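The idea of serving an unbounded vocabulary from a fixed-size embedding matrix can be illustrated with a generic hashing-trick sketch. This is not the authors' method (the abstract is truncated before it describes their approach); the MD5 hash, the table size `NUM_BUCKETS`, the width `EMBED_DIM`, and the helper names `token_to_bucket` and `embed` are all assumptions chosen for illustration.

```python
import hashlib
import random

NUM_BUCKETS = 1024  # hypothetical number of rows in the fixed embedding table
EMBED_DIM = 8       # hypothetical embedding width


def token_to_bucket(token: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map any string to a row index using a deterministic hash (MD5 here)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


# The table has NUM_BUCKETS * EMBED_DIM parameters no matter how many
# distinct tokens appear in the corpus. Values are random placeholders;
# in a real model they would be learned.
rng = random.Random(0)
embedding_table = [
    [rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
    for _ in range(NUM_BUCKETS)
]


def embed(tokens):
    """Look up one vector per token; unseen tokens need no special handling."""
    return [embedding_table[token_to_bucket(t)] for t in tokens]


vectors = embed(["hash", "formers", "never-seen-before-token"])
param_count = NUM_BUCKETS * EMBED_DIM
```

The trade-off in this sketch is that distinct tokens can collide in the same bucket and share a vector, in exchange for a parameter count that no longer grows with the vocabulary.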
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
