T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

Björn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, Samuel Weinbach

arXiv:2406.19223·cs.CL·January 8, 2025

TL;DR

T-FREE introduces a tokenizer-free approach for generative language models that uses sparse character triplet representations, significantly reducing parameters and improving cross-lingual transfer without relying on traditional tokenizers.

Contribution

The paper presents T-FREE, a novel subword tokenizer-free method that embeds words via sparse patterns over character triplets, enhancing memory efficiency and cross-lingual performance.
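
As a rough illustration of the idea, the following Python sketch maps a word to a sparse set of active rows through its character triplets and sums those rows into a single embedding. The whitespace padding, the SHA-256-based hashing, the number of hashes per triplet, and the table size are all assumptions chosen for illustration; the summary here does not specify them, and this is not the authors' implementation.

```python
import hashlib

import numpy as np


def trigrams(word):
    """Split a word into overlapping character triplets.

    Padding with one space on each side is an assumption made for this sketch.
    """
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def active_rows(word, num_rows=8000, hashes_per_trigram=4):
    """Return the sorted set of embedding rows activated by a word.

    Each triplet is hashed several times into [0, num_rows). The hash
    function and both parameters are hypothetical choices.
    """
    rows = set()
    for tri in trigrams(word):
        for k in range(hashes_per_trigram):
            digest = hashlib.sha256(f"{k}:{tri}".encode("utf-8")).digest()
            rows.add(int.from_bytes(digest[:8], "big") % num_rows)
    return sorted(rows)


def embed(word, table):
    """Sum the activated rows of the embedding table into one vector."""
    idx = np.array(active_rows(word, num_rows=table.shape[0]), dtype=np.int64)
    return table[idx].sum(axis=0)


rng = np.random.default_rng(0)
table = rng.normal(size=(8000, 16))  # small hypothetical embedding table

# Words that share triplets also share active rows, so related word forms
# get overlapping representations without any learned vocabulary.
shared = set(active_rows("walk")) & set(active_rows("walking"))
print(trigrams("walk"), len(shared), embed("walk", table).shape)
```

Because "walk" and "walking" share the triplets " wa", "wal", and "alk", their activation patterns overlap, which is one way to picture how such a scheme could exploit morphological similarity.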

Findings

1. Achieves a parameter reduction of more than 85% in the embedding and head layers.
2. Maintains competitive downstream task performance.
3. Shows significant improvements in cross-lingual transfer learning.
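
To make the scale of the first finding concrete, the short Python calculation below compares the parameter count of two embedding tables with different numbers of rows. The row counts (64,000 and 8,000) and the hidden size (4,096) are hypothetical values picked for the arithmetic, not figures taken from this page.

```python
def embedding_params(num_rows, hidden_size):
    """Parameter count of a dense embedding table: one vector per row."""
    return num_rows * hidden_size


def reduction(baseline_rows, compact_rows, hidden_size):
    """Fraction of parameters saved by shrinking the table's row count."""
    base = embedding_params(baseline_rows, hidden_size)
    compact = embedding_params(compact_rows, hidden_size)
    return 1.0 - compact / base


# Hypothetical sizes: a 64,000-row subword table vs. an 8,000-row table.
r = reduction(64_000, 8_000, 4096)
print(f"{r:.1%}")  # 87.5%
```

The hidden size cancels out of the ratio, so the saving depends only on how many rows the table needs.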

Abstract

Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets, and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant…

Code & Models

Repositories

aleph-alpha/trigrams (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques