Incorporating Context into Subword Vocabularies

Shaked Yehezkel; Yuval Pinter

arXiv:2210.07095 · cs.CL · February 13, 2023 · 1 citation

TL;DR

SaGe is a tokenizer that incorporates contextual information during vocabulary creation. It improves language model performance on English GLUE classification and NER, and on Turkish inference and NER, without a large cost in encoding efficiency or domain robustness.

Contribution

This paper introduces SaGe, a context-aware subword tokenizer that keeps token contexts more cohesive than widely used tokenizers and improves downstream model performance in English and Turkish.

Findings

01 SaGe improves performance on English GLUE classification tasks and on English NER.

02 SaGe improves NER and inference performance in Turkish.

03 SaGe does not incur a large cost in encoding efficiency or domain robustness.

Abstract

Most current popular subword tokenizers are trained based on word frequency statistics over a corpus, without considering information about co-occurrence or context. Nevertheless, the resulting vocabularies are used in language models' highly contextualized settings. We present SaGe, a tokenizer that tailors subwords for their downstream use by baking in the contextualized signal at the vocabulary creation phase. We show that SaGe does a better job than current widespread tokenizers in keeping token contexts cohesive, while not incurring a large price in terms of encoding efficiency or domain robustness. SaGe improves performance on English GLUE classification tasks as well as on NER, and on Inference and NER in Turkish, demonstrating its robustness to language properties such as morphological exponence and agglutination.
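The contrast the abstract draws between frequency statistics and context can be illustrated with a toy measurement. The sketch below is not SaGe's training objective, which this page does not describe. It is a minimal illustration under stated assumptions: the helper name `frequency_and_context_entropy`, the hand-made corpus, and the use of neighbor entropy as a stand-in for "context cohesion" are all hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict


def frequency_and_context_entropy(sequences, window=1):
    """Return {token: (count, entropy in bits of its neighbor distribution)}.

    Hypothetical helper for illustration only; not part of SaGe.
    Lower entropy means the token appears in more uniform contexts.
    """
    counts = Counter()
    neighbors = defaultdict(Counter)
    for seq in sequences:
        for i, tok in enumerate(seq):
            counts[tok] += 1
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    neighbors[tok][seq[j]] += 1

    stats = {}
    for tok, n in counts.items():
        total = sum(neighbors[tok].values())
        # Shannon entropy: sum over neighbors of p * log2(1 / p).
        entropy = sum(
            (c / total) * math.log2(total / c) for c in neighbors[tok].values()
        ) if total else 0.0
        stats[tok] = (n, entropy)
    return stats


# Toy corpus of already-segmented words (an assumption, not real data).
corpus = [
    ["un", "happy"], ["un", "happy"], ["un", "happy"], ["un", "happy"],
    ["er", "ror"], ["play", "er"], ["er", "a"], ["riv", "er"],
]
stats = frequency_and_context_entropy(corpus)
```

In this toy corpus, "un" and "er" occur equally often, so a purely frequency-based criterion cannot tell them apart, yet "un" always has the same neighbor while "er" has four different ones. A context-aware criterion has access to that difference.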

Code & Models

Repositories

melelbgu/sage (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification