Revisiting subword tokenization: A case study on affixal negation in large language models

Thinh Hung Truong; Yulia Otmakhova; Karin Verspoor; Trevor Cohn; Timothy Baldwin

arXiv:2404.02421·cs.CL·April 5, 2024·1 cites

TL;DR

This paper investigates how different subword tokenization methods affect large language models' ability to understand affixal negation, revealing that models generally recognize negation despite tokenization challenges.

Contribution

It provides a comprehensive analysis of the impact of tokenization on negation understanding in LLMs, highlighting the interaction between tokenization accuracy and negation detection.

Findings

01. Models can reliably recognize affixal negation despite tokenization mismatches.

02. Tokenization performance does not always correlate with negation detection accuracy.

03. Different subword tokenization methods have varying effects on negation sensitivity.

Abstract

In this work, we measure the impact of affixal negation on modern English large language models (LLMs). In affixal negation, the negated meaning is expressed through a negative morpheme, which is potentially challenging for LLMs as their tokenizers are often not morphologically plausible. We conduct extensive experiments using LLMs with different subword tokenization methods, which lead to several insights on the interaction between tokenization performance and negation sensitivity. Despite some interesting mismatches between tokenization accuracy and negation detection performance, we show that models can, on the whole, reliably recognize the meaning of affixal negation.
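
To make the notion of a "morphologically plausible" segmentation concrete, the following is a minimal sketch, not the paper's evaluation code or metric. It assumes a segmentation is given as a list of plain string pieces that concatenate to the word, and it checks only whether a token boundary falls at the edge of the negative affix. The function names and the example segmentations are hypothetical illustrations, not outputs of any particular tokenizer.

```python
from itertools import accumulate


def boundary_offsets(pieces):
    """Return the character offsets where one piece ends and the next begins."""
    return set(accumulate(len(p) for p in pieces[:-1]))


def affix_boundary_preserved(pieces, affix, position="prefix"):
    """Check whether a token boundary falls exactly at the affix edge.

    Assumption: `pieces` are plain substrings (no tokenizer-specific
    markers) whose concatenation is the full word. An affix that is
    itself split into several pieces still counts as preserved here,
    as long as a boundary separates it from the stem.
    """
    word = "".join(pieces)
    if position == "prefix":
        if not word.startswith(affix):
            raise ValueError(f"{word!r} does not start with {affix!r}")
        edge = len(affix)
    elif position == "suffix":
        if not word.endswith(affix):
            raise ValueError(f"{word!r} does not end with {affix!r}")
        edge = len(word) - len(affix)
    else:
        raise ValueError("position must be 'prefix' or 'suffix'")
    return edge in boundary_offsets(pieces)


# Hypothetical segmentations, chosen only to illustrate the check.
print(affix_boundary_preserved(["un", "happy"], "un"))              # boundary after "un"
print(affix_boundary_preserved(["unh", "appy"], "un"))              # boundary cuts into the stem
print(affix_boundary_preserved(["care", "less"], "less", "suffix"))  # boundary before "less"
```

A stricter variant could require the affix to be exactly one token; which definition is appropriate depends on how tokenization accuracy is scored, which this sketch does not attempt to reproduce.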

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling