When Informal Text Breaks NLI: Tokenization Failure, Distribution Shift, and Targeted Mitigations
Avinash Goutham Aluguvelly

TL;DR
This paper investigates how informal surface forms such as slang and emoji affect NLI model accuracy, identifies tokenization failure as a key cause of degradation, and proposes targeted preprocessing and training strategies to mitigate it.
Contribution
It reveals distinct failure modes caused by informal text in NLI and demonstrates effective hybrid mitigation techniques combining preprocessing and augmentation.
Findings
Emoji replacement triggers frequent tokenization failures ([UNK] tokens), destroying the input signal.
Preprocessing normalization recovers emoji accuracy.
Hybrid training improves robustness on informal variants.
Abstract
We study how informal surface forms degrade NLI accuracy in ELECTRA-small (14M) and RoBERTa-large (355M) across four transforms applied to SNLI and MultiNLI: slang substitution, emoji replacement, Gen-Z filler tokens, and their combination. Slang substitution (replacing formal words with informal equivalents, e.g., "going to" -> "gonna", "friend" -> "homie") causes minimal degradation (at most 1.1pp): slang vocabulary falls largely within WordPiece coverage, so the tokenizer handles it without signal loss. Emoji replacement swaps content words for Unicode characters that ELECTRA's WordPiece tokenizer maps to [UNK], destroying the input signal before any learned parameters see it (93.6% of emoji examples contain at least one [UNK], with a mean of 2.91 [UNK] tokens per example). Noise tokens (no cap, deadass, tbh) are fully in-vocabulary but absent from NLI training data, consistent with the model assigning them…
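To make the emoji failure mode concrete, here is a minimal sketch of how an out-of-vocabulary character becomes [UNK] under greedy WordPiece matching, and how text normalization applied before tokenization removes it. Everything in it is an assumption for illustration: the tiny vocabulary, the emoji-to-text map, and the helper names (wordpiece, unk_stats, normalize_emoji) are hypothetical and are not the paper's code, ELECTRA's real vocabulary, or the authors' normalization scheme.

```python
# Toy greedy longest-match WordPiece tokenizer.
# VOCAB is a hypothetical stand-in, not ELECTRA's actual vocabulary.
VOCAB = {"[UNK]", "the", "dog", "is", "happy", "run", "##ning", "gonna"}

# Hypothetical emoji-to-text map used for preprocessing normalization.
EMOJI_TO_TEXT = {"🐶": "dog"}


def wordpiece(word, vocab=VOCAB):
    """Split one word into subword pieces; return ['[UNK]'] if any span fails."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces carry the ## prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            # No vocabulary entry covers this span: the whole word is lost.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text):
    """Lowercase, split on whitespace, then apply WordPiece to each word."""
    return [p for w in text.lower().split() for p in wordpiece(w)]


def unk_stats(examples):
    """Return (fraction of examples with >= 1 [UNK], mean [UNK] per example)."""
    counts = [tokenize(e).count("[UNK]") for e in examples]
    frac = sum(c > 0 for c in counts) / len(counts)
    mean = sum(counts) / len(counts)
    return frac, mean


def normalize_emoji(text):
    """Replace each known emoji with a text equivalent before tokenization."""
    for emoji, name in EMOJI_TO_TEXT.items():
        text = text.replace(emoji, f" {name} ")
    return " ".join(text.split())


if __name__ == "__main__":
    raw = ["the 🐶 is happy", "the dog is happy"]
    print(tokenize(raw[0]))
    print(unk_stats(raw))
    print(unk_stats([normalize_emoji(t) for t in raw]))
```

The two values returned by unk_stats correspond to the kind of statistics quoted in the abstract (share of examples containing at least one [UNK], and mean [UNK] count per example). Slang such as "gonna" tokenizes cleanly in this sketch because it is in the toy vocabulary, mirroring the abstract's point that slang falls largely within WordPiece coverage.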
