Probabilistic Linguistic Knowledge and Token-level Text Augmentation

Zhengxiang Wang

arXiv:2306.16644 · cs.CL · July 4, 2023

TL;DR

This study evaluates token-level text augmentation techniques and the impact of probabilistic linguistic knowledge, finding limited effectiveness of these methods across different models and languages in a question matching task.

Contribution

The paper introduces two text augmentation programs, REDA and REDA$_{NG}$, and provides a comprehensive evaluation of their effectiveness and of the role of probabilistic linguistic knowledge.

Findings

01. Token-level augmentation techniques are generally ineffective.

02. Probabilistic linguistic knowledge has minimal impact.

03. Results are consistent across Chinese and English datasets.

Abstract

This paper investigates the effectiveness of token-level text augmentation and the role of probabilistic linguistic knowledge within a linguistically-motivated evaluation context. Two text augmentation programs, REDA and REDA$_{NG}$, were developed, both implementing five token-level text editing operations: Synonym Replacement (SR), Random Swap (RS), Random Insertion (RI), Random Deletion (RD), and Random Mix (RM). REDA$_{NG}$ leverages pretrained $n$-gram language models to select the most likely augmented texts from REDA's output. Comprehensive and fine-grained experiments were conducted on a binary question matching classification task in both Chinese and English. The results strongly refute the general effectiveness of the five token-level text augmentation techniques under investigation, whether applied together or separately, and irrespective of various common classification…
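
To make the five editing operations and the $n$-gram selection step concrete, here is a minimal Python sketch. It is not the paper's REDA or REDA$_{NG}$ code. The toy `SYNONYMS` table, all function names, the reading of Random Mix as "apply two randomly chosen operations in sequence", and the add-one-smoothed bigram model `BigramLM` (standing in for a pretrained $n$-gram model) are assumptions made purely for illustration.

```python
import math
import random
from collections import Counter

# Hypothetical toy synonym table; a real system would use a lexical resource.
SYNONYMS = {"quick": ["fast", "rapid"], "buy": ["purchase"]}

def synonym_replace(tokens, rng):
    """SR: replace one random token that has an entry in the toy table."""
    idxs = [i for i, t in enumerate(tokens) if t in SYNONYMS]
    out = list(tokens)
    if idxs:
        i = rng.choice(idxs)
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_swap(tokens, rng):
    """RS: swap the tokens at two distinct random positions."""
    out = list(tokens)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_insert(tokens, rng):
    """RI: insert a synonym of some token at a random position."""
    cands = [s for t in tokens for s in SYNONYMS.get(t, [])]
    out = list(tokens)
    if cands:
        out.insert(rng.randrange(len(out) + 1), rng.choice(cands))
    return out

def random_delete(tokens, rng):
    """RD: delete one random token, always keeping at least one."""
    out = list(tokens)
    if len(out) > 1:
        del out[rng.randrange(len(out))]
    return out

OPS = [synonym_replace, random_swap, random_insert, random_delete]

def random_mix(tokens, rng):
    """RM (assumed reading): apply two randomly chosen operations in turn."""
    out = list(tokens)
    for op in rng.sample(OPS, 2):
        out = op(out, rng)
    return out

class BigramLM:
    """Add-one-smoothed bigram model; a stand-in for a pretrained n-gram LM."""

    def __init__(self, corpus):
        self.uni, self.bi = Counter(), Counter()
        for sent in corpus:
            padded = ["<s>"] + list(sent) + ["</s>"]
            self.uni.update(padded)
            self.bi.update(zip(padded, padded[1:]))
        self.vocab_size = len(self.uni)

    def log_prob(self, tokens):
        padded = ["<s>"] + list(tokens) + ["</s>"]
        return sum(
            math.log((self.bi[(a, b)] + 1) / (self.uni[a] + self.vocab_size))
            for a, b in zip(padded, padded[1:])
        )

def select_most_likely(candidates, lm, k=1):
    """Keep the k candidates the language model scores as most probable."""
    return sorted(candidates, key=lm.log_prob, reverse=True)[:k]

if __name__ == "__main__":
    rng = random.Random(0)
    lm = BigramLM([["where", "can", "i", "buy", "a", "quick", "snack"]])
    source = ["where", "can", "i", "buy", "a", "quick", "snack"]
    candidates = [random_swap(source, rng) for _ in range(5)]
    print(select_most_likely(candidates, lm, k=1))
```

Unnormalized log-probability favors shorter candidates, so a selection step that compares texts of different lengths (for example after Random Deletion or Random Insertion) may need per-token normalization. The sketch leaves that choice open.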

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification