Empowering Character-level Text Infilling by Eliminating Sub-Tokens

Houxing Ren; Mingjie Zhan; Zhongyuan Wu; Hongsheng Li

arXiv:2405.17103 · cs.CL · June 17, 2024

TL;DR

This paper introduces FIM-SE, a novel method for character-level text infilling that eliminates sub-token prediction during inference, significantly improving performance over previous approaches.

Contribution

The paper proposes FIM-SE (Fill-In-the-Middle with both Starting and Ending character constraints). It recasts character-level infilling in a line-level format so that the model never has to predict a sub-token at inference, and it adds special tokens that guide generation.

Findings

01

FIM-SE outperforms previous methods in character-level infilling tasks.

02

The line-level format avoids sub-token prediction at inference, sidestepping the large perplexity that models show on sub-tokens.

03

Incorporating special tokens improves generation guidance and overall performance.

Abstract

In infilling tasks, sub-tokens, which arise when a complete token is split into two parts, often emerge at the boundaries of prefixes, middles, and suffixes. Traditional methods train models at the token level, which leads to sub-optimal performance on character-level infilling tasks at inference. Alternatively, some approaches consider character-level infilling but rely on predicting sub-tokens at inference; this strategy weakens character-level infilling because the model's perplexity on sub-tokens is large. In this paper, we introduce FIM-SE, which stands for Fill-In-the-Middle with both Starting and Ending character constraints. The proposed method addresses character-level infilling tasks by utilizing a line-level format to avoid predicting any sub-token in inference. In addition, we incorporate two special…
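The line-level idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the token strings (`<PRE>`, `<SUF>`, `<START>`, `<END>`, `<MID>`), and the prompt layout are all assumptions made for illustration. The sketch moves the partial lines on either side of a character-level hole out of the prefix and suffix, so that a model would generate whole lines constrained to start and end with those fragments.

```python
# Hypothetical special-token names; the abstract does not spell them out.
PRE, SUF, START, END, MID = "<PRE>", "<SUF>", "<START>", "<END>", "<MID>"


def split_at_line_boundaries(document: str, start: int, end: int):
    """Split a character-level hole [start, end) into line-aligned pieces."""
    prefix, suffix = document[:start], document[end:]
    # Cut the prefix back to the end of its last complete line.
    cut = prefix.rfind("\n") + 1  # 0 when the prefix has no newline
    line_prefix, start_fragment = prefix[:cut], prefix[cut:]
    # Push the suffix forward to the end of its first (partial) line.
    newline = suffix.find("\n")
    cut = len(suffix) if newline == -1 else newline
    end_fragment, line_suffix = suffix[:cut], suffix[cut:]
    return line_prefix, start_fragment, end_fragment, line_suffix


def build_prompt(line_prefix, start_fragment, end_fragment, line_suffix):
    """Assumed prompt layout: line-level context first, then both constraints."""
    return (f"{PRE}{line_prefix}{SUF}{line_suffix}"
            f"{START}{start_fragment}{END}{end_fragment}{MID}")


def extract_middle(generated: str, start_fragment: str, end_fragment: str):
    """Recover the character-level middle from generated whole lines.

    Returns None when the generation violates either constraint.
    """
    if len(generated) < len(start_fragment) + len(end_fragment):
        return None
    if not generated.startswith(start_fragment):
        return None
    if not generated.endswith(end_fragment):
        return None
    return generated[len(start_fragment):len(generated) - len(end_fragment)]


document = "def add(a, b):\n    return a + b\n"
start = document.index("turn")          # hole begins inside the word "return"
end = document.index(" b\n")            # hole ends just before " b"
pieces = split_at_line_boundaries(document, start, end)
prompt = build_prompt(*pieces)
```

Here the hole splits the word `return`, so a token-level prompt would end in the fragment `re`. After the split, the context handed to the model ends at a newline, and `    re` and ` b` become constraints on the generated line rather than text the model must continue mid-token.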

Code & Models

Repositories

sensellm/fim-se (Official)

Videos

Empowering Character-level Text Infilling by Eliminating Sub-Tokens · underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification