Empowering Character-level Text Infilling by Eliminating Sub-Tokens
Houxing Ren, Mingjie Zhan, Zhongyuan Wu, Hongsheng Li

TL;DR
This paper introduces FIM-SE, a novel method for character-level text infilling that eliminates sub-token prediction during inference, significantly improving performance over previous approaches.
Contribution
The paper proposes FIM-SE, which recasts character-level infilling in a line-level format and adds special tokens to guide generation, so the model never has to predict a sub-token at inference.
Findings
FIM-SE outperforms previous methods in character-level infilling tasks.
The line-level format avoids sub-token prediction at inference, sidestepping the high perplexity that models show on sub-tokens.
Incorporating special tokens improves generation guidance and overall performance.
Abstract
In infilling tasks, sub-tokens, which arise when a complete token is split into two parts, often emerge at the boundaries between the prefix, middle, and suffix. Traditional methods train models at the token level, which leads to sub-optimal performance on character-level infilling at inference time. Alternatively, some approaches consider character-level infilling but rely on predicting sub-tokens at inference; this weakens their character-level infilling ability because the model has high perplexity on sub-tokens. In this paper, we introduce FIM-SE, which stands for Fill-In-the-Middle with both Starting and Ending character constraints. The proposed method addresses character-level infilling by using a line-level format that avoids predicting any sub-token at inference. In addition, we incorporate two special…
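To make the line-level idea concrete, here is a minimal Python sketch of one way a character-level span could be re-split at line boundaries. It is an illustration based only on the abstract's description, not the authors' implementation: the function name `line_level_split`, the dictionary keys, and the choice to keep the trailing newline in the ending fragment are all assumptions. The partial-line fragments left over on each side of the span are what would act as the starting and ending character constraints.

```python
def line_level_split(text: str, start: int, end: int) -> dict:
    """Re-split text around the character span [start, end) at line boundaries.

    Hypothetical helper: names and return layout are illustrative assumptions.
    """
    prefix, suffix = text[:start], text[end:]

    # Index just past the last newline in the prefix (0 if there is none),
    # so prefix[:cut] contains only complete lines.
    cut = prefix.rfind("\n") + 1

    # Index just past the first newline in the suffix (whole suffix if none),
    # so suffix[stop:] starts on a line boundary.
    newline_at = suffix.find("\n")
    stop = newline_at + 1 if newline_at != -1 else len(suffix)

    return {
        "prefix_lines": prefix[:cut],      # complete lines before the span
        "start_constraint": prefix[cut:],  # fragment the infill must start with
        "end_constraint": suffix[:stop],   # fragment the infill must end with
        "suffix_lines": suffix[stop:],     # complete lines after the span
    }


if __name__ == "__main__":
    text = "def add(a, b):\n    return a + b\nprint(add(1, 2))\n"
    middle = "urn a +"
    s = text.index(middle)
    for key, value in line_level_split(text, s, s + len(middle)).items():
        print(f"{key}: {value!r}")
```

In this example the span cuts through the word `return`, so a token-level model would face a sub-token at the boundary. After the re-split, the context on both sides consists of whole lines, and the fragments `'    ret'` and `' b\n'` are carried separately as constraints on the generated lines.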
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
