Joint Optimization of Tokenization and Downstream Model
Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki

TL;DR
This paper introduces a method that jointly optimizes a tokenizer and a downstream NLP model, so that tokenization adapts to the specific task and model. The method can be used during training or as post-processing for an already trained model.
Contribution
It presents a flexible method for optimizing tokenization jointly with the downstream model. The only requirement is access to loss values computed by that model, so the method applies across NLP tasks and languages.
Findings
Improved text classification accuracy across three languages.
Enhanced machine translation performance in eight language pairs.
Applicable to any NLP task through loss-based optimization.
Abstract
Since traditional tokenizers are isolated from a downstream task and model, they cannot output an appropriate tokenization depending on the task and model, although recent studies imply that the appropriate tokenization improves the performance. In this paper, we propose a novel method to find an appropriate tokenization to a given downstream model by jointly optimizing a tokenizer and the model. The proposed method has no restriction except for using loss values computed by the downstream model to train the tokenizer, and thus, we can apply the proposed method to any NLP task. Moreover, the proposed method can be used to explore the appropriate tokenization for an already trained model as post-processing. Therefore, the proposed method is applicable to various situations. We evaluated whether our method contributes to improving performance on text classification in three languages and…
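The abstract states only that the tokenizer is trained from loss values computed by the downstream model; it does not give the objective. The following is a minimal sketch of one way such loss-driven training could look, not the authors' implementation. It assumes a tokenizer that assigns each token a score, enumerates candidate tokenizations, weights each candidate's loss by a softmax over summed token scores, and takes one gradient step on the scores. All names (`segmentations`, `tokenization_weights`, `expected_loss`, `update_scores`) are hypothetical, and the toy loss stands in for a real downstream model.

```python
import math

def segmentations(text, vocab, max_len=4):
    """Enumerate every way to split `text` into tokens drawn from `vocab`."""
    if not text:
        return [[]]
    out = []
    for i in range(1, min(max_len, len(text)) + 1):
        piece = text[:i]
        if piece in vocab:
            for rest in segmentations(text[i:], vocab, max_len):
                out.append([piece] + rest)
    return out

def tokenization_weights(cands, scores):
    """Softmax over the summed per-token scores of each candidate."""
    totals = [sum(scores[t] for t in c) for c in cands]
    m = max(totals)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in totals]
    z = sum(exps)
    return [e / z for e in exps]

def expected_loss(cands, scores, loss_fn):
    """Downstream loss averaged over candidates, weighted by the tokenizer."""
    weights = tokenization_weights(cands, scores)
    losses = [loss_fn(c) for c in cands]
    return sum(w * l for w, l in zip(weights, losses)), weights, losses

def update_scores(cands, scores, loss_fn, lr=0.5):
    """One gradient-descent step on the token scores (assumed objective)."""
    exp_loss, weights, losses = expected_loss(cands, scores, loss_fn)
    new_scores = dict(scores)
    for cand, w, l in zip(cands, weights, losses):
        # d E[L] / d score_t = sum_c w_c * (L_c - E[L]) * count_c(t)
        for tok in cand:
            new_scores[tok] -= lr * w * (l - exp_loss)
    return new_scores

if __name__ == "__main__":
    vocab = {"a", "b", "ab"}
    scores = {tok: 0.0 for tok in vocab}
    cands = segmentations("abab", vocab)
    # Toy stand-in for a downstream model: shorter tokenizations cost less.
    toy_loss = lambda tokens: float(len(tokens))
    before, _, _ = expected_loss(cands, scores, toy_loss)
    scores = update_scores(cands, scores, toy_loss)
    after, _, _ = expected_loss(cands, scores, toy_loss)
    print(before, after)
```

Because the update uses nothing but per-candidate loss values, the same loop would work with any downstream model that can report a loss for a given tokenization. In a real setting the candidates would be sampled or restricted to an N-best list rather than enumerated exhaustively.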
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
