Vaporetto: Efficient Japanese Tokenization Based on Improved Pointwise Linear Classification
Koichi Akabe, Shunsuke Kanda, Yusuke Oda, Shinsuke Mori

TL;DR
This paper introduces Vaporetto, a method that significantly accelerates Japanese tokenization by optimizing the pointwise linear classification framework, achieving 5.7 times faster processing without losing accuracy.
Contribution
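Vaporetto presents optimizations for Japanese tokenization based on pointwise linear classification (PLC): composing multiple classifications into array-based operations, looking up features with memory-optimized automata, and applying three orthogonal pre-processing methods that reduce score calculation. Together these raise tokenization speed without lowering accuracy.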
Findings
Tokenization is 5.7 times faster than the current approach based on the same model
Tokenization accuracy does not decrease
Optimizations are orthogonal and broadly applicable
Abstract
This paper proposes an approach to improve the runtime efficiency of Japanese tokenization based on the pointwise linear classification (PLC) framework, which formulates the whole tokenization process as a sequence of linear classification problems. Our approach optimizes tokenization by leveraging the characteristics of the PLC framework and the task definition. Our approach involves (1) composing multiple classifications into array-based operations, (2) efficient feature lookup with memory-optimized automata, and (3) three orthogonal pre-processing methods for reducing actual score calculation. Thus, our approach makes the tokenization speed 5.7 times faster than the current approach based on the same model without decreasing tokenization accuracy. Our implementation is available at https://github.com/daac-tools/vaporetto under the MIT or Apache-2.0 license.
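To make the PLC formulation concrete, the following is a minimal sketch and not the Vaporetto implementation. It scores every character boundary with a linear model over character n-grams, first boundary by boundary, then by accumulating the scores of each matched pattern into one array in a single pass over the text. The weights in TOY_MODEL, the function names, and the window and n-gram sizes are all hypothetical and chosen only for illustration. A plain dictionary lookup stands in for the automaton-based pattern matching mentioned in the abstract.

```python
from typing import Dict, List

# Hypothetical toy model (not trained, not taken from Vaporetto).
# pattern -> {r: weight}, where r = (start index of pattern) - (boundary index).
# Boundary i lies between text[i] and text[i + 1].
TOY_MODEL: Dict[str, Dict[int, int]] = {
    "に": {0: 3, 1: 4},      # boundary right after "に", boundary right before "に"
    "東京": {0: -5},          # no boundary inside "東京"
    "京都": {0: -2, -1: 1},   # no boundary inside "京都", boundary after it
    "行く": {0: -4},          # no boundary inside "行く"
}


def scores_pointwise(text: str, model: Dict[str, Dict[int, int]],
                     window: int = 2, max_n: int = 2) -> List[int]:
    """Score each boundary independently by scanning its character window."""
    n = len(text)
    scores = [0] * max(n - 1, 0)
    for i in range(n - 1):
        lo = max(0, i - window + 1)   # first character inside the window
        hi = min(n, i + window + 1)   # one past the last character inside it
        for p in range(lo, hi):
            for length in range(1, max_n + 1):
                if p + length > hi:
                    break
                weight = model.get(text[p:p + length], {}).get(p - i)
                if weight is not None:
                    scores[i] += weight
    return scores


def scores_array(text: str, model: Dict[str, Dict[int, int]],
                 max_n: int = 2) -> List[int]:
    """Visit each pattern occurrence once and add all of its weights
    to the affected boundaries in a shared score array."""
    n = len(text)
    scores = [0] * max(n - 1, 0)
    for p in range(n):
        for length in range(1, max_n + 1):
            if p + length > n:
                break
            for r, weight in model.get(text[p:p + length], {}).items():
                i = p - r
                if 0 <= i < n - 1:
                    scores[i] += weight
    return scores


def split_tokens(text: str, scores: List[int]) -> List[str]:
    """Place a boundary wherever the linear score is positive."""
    tokens, start = [], 0
    for i, score in enumerate(scores):
        if score > 0:
            tokens.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        tokens.append(text[start:])
    return tokens


if __name__ == "__main__":
    sentence = "東京都に行く"
    print(scores_pointwise(sentence, TOY_MODEL))
    print(split_tokens(sentence, scores_array(sentence, TOY_MODEL)))
```

Both functions return the same scores for this toy model because every offset stored in it falls inside the window used by the pointwise version. The array-based version looks up each n-gram once per occurrence rather than once per boundary whose window contains it, which is the intuition behind composing multiple classifications into array operations.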
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Video Analysis and Summarization
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
