Vaporetto: Efficient Japanese Tokenization Based on Improved Pointwise Linear Classification

Koichi Akabe; Shunsuke Kanda; Yusuke Oda; Shinsuke Mori

arXiv:2406.17185·cs.CL·June 26, 2024

TL;DR

This paper introduces Vaporetto, an approach that optimizes Japanese tokenization within the pointwise linear classification (PLC) framework. It runs 5.7 times faster than the current approach based on the same model, with no loss of tokenization accuracy.

Contribution

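Vaporetto introduces three optimizations for PLC-based Japanese tokenization: composing multiple classifications into array-based operations, looking up features with memory-optimized automata, and applying three orthogonal pre-processing methods that reduce the actual score calculation. Together they increase speed without reducing accuracy.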

Findings

01

Tokenization is 5.7 times faster than the current approach based on the same model

02

Tokenization accuracy does not decrease relative to that approach

03

Optimizations are orthogonal and broadly applicable

Abstract

This paper proposes an approach to improve the runtime efficiency of Japanese tokenization based on the pointwise linear classification (PLC) framework, which formulates the whole tokenization process as a sequence of linear classification problems. Our approach optimizes tokenization by leveraging the characteristics of the PLC framework and the task definition. Our approach involves (1) composing multiple classifications into array-based operations, (2) efficient feature lookup with memory-optimized automata, and (3) three orthogonal pre-processing methods for reducing actual score calculation. Thus, our approach makes the tokenization speed 5.7 times faster than the current approach based on the same model without decreasing tokenization accuracy. Our implementation is available at https://github.com/daac-tools/vaporetto under the MIT or Apache-2.0 license.
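
To make the PLC formulation and point (1) of the abstract concrete, the following is a minimal sketch, not the authors' implementation. It scores each character boundary with a linear model over character n-grams in a fixed window, first one boundary at a time, then by visiting each n-gram occurrence once and adding its weights to every boundary it affects. The weights, bias, window size, and function names are all hypothetical toy values chosen for illustration, and the sketch ignores any feature types other than character n-grams as well as the automaton-based lookup and pre-processing steps.

```python
from collections import defaultdict

WINDOW = 2   # assumed: characters considered on each side of a boundary
MAX_N = 2    # assumed: longest character n-gram used as a feature
BIAS = -1.0  # hypothetical bias; a boundary is predicted when score > 0

# Hypothetical toy weights keyed by (n-gram, offset), where offset is the
# n-gram's start position minus the index of the character left of the boundary.
WEIGHTS = {
    ("に", 1): 2.0,     # "に" immediately right of the boundary
    ("に", 0): 2.0,     # "に" immediately left of the boundary
    ("東京", 0): -1.5,  # "東京" straddling the boundary
}


def scores_pointwise(text):
    """Score boundary i (between text[i] and text[i+1]) independently."""
    scores = []
    for i in range(len(text) - 1):
        score = BIAS
        lo = max(0, i - WINDOW + 1)          # first character in the window
        hi = min(len(text), i + WINDOW + 1)  # one past the last character
        for n in range(1, MAX_N + 1):
            for s in range(lo, hi - n + 1):
                score += WEIGHTS.get((text[s:s + n], s - i), 0.0)
        scores.append(score)
    return scores


def scores_arraywise(text):
    """Visit each n-gram occurrence once and update all affected boundaries."""
    by_ngram = defaultdict(list)
    for (ngram, offset), weight in WEIGHTS.items():
        by_ngram[ngram].append((offset, weight))

    scores = [BIAS] * (len(text) - 1)
    for s in range(len(text)):
        for n in range(1, MAX_N + 1):
            if s + n > len(text):
                break
            for offset, weight in by_ngram.get(text[s:s + n], ()):
                i = s - offset  # boundary this occurrence contributes to
                if 0 <= i < len(scores):
                    scores[i] += weight
    return scores


def split(text, scores):
    """Cut the text after every boundary whose score is positive."""
    tokens, start = [], 0
    for i, score in enumerate(scores):
        if score > 0:
            tokens.append(text[start:i + 1])
            start = i + 1
    tokens.append(text[start:])
    return tokens


text = "東京都に住む"
tokens = split(text, scores_arraywise(text))
```

Both functions compute the same scores for this toy model because every stored offset lies inside the window. The second form does one lookup per n-gram occurrence instead of repeating lookups for each boundary, which is the kind of restructuring the abstract describes as composing multiple classifications into array-based operations.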

Code & Models

Repositories

daac-tools/vaporetto
Official

Taxonomy

Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Video Analysis and Summarization

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings