TL;DR
This paper introduces CE-CLCNN, an end-to-end, image-based character-level convolutional neural network for text classification in languages without explicit word boundaries, such as Japanese, Chinese, and Thai. It reports state-of-the-art results and captures visual and semantic similarities between characters.
Contribution
The paper presents an image-based character encoder, trained end to end within a character-level CNN, to improve text classification for languages that have a large number of character types and no explicit word boundaries.
Findings
Achieved state-of-the-art results on document classification tasks
Captured visually and semantically similar characters effectively
Demonstrated robustness without word segmentation
Abstract
For analysing and/or understanding languages having no word boundaries based on morphological analysis such as Japanese, Chinese, and Thai, it is desirable to perform appropriate word segmentation before word embeddings. But it is inherently difficult in these languages. In recent years, various language models based on deep learning have made remarkable progress, and some of these methodologies utilizing character-level features have successfully avoided such a difficult problem. However, when a model is fed character-level features of the above languages, it often causes overfitting due to a large number of character types. In this paper, we propose a CE-CLCNN, character-level convolutional neural networks using a character encoder to tackle these problems. The proposed CE-CLCNN is an end-to-end learning model and has an image-based character encoder, i.e. the CE-CLCNN handles each…
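To make the motivation for an image-based encoder concrete, here is a minimal sketch, not the authors' model. It compares an index-based (one-hot) encoding with a raw pixel encoding for three characters. The 5x5 bitmaps, the function names, and the choice of squared Euclidean distance are all assumptions made for illustration. The encoder described in the abstract is learned, whereas this sketch only inspects the kind of input such an encoder would receive.

```python
# Hypothetical hand-drawn 5x5 bitmaps standing in for rendered character images.
GLYPHS = {
    "土": ["..#..", ".###.", "..#..", "..#..", "#####"],
    "士": ["..#..", "#####", "..#..", "..#..", ".###."],
    "口": ["#####", "#...#", "#...#", "#...#", "#####"],
}
VOCAB = list(GLYPHS)


def image_vector(ch):
    """Flatten a glyph bitmap into a list of 0/1 pixel values."""
    return [1.0 if px == "#" else 0.0 for row in GLYPHS[ch] for px in row]


def one_hot_vector(ch):
    """Index-based encoding: carries no information about glyph shape."""
    return [1.0 if ch == v else 0.0 for v in VOCAB]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


if __name__ == "__main__":
    for a, b in [("土", "士"), ("土", "口")]:
        print(
            a, b,
            "one-hot:", sq_dist(one_hot_vector(a), one_hot_vector(b)),
            "image:", sq_dist(image_vector(a), image_vector(b)),
        )
```

With one-hot vectors every pair of distinct characters is equally far apart, and the vector length grows with the number of character types. In pixel space the two visually similar characters are closer to each other than to the unrelated one, so a model that reads images starts with shape information already present in its input.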
