CoLa: Chinese Character Decomposition with Compositional Latent Components
Fan Shi, Haiyang Yu, Bin Li, Xiangyang Xue

TL;DR
CoLa is a novel deep latent variable model that learns to decompose Chinese characters into compositional components without predefined schemes, enabling effective zero-shot recognition and cross-dataset generalization.
Contribution
CoLa brings a learning-to-learn approach to Chinese character decomposition: instead of relying on human-defined radical or stroke schemes, as prior methods do, it learns the decomposition itself, which improves zero-shot recognition.
Findings
Outperforms previous methods in zero-shot Chinese character recognition (CCR).
Learned components reflect character structure interpretably.
Generalizes to historical oracle bone characters.
Abstract
Humans can decompose Chinese characters into compositional components and recombine them to recognize unseen characters. This reflects two cognitive principles: Compositionality, the idea that complex concepts are built on simpler parts; and Learning-to-learn, the ability to learn strategies for decomposing and recombining components to form new concepts. These principles provide inductive biases that support efficient generalization. They are critical to Chinese character recognition (CCR) in solving the zero-shot problem, which results from the common long-tail distribution of Chinese character datasets. Existing methods have made substantial progress in modeling compositionality via predefined radical or stroke decomposition. However, they often ignore the learning-to-learn capability, limiting their ability to generalize beyond human-defined schemes. Inspired by these principles, we…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
