CA-EHN: Commonsense Analogy from E-HowNet
Peng-Hsuan Li, Tsan-Yu Yang, Wei-Yun Ma

TL;DR
This paper introduces CA-EHN, a large-scale Chinese word analogy dataset built from E-HowNet, for evaluating how well word embeddings capture commonsense knowledge. Existing analogy datasets are mostly handcrafted and focus on morphological relations and named entities.
Contribution
The paper presents the first commonsense word analogy dataset, derived from the structured sense definitions in E-HowNet, for evaluating how well word representations embed commonsense knowledge.
Findings
CA-EHN contains 90,505 analogies covering 5,656 words and 763 relations.
Experiments indicate that performance on CA-EHN is a good indicator of how well word representations embed commonsense knowledge.
Abstract
Embedding commonsense knowledge is crucial for end-to-end models to generalize inference beyond training corpora. However, existing word analogy datasets have tended to be handcrafted, involving permutations of hundreds of words with only dozens of pre-defined relations, mostly morphological relations and named entities. In this work, we model commonsense knowledge down to word-level analogical reasoning by leveraging E-HowNet, an ontology that annotates 88K Chinese words with their structured sense definitions and English translations. We present CA-EHN, the first commonsense word analogy dataset containing 90,505 analogies covering 5,656 words and 763 relations. Experiments show that CA-EHN stands out as a great indicator of how well word representations embed commonsense knowledge. The dataset is publicly available at https://github.com/ckiplab/CA-EHN.
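To make the idea of word-level analogical reasoning concrete, the sketch below scores embeddings on analogy quadruples (a : b :: c : d) using the widely used vector-offset method: the predicted word is the one whose vector is closest in cosine similarity to b - a + c. This is an assumption for illustration only; the summary does not state the paper's exact evaluation protocol, and the toy vectors, the English words, and the function names are all hypothetical rather than taken from CA-EHN.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def solve_analogy(emb, a, b, c):
    """Return the word closest to (b - a + c), excluding the three query words."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb, analogies):
    """Fraction of analogies solved correctly, skipping those with unknown words."""
    covered = [q for q in analogies if all(w in emb for w in q)]
    if not covered:
        return 0.0
    correct = sum(1 for a, b, c, d in covered if solve_analogy(emb, a, b, c) == d)
    return correct / len(covered)


# Toy 2-d embeddings, invented for illustration (not real vectors, not CA-EHN data).
toy_emb = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -2.0],
}
toy_analogies = [("man", "woman", "king", "queen")]
accuracy = analogy_accuracy(toy_emb, toy_analogies)
```

With real data, `emb` would map each word to its pretrained vector and `analogies` would hold the quadruples loaded from the dataset files; the file format is not described here, so loading is left out of the sketch.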
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management
