CADEL: A Corpus of Administrative Web Documents for Japanese Entity Linking

Shohei Higashiyama; Masao Ideuchi; Masao Utiyama

arXiv:2603.29336·cs.CL·April 1, 2026

TL;DR

This paper introduces CADEL, an annotated corpus of Japanese administrative web documents for training and evaluating entity linking systems, with rich coverage of expressions that refer to Japan-specific entities.

Contribution

It presents a corpus design policy for the entity linking task and an annotated Japanese corpus built under that policy, addressing the scarcity of resources for evaluating Japanese systems.

Findings

01

High inter-annotator agreement confirms annotation consistency.

02

A preliminary entity disambiguation experiment based on string matching suggests that the corpus contains a substantial number of non-trivial cases.
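
To make the agreement finding concrete, here is a minimal sketch of one common way to score agreement between two annotators on entity linking annotations: pairwise F1 over (start, end, entity ID) triples. This page does not state which agreement measure the authors used, so the metric, the function name `pairwise_f1`, and the toy annotations are all assumptions for illustration.

```python
from typing import Set, Tuple

# A single annotation: (start offset, end offset, knowledge base entry ID).
# The IDs below are hypothetical placeholders, not real knowledge base entries.
Annotation = Tuple[int, int, str]


def pairwise_f1(ann_a: Set[Annotation], ann_b: Set[Annotation]) -> float:
    """F1 between two annotators, counting a match only when span and entity ID agree."""
    if not ann_a and not ann_b:
        return 1.0  # both annotators marked nothing: treat as full agreement
    overlap = len(ann_a & ann_b)
    return 2 * overlap / (len(ann_a) + len(ann_b))


# Toy example: the annotators agree on two of three mentions
# and disagree on the entity chosen for the second span.
annotator_a = {(0, 3, "E1"), (10, 13, "E2"), (20, 25, "E3")}
annotator_b = {(0, 3, "E1"), (10, 13, "E9"), (20, 25, "E3")}
score = pairwise_f1(annotator_a, annotator_b)
```

With exact matching, a disagreement on either the span boundaries or the linked entry counts as a miss for both annotators, which is why the score is symmetric.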

Abstract

Entity linking is the task of associating linguistic expressions with entries in a knowledge base that represent real-world entities and concepts. Language resources for this task have primarily been developed for English, and the resources available for evaluating Japanese systems remain limited. In this study, we develop a corpus design policy for the entity linking task and construct an annotated corpus for training and evaluating Japanese entity linking systems, with rich coverage of linguistic expressions referring to entities that are specific to Japan. Evaluation of inter-annotator agreement confirms the high consistency of the annotations in the corpus, and a preliminary experiment on entity disambiguation based on string matching suggests that the corpus contains a substantial number of non-trivial cases, supporting its potential usefulness as an evaluation benchmark.
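
The string-matching experiment mentioned in the abstract can be illustrated with a minimal sketch. It assumes a toy knowledge base and an exact-match rule after Unicode NFKC normalization; the entry IDs, titles, and the function `link_by_exact_match` are hypothetical, and the authors' actual matching procedure is not described on this page.

```python
import unicodedata
from typing import Dict, Optional

# Hypothetical toy knowledge base: entry ID -> canonical title.
TOY_KB: Dict[str, str] = {
    "E1": "東京都",
    "E2": "大阪市",
    "E3": "厚生労働省",
    "E4": "NHK",
}


def normalize(text: str) -> str:
    """Apply NFKC so full-width and half-width variants compare equal."""
    return unicodedata.normalize("NFKC", text).strip()


def link_by_exact_match(mention: str, kb: Dict[str, str]) -> Optional[str]:
    """Return the entry ID whose title equals the mention, or None if there is
    no match or more than one match (an ambiguous case)."""
    target = normalize(mention)
    matches = [eid for eid, title in kb.items() if normalize(title) == target]
    return matches[0] if len(matches) == 1 else None


# A mention identical to an entry title is linked directly.
trivial = link_by_exact_match("東京都", TOY_KB)
# Full-width letters are normalized before comparison.
fullwidth = link_by_exact_match("ＮＨＫ", TOY_KB)
# An abbreviation of 厚生労働省 does not match any title by string comparison.
abbreviation = link_by_exact_match("厚労省", TOY_KB)
```

Mentions such as abbreviations, which a baseline like this cannot resolve, are the kind of non-trivial case the abstract refers to; the share of such mentions in a corpus gives a rough indication of how demanding it is as a benchmark.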
