KnowCoder-X: Boosting Multilingual Information Extraction via Code
Yuxin Zuo, Wenxuan Jiang, Wenxuan Liu, Zixuan Li, Long Bai, Hanbin Wang, Yutao Zeng, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng

TL;DR
KnowCoder-X is a multilingual code-based large language model that significantly improves cross-lingual information extraction by standardizing schemas, using code generation, and fine-tuning on a large bilingual dataset, outperforming existing models.
Contribution
It introduces a novel code-based approach for universal multilingual IE, including schema standardization, cross-lingual instruction tuning, and a large bilingual dataset, achieving state-of-the-art results.
Findings
KnowCoder-X surpasses ChatGPT by 30.17% in cross-lingual IE.
It outperforms state-of-the-art models by 20.03%.
It demonstrates strong transferability across 29 unseen languages.
Abstract
Empirical evidence indicates that LLMs exhibit spontaneous cross-lingual alignment. However, although LLMs show promising cross-lingual alignment in Information Extraction (IE), a significant imbalance across languages persists, highlighting an underlying deficiency. To address this, we propose KnowCoder-X, a powerful code LLM with advanced cross-lingual and multilingual capabilities for universal IE. Firstly, it standardizes the representation of multilingual schemas using Python classes, ensuring a consistent ontology across different languages. Then, IE across languages is formulated as a unified code generation task. Secondly, we conduct IE cross-lingual alignment instruction tuning on the translated instance prediction task to enhance the model's cross-lingual transferability. During this phase, we also construct a high-quality and diverse bilingual IE parallel dataset with 257k…
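The abstract's first step, representing schemas as Python classes and treating extraction as code generation, can be illustrated with a small sketch. This is not the authors' code: the class names (`Entity`, `Person`, `Organization`, `WorksFor`), the helper `run_extraction`, the `results` variable convention, and both example outputs are assumptions made up for illustration. The point it shows is that outputs for sentences in different languages instantiate the same classes, so the ontology stays constant while only the extracted spans change.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Base class for entity types in this illustrative schema."""
    name: str


@dataclass
class Person(Entity):
    """A human being."""


@dataclass
class Organization(Entity):
    """A company, institution, or other organized group."""


@dataclass
class Relation:
    """Base class for relation types linking two entities."""
    head: Entity
    tail: Entity


@dataclass
class WorksFor(Relation):
    """The head (a Person) is employed by the tail (an Organization)."""


# Hypothetical schema registry: the only names generated code may use.
SCHEMA = {cls.__name__: cls for cls in (Person, Organization, WorksFor)}


def run_extraction(generated_code: str) -> list:
    """Run model-style output that assigns a list to `results`.

    Illustration only: exec() is not a security sandbox, so a real
    system should parse or validate generated code instead.
    """
    namespace = dict(SCHEMA)
    exec(generated_code, {"__builtins__": {}}, namespace)
    return namespace["results"]


# Made-up outputs for an English and a Chinese sentence with the same meaning.
EN_OUTPUT = (
    'results = [Person(name="Alice"), Organization(name="Acme Corp"), '
    'WorksFor(head=Person(name="Alice"), tail=Organization(name="Acme Corp"))]'
)
ZH_OUTPUT = (
    'results = [Person(name="爱丽丝"), Organization(name="Acme 公司"), '
    'WorksFor(head=Person(name="爱丽丝"), tail=Organization(name="Acme 公司"))]'
)

en = run_extraction(EN_OUTPUT)
zh = run_extraction(ZH_OUTPUT)

# The schema types line up across languages even though the spans differ.
same_types = [type(x).__name__ for x in en] == [type(x).__name__ for x in zh]
```

Because both outputs are ordinary Python objects, they can be compared by type regardless of the source language, which is one plausible reading of what a "consistent ontology across different languages" buys in practice.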
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: ALIGN · Ontology
