XAlign: Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages

Tushar Abhishek; Shivprasad Sagare; Bhavyajeet Singh; Anubhav Sharma; Manish Gupta; Vasudeva Varma

arXiv:2202.00291·cs.CL·April 26, 2022·1 cites


TL;DR

XAlign introduces a dataset and two unsupervised alignment methods for cross-lingual fact-to-text generation, which produces text in low-resource languages from English fact triples.

Contribution

The paper presents what the authors describe as the first cross-lingual fact-to-text alignment and generation work for low-resource languages: two unsupervised alignment methods, a 0.45M-pair dataset, and baseline generation models.

Findings

01

XAlign dataset contains 0.45M pairs across 8 languages.

02

5402 of the pairs have been manually annotated.

03

Baseline models demonstrate the feasibility of cross-lingual generation.

Abstract

Multiple critical scenarios (like Wikipedia text generation given English Infoboxes) need automated generation of descriptive text in low-resource (LR) languages from English fact triples. Previous work has focused on English fact-to-text (F2T) generation. To the best of our knowledge, there has been no previous attempt at cross-lingual alignment or generation for LR languages. Building an effective cross-lingual F2T (XF2T) system requires alignment between English structured facts and LR sentences. We propose two unsupervised methods for cross-lingual alignment. We contribute XAlign, an XF2T dataset with 0.45M pairs across 8 languages, of which 5402 pairs have been manually annotated. We also train strong baseline XF2T generation models on the XAlign dataset.
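To make the task concrete, here is a minimal sketch of what one aligned XF2T pair might look like in code: a set of English fact triples paired with a sentence in an LR language, plus a helper that flattens the facts into a single input string for a sequence-to-sequence model. The class names, field names, the `<S>`/`<P>`/`<O>` markers, and the example sentence are all assumptions made for illustration. They are not the XAlign schema or the authors' input format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fact:
    """One English fact triple (hypothetical field names)."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class AlignedPair:
    """English facts aligned with one sentence in a low-resource language."""
    lang: str          # language code of the sentence, e.g. "hi" for Hindi
    sentence: str      # target sentence in the LR language
    facts: List[Fact]  # English facts the sentence is assumed to express


def linearize(pair: AlignedPair) -> str:
    """Flatten the facts into one string a seq2seq model could consume.

    The marker tokens and the "generate <lang>:" prefix are illustrative only.
    """
    parts = [f"<S> {f.subject} <P> {f.predicate} <O> {f.obj}" for f in pair.facts]
    return f"generate {pair.lang}: " + " ".join(parts)


# Illustrative pair written for this sketch, not taken from the dataset.
example = AlignedPair(
    lang="hi",
    sentence="मैरी क्यूरी एक भौतिक विज्ञानी थीं।",
    facts=[Fact("Marie Curie", "occupation", "physicist")],
)

if __name__ == "__main__":
    print(linearize(example))
```

In a setup like this, the linearized string would be the model input and `sentence` the target. The alignment step described in the abstract is what decides which facts belong in `facts` for a given sentence.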


Code & Models

Repositories

tushar117/xalign
PyTorch · Official

Datasets

tushar117/xalign
dataset · 5.5k dl


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification