Triples-to-isiXhosa (T2X): Addressing the Challenges of Low-Resource Agglutinative Data-to-Text Generation

Francois Meyer; Jan Buys

arXiv:2403.07567·cs.CL·March 13, 2024·1 cites

TL;DR

This paper introduces T2X, a new dataset for isiXhosa data-to-text generation, along with a novel subword-based model and evaluation framework, highlighting the challenges of low-resource, agglutinative language modeling.

Contribution

The paper presents the T2X dataset, the SSPG architecture for agglutinative languages, and an evaluation framework, advancing data-to-text research for low-resource languages.

Findings

1. Pretrained language models underperform on T2X.

2. Fine-tuning translation models yields best results.

3. SSPG outperforms existing models for agglutinative languages.

Abstract

Most data-to-text datasets are for English, so the difficulties of modelling data-to-text for low-resource languages are largely unexplored. In this paper we tackle data-to-text for isiXhosa, which is low-resource and agglutinative. We introduce Triples-to-isiXhosa (T2X), a new dataset based on a subset of WebNLG, which presents a new linguistic context that shifts modelling demands to subword-driven techniques. We also develop an evaluation framework for T2X that measures how accurately generated text describes the data. This enables future users of T2X to go beyond surface-level metrics in evaluation. On the modelling side we explore two classes of methods: dedicated data-to-text models trained from scratch and pretrained language models (PLMs). We propose a new dedicated architecture aimed at agglutinative data-to-text, the Subword Segmental Pointer Generator (SSPG). It jointly…
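To make the "triples" in the dataset name concrete, here is a minimal sketch of how a WebNLG-style set of (subject, predicate, object) triples might be flattened into a single input string for a sequence-to-sequence model. The `<S>`, `<P>`, `<O>` marker tokens, the `linearise` function, and the example triple are illustrative assumptions, not the preprocessing used for T2X.

```python
from typing import List, Tuple

# A triple is (subject, predicate, object), as in WebNLG-style data.
Triple = Tuple[str, str, str]


def linearise(triples: List[Triple]) -> str:
    """Flatten triples into one string (hypothetical marker tokens)."""
    parts = []
    for subj, pred, obj in triples:
        # Assumption: entity names use underscores, which we turn into spaces.
        subj_text = subj.replace("_", " ")
        obj_text = obj.replace("_", " ")
        parts.append(f"<S> {subj_text} <P> {pred} <O> {obj_text}")
    return " ".join(parts)


# Illustrative triple, not taken from the T2X dataset.
example = [("South_Africa", "capital", "Cape_Town")]
print(linearise(example))
```

A data-to-text model would take a string like this as input and generate a sentence in the target language that verbalises the triple.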

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling