Exploiting News Article Structure for Automatic Corpus Generation of Entailment Datasets
Jan Christian Blaise Cruz, Jose Kristian Resabal, James Lin, Dan John Velasco, Charibeth Cheng

TL;DR
This paper introduces a method for automatically generating NLI datasets from news articles in low-resource languages, uses it to create the NewsPH-NLI dataset for Filipino, and pretrains and benchmarks ELECTRA-based transformers for Filipino.
Contribution
It presents a novel methodology for dataset creation from news articles, introduces the NewsPH-NLI dataset, and benchmarks new transformer models for Filipino NLP.
Findings
NewsPH-NLI is the first Filipino entailment dataset.
Pretrained transformers improve low-resource language NLP.
Transfer learning performance varies with data scarcity.
Abstract
Transformers represent the state-of-the-art in Natural Language Processing (NLP) in recent years, proving effective even in tasks done in low-resource languages. While pretrained transformers for these languages can be made, it is challenging to measure their true performance and capacity due to the lack of hard benchmark datasets, as well as the difficulty and cost of producing them. In this paper, we present three contributions: First, we propose a methodology for automatically producing Natural Language Inference (NLI) benchmark datasets for low-resource languages using published news articles. Through this, we create and release NewsPH-NLI, the first sentence entailment benchmark dataset in the low-resource Filipino language. Second, we produce new pretrained transformers based on the ELECTRA technique to further alleviate the resource scarcity in Filipino, benchmarking them on our…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Linear Layer · Dropout · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Residual Connection · Attention Is All You Need · WordPiece · Softmax
