Boosting Zero-Shot Crosslingual Performance using LLM-Based Augmentations with Effective Data Selection

Barah Fazili; Ashish Sunil Agrawal; Preethi Jyothi

arXiv:2407.10582·cs.CL·July 16, 2024

TL;DR

This paper introduces a method that uses large language models to generate task-specific data for low-resource languages and then selects a subset of it, improving zero-shot cross-lingual accuracy on sentiment analysis and natural language inference.

Contribution

The paper proposes data selection strategies that use a teacher model's label probabilities to pick a diverse, representative subset of LLM-generated data, improving zero-shot transfer to low-resource languages.

Findings

01 Improves zero-shot sentiment analysis accuracy by up to 7.13 absolute points.

02 Improves zero-shot natural language inference accuracy by up to 1.5 absolute points.

03 Training on a selected subset of LLM generations outperforms using all of them.

Abstract

Large language models (LLMs) are very proficient text generators. We leverage this capability of LLMs to generate task-specific data via zero-shot prompting and promote cross-lingual transfer for low-resource target languages. Given task-specific data in a source language and a teacher model trained on this data, we propose using this teacher to label LLM generations and employ a set of simple data selection strategies that use the teacher's label probabilities. Our data selection strategies help us identify a representative subset of diverse generations that help boost zero-shot accuracies while being efficient, in comparison to using all the LLM generations (without any subset selection). We also highlight other important design choices that affect cross-lingual performance such as the use of translations of source data and what labels are best to use for the LLM generations. We…
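A minimal sketch of the labelling and selection step described in the abstract, assuming the teacher returns one probability distribution over labels per LLM generation. The function names, the two strategies ("top" and "bottom" confidence), and the use of the argmax as the assigned label are illustrative assumptions, not the paper's exact procedure.

```python
from typing import List, Sequence, Tuple


def teacher_label(probs: Sequence[float]) -> Tuple[int, float]:
    """Return (argmax label, its probability) for one teacher distribution."""
    label = max(range(len(probs)), key=lambda i: probs[i])
    return label, probs[label]


def select_subset(
    texts: Sequence[str],
    teacher_probs: Sequence[Sequence[float]],
    k: int,
    strategy: str = "top",
) -> List[Tuple[str, int]]:
    """Pick k generations by teacher confidence (hypothetical strategies).

    "top" keeps the generations the teacher is most confident about;
    "bottom" keeps the ones it is least confident about.
    Each kept generation is paired with the teacher's argmax label.
    """
    if strategy not in ("top", "bottom"):
        raise ValueError(f"unknown strategy: {strategy}")

    scored = []
    for text, probs in zip(texts, teacher_probs):
        label, confidence = teacher_label(probs)
        scored.append((confidence, text, label))

    # Sort by confidence only; ties keep their original order.
    scored.sort(key=lambda item: item[0], reverse=(strategy == "top"))
    return [(text, label) for _, text, label in scored[:k]]
```

The selected pairs would then be added to the source-language training data before fine-tuning. Whether hard argmax labels or the full probability vector works better is one of the design choices the abstract raises; this sketch uses hard labels only for brevity.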

Code & Models

Repositories

csalt-research/llm-based-augmentations-with-effective-data-selection
PyTorch · Official

Videos

Boosting Zero-Shot Crosslingual Performance using LLM-Based Augmentations with Effective Data Selection · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training