CRAFT: Extracting and Tuning Cultural Instructions from the Wild

Bin Wang; Geyu Lin; Zhengyuan Liu; Chengwei Wei; Nancy F. Chen

arXiv:2405.03138 · cs.CL · July 11, 2024

TL;DR

This paper presents a novel pipeline for extracting culturally-related instruction datasets from unstructured data, improving language models' cultural reasoning, especially for underrepresented regions, demonstrated by experiments in Singapore, the Philippines, and the US.

Contribution

It introduces a self-instruction generation pipeline for extracting high-quality cultural instruction datasets from unstructured corpora, enhancing LLMs' regional cultural understanding.

Findings

01 Up to 6% performance improvement in regional cultural reasoning

02 Effective extraction of cultural concepts from unstructured data

03 Enhanced LLM capabilities in recognizing regional nuances

Abstract

Large language models (LLMs) have rapidly evolved as the foundation of various natural language processing (NLP) applications. Despite their wide range of use cases, their understanding of culturally related concepts and their cultural reasoning remain limited. Meanwhile, there is a significant need to enhance these models' cultural reasoning capabilities, especially concerning underrepresented regions. This paper introduces a novel pipeline for extracting high-quality, culturally related instruction tuning datasets from vast unstructured corpora. We utilize a self-instruction generation pipeline to identify cultural concepts and trigger instruction. By integrating with a general-purpose instruction tuning dataset, our model demonstrates enhanced capabilities in recognizing and understanding regional cultural nuances, thereby enhancing its reasoning capabilities. We conduct experiments across three regions:…
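The abstract names three stages: identifying cultural concepts in unstructured text, triggering instruction generation from them, and integrating the result with a general-purpose instruction tuning dataset. The sketch below is only an illustration of that flow under stated assumptions and is not the authors' implementation. The keyword list, the prompt wording, and every function and class name (`find_cultural_concepts`, `build_generation_prompt`, `extract_cultural_instructions`, `mix_datasets`, `Instruction`) are hypothetical, and the step that would send the prompt to an LLM is left out.

```python
import random
from dataclasses import dataclass
from typing import List

# Hypothetical keyword list standing in for a real cultural-concept detector.
# The paper's actual concept identification method is not described here.
CULTURAL_KEYWORDS = {"hawker", "kopitiam", "jeepney", "barangay", "thanksgiving"}


@dataclass
class Instruction:
    prompt: str
    source: str  # "cultural" or "general"


def find_cultural_concepts(text: str) -> List[str]:
    """Return the hypothetical keywords that occur in the text, sorted."""
    tokens = {tok.strip(".,;:!?\"'()").lower() for tok in text.split()}
    return sorted(tokens & CULTURAL_KEYWORDS)


def build_generation_prompt(passage: str, concepts: List[str]) -> str:
    """Build an assumed prompt asking an LLM to write a question about the concepts."""
    return (
        "Using the passage below, write one question and answer about "
        f"these cultural concepts: {', '.join(concepts)}.\n\nPassage: {passage}"
    )


def extract_cultural_instructions(corpus: List[str]) -> List[Instruction]:
    """Keep only passages that mention a cultural concept and wrap them as prompts."""
    instructions = []
    for passage in corpus:
        concepts = find_cultural_concepts(passage)
        if concepts:
            prompt = build_generation_prompt(passage, concepts)
            instructions.append(Instruction(prompt=prompt, source="cultural"))
    return instructions


def mix_datasets(
    cultural: List[Instruction], general: List[Instruction], seed: int = 0
) -> List[Instruction]:
    """Combine cultural and general-purpose instructions and shuffle deterministically."""
    mixed = list(cultural) + list(general)
    random.Random(seed).shuffle(mixed)
    return mixed


if __name__ == "__main__":
    corpus = [
        "We ate at a hawker centre near the kopitiam.",
        "The weather was mild today.",
    ]
    general = [Instruction(prompt="Summarize this text.", source="general")]
    for item in mix_datasets(extract_cultural_instructions(corpus), general):
        print(item.source, "|", item.prompt[:60])
```

In a real pipeline the keyword match would likely be replaced by a stronger concept detector, and the generated prompts would be passed to a language model whose outputs become the training instructions; both choices are outside what this page states.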

Taxonomy

Topics: Diverse Musicological Studies · Music and Audio Processing