MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages

Jack FitzGerald; Christopher Hench; Charith Peris; Scott Mackie; Kay Rottmann; Ana Sanchez; Aaron Nash; Liam Urbach; Vishesh Kakarala; Richa Singh; Swetha Ranganath; Laurie Crist; Misha Britan; Wouter Leeuwis; Gokhan Tur; Prem Natarajan

arXiv:2204.08582 · cs.CL · June 20, 2022 · 22 cites

TL;DR

The paper introduces MASSIVE, a large-scale multilingual dataset with 1 million virtual assistant utterances across 51 languages, enabling improved multilingual natural language understanding and evaluation.

Contribution

The paper provides a parallel, labeled multilingual dataset spanning 51 languages and 18 domains, along with baseline modeling results and a public release of the dataset, modeling code, and models.

Findings

01. Baseline XLM-R and mT5 models are evaluated on exact match accuracy, intent classification accuracy, and slot-filling F1 score.

02. The dataset covers 51 languages, 18 domains, 60 intents, and 55 slots.

03. The public release of the dataset, modeling code, and models facilitates future research.

Abstract

We present the MASSIVE dataset--Multilingual Amazon Slu resource package (SLURP) for Slot-filling, Intent classification, and Virtual assistant Evaluation. MASSIVE contains 1M realistic, parallel, labeled virtual assistant utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. MASSIVE was created by tasking professional translators to localize the English-only SLURP dataset into 50 typologically diverse languages from 29 genera. We also present modeling results on XLM-R and mT5, including exact match accuracy, intent classification accuracy, and slot-filling F1 score. We have released our dataset, modeling code, and models publicly.
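
The three metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes that each utterance has one intent label and a set of (slot type, slot value) pairs, that slot-filling F1 is micro-averaged over those pairs, and that an exact match requires the intent and every slot of an utterance to be correct. The function names, the intent and slot labels, and the toy predictions are all hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical representation: one (slot_type, slot_value) pair per filled slot.
Slot = Tuple[str, str]


def intent_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of utterances whose predicted intent equals the gold intent."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def slot_micro_f1(gold: List[Set[Slot]], pred: List[Set[Slot]]) -> float:
    """Micro-averaged F1 over slot pairs (an assumed averaging scheme)."""
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


def exact_match_accuracy(
    gold_intents: List[str],
    pred_intents: List[str],
    gold_slots: List[Set[Slot]],
    pred_slots: List[Set[Slot]],
) -> float:
    """Fraction of utterances with the intent and all slots predicted correctly."""
    hits = sum(
        gi == pi and gs == ps
        for gi, pi, gs, ps in zip(gold_intents, pred_intents, gold_slots, pred_slots)
    )
    return hits / len(gold_intents)


# Toy data with made-up labels, for illustration only.
gold_intents = ["alarm_set", "weather_query", "play_music"]
pred_intents = ["alarm_set", "weather_query", "play_radio"]
gold_slots = [
    {("time", "nine am"), ("date", "friday")},
    {("place_name", "boston")},
    {("artist_name", "queen")},
]
pred_slots = [
    {("time", "nine am"), ("date", "friday")},
    {("place_name", "new york")},
    {("artist_name", "queen")},
]

print(intent_accuracy(gold_intents, pred_intents))
print(slot_micro_f1(gold_slots, pred_slots))
print(exact_match_accuracy(gold_intents, pred_intents, gold_slots, pred_slots))
```

In the toy data, the second utterance has the right intent but a wrong slot value and the third has the right slot but a wrong intent, so only the first utterance counts as an exact match. This shows why exact match accuracy can sit well below either of the other two scores.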

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Sentiment Analysis and Opinion Mining

Methods: Multi-Head Attention · Attention Is All You Need · XLM-R · Linear Layer · Byte Pair Encoding · Adafactor · Softmax · Dropout · Inverse Square Root Schedule · Layer Normalization