MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages
Jack FitzGerald, Christopher Hench, Charith Peris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, Vishesh Kakarala, Richa Singh, Swetha Ranganath, Laurie Crist, Misha Britan, Wouter Leeuwis, Gokhan Tur, Prem Natarajan

TL;DR
The paper introduces MASSIVE, a large-scale multilingual dataset with 1 million virtual assistant utterances across 51 languages, enabling improved multilingual natural language understanding and evaluation.
Contribution
It provides a new, extensive multilingual dataset with diverse languages and domains, along with baseline modeling results and publicly released resources.
Findings
Baseline results are reported for XLM-R and mT5 on three metrics: exact match accuracy, intent classification accuracy, and slot-filling F1 score.
The dataset covers 51 languages, 18 domains, 60 intents, and 55 slots.
Public release of dataset and models facilitates future research.
Abstract
We present the MASSIVE dataset--Multilingual Amazon Slu resource package (SLURP) for Slot-filling, Intent classification, and Virtual assistant Evaluation. MASSIVE contains 1M realistic, parallel, labeled virtual assistant utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. MASSIVE was created by tasking professional translators to localize the English-only SLURP dataset into 50 typologically diverse languages from 29 genera. We also present modeling results on XLM-R and mT5, including exact match accuracy, intent classification accuracy, and slot-filling F1 score. We have released our dataset, modeling code, and models publicly.
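The three metrics named in the abstract can be illustrated with a minimal sketch. It assumes that each utterance has one intent label and a set of (slot type, slot value) pairs, that slot F1 is micro-averaged over those pairs, and that an exact match requires both the intent and all slots to be correct. The function names, labels, and toy predictions are hypothetical, and the paper's own scoring code may differ in its details.

```python
from typing import List, Tuple

# A slot is represented here as (slot_type, slot_value); this layout is an assumption.
Slot = Tuple[str, str]


def intent_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of utterances whose predicted intent equals the gold intent."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def slot_micro_f1(gold: List[List[Slot]], pred: List[List[Slot]]) -> float:
    """Micro-averaged F1 over (slot_type, slot_value) pairs across all utterances."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_set, p_set = set(g), set(p)
        tp += len(g_set & p_set)   # predicted slots that match gold
        fp += len(p_set - g_set)   # predicted slots not in gold
        fn += len(g_set - p_set)   # gold slots that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def exact_match_accuracy(
    gold_intents: List[str],
    pred_intents: List[str],
    gold_slots: List[List[Slot]],
    pred_slots: List[List[Slot]],
) -> float:
    """Fraction of utterances with the correct intent AND exactly the gold slots."""
    hits = sum(
        gi == pi and set(gs) == set(ps)
        for gi, pi, gs, ps in zip(gold_intents, pred_intents, gold_slots, pred_slots)
    )
    return hits / len(gold_intents)


# Toy data (hypothetical labels and predictions, for illustration only).
gold_intents = ["alarm_set", "weather_query", "play_music", "alarm_set"]
pred_intents = ["alarm_set", "weather_query", "play_music", "calendar_set"]
gold_slots = [
    [("time", "nine am")],
    [("place_name", "paris")],
    [("artist_name", "queen")],
    [("time", "six pm"), ("date", "tomorrow")],
]
pred_slots = [
    [("time", "nine am")],
    [("place_name", "london")],   # wrong slot value
    [("artist_name", "queen")],
    [("time", "six pm"), ("date", "tomorrow")],
]

acc = intent_accuracy(gold_intents, pred_intents)
f1 = slot_micro_f1(gold_slots, pred_slots)
em = exact_match_accuracy(gold_intents, pred_intents, gold_slots, pred_slots)
print(f"intent accuracy={acc:.2f}  slot F1={f1:.2f}  exact match={em:.2f}")
```

In this toy case the second utterance has a wrong slot value and the fourth a wrong intent, so exact match is lower than either of the other two metrics. That is why it is the strictest of the three.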
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Sentiment Analysis and Opinion Mining
Methods: Multi-Head Attention · Attention Is All You Need · XLM-R · Linear Layer · Byte Pair Encoding · Adafactor · Softmax · Dropout · Inverse Square Root Schedule · Layer Normalization
