MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing

Longxu Dou; Yan Gao; Mingyang Pan; Dingzirui Wang; Wanxiang Che; Dechen Zhan; Jian-Guang Lou

arXiv:2212.13492·cs.CL·December 29, 2022·1 cites

TL;DR

This paper introduces MultiSpider, the largest multilingual text-to-SQL dataset covering seven languages, analyzes language-specific challenges, and proposes a schema augmentation method that improves multilingual performance.

Contribution

It provides a comprehensive multilingual dataset for text-to-SQL and a schema augmentation framework to enhance cross-lingual performance.

Findings

01 Non-English languages show a 6.1% absolute drop in accuracy.

02 The SAVe framework improves overall accuracy by 1.8%.

03 The performance gap across languages is reduced by 29.5%.
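The findings mix two kinds of percentages: an absolute drop in accuracy (a difference in percentage points) and a reduction of a gap (read here as a relative change, which is an assumption about how the 29.5% figure is meant). The sketch below only illustrates that arithmetic. The function names and every input number are hypothetical and are not results from the paper.

```python
def absolute_drop(english_acc: float, other_acc: float) -> float:
    """Difference in percentage points between two accuracies (both in %)."""
    return english_acc - other_acc


def relative_gap_reduction(gap_before: float, gap_after: float) -> float:
    """Share of an accuracy gap that was closed, as a percentage of the original gap."""
    if gap_before == 0:
        raise ValueError("gap_before must be non-zero")
    return (gap_before - gap_after) / gap_before * 100.0


if __name__ == "__main__":
    # Hypothetical accuracies, chosen only to make the arithmetic easy to follow.
    drop = absolute_drop(english_acc=70.0, other_acc=63.9)
    print(f"absolute drop: {drop:.1f} points")

    # Hypothetical gaps (in points) before and after some augmentation step.
    closed = relative_gap_reduction(gap_before=10.0, gap_after=7.05)
    print(f"gap reduced by: {closed:.1f}%")
```

Keeping the two measures in separate functions makes it harder to confuse percentage points with relative percentages when comparing languages.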

Abstract

Text-to-SQL semantic parsing is an important NLP task, which greatly facilitates the interaction between users and the database and becomes the key component in many human-computer interaction systems. Much recent progress in text-to-SQL has been driven by large-scale datasets, but most of them are centered on English. In this work, we present MultiSpider, the largest multilingual text-to-SQL dataset which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese). Upon MultiSpider, we further identify the lexical and structural challenges of text-to-SQL (caused by specific language properties and dialect sayings) and their intensity across different languages. Experimental results under three typical settings (zero-shot, monolingual and multilingual) reveal a 6.1% absolute drop in accuracy in non-English languages. Qualitative and quantitative analyses…

Code & Models

Repositories

microsoft/ContextualSP (PyTorch, official)

Datasets

dreamerdeo/multispider (dataset, 63 downloads)

Videos

MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Web Data Mining and Analysis