Introducing MELI: the Mandarin-English Language Interview Corpus

Suyuan Liu; Molly Babel

arXiv:2603.27043·cs.CL·May 18, 2026

TL;DR

The MELI Corpus is an open-source dataset of Mandarin-English bilingual speech, comprising recordings, transcriptions, and metadata, designed for linguistic and acoustic analysis.

Contribution

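The paper introduces a new bilingual speech corpus with matched Mandarin and English sessions, transcribed and force-aligned at the word and phone levels, supporting within- and cross-speaker as well as within- and cross-language acoustic comparison.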

Findings

01

Contains 29.8 hours of speech data from 51 speakers

02

Includes detailed transcriptions and alignments for both languages

03

Documents code-switching patterns and speaker language attitudes

Abstract

We introduce the Mandarin-English Language Interview (MELI) Corpus, an open-source resource of 29.8 hours of speech from 51 Mandarin-English bilingual speakers. MELI combines matched sessions in Mandarin and English with two speaking styles: read sentences and spontaneous interviews about language varieties, standardness, and learning experiences. Audio was recorded at 44.1 kHz (16-bit, stereo). Interviews were fully transcribed, force-aligned at word and phone levels, and anonymized. Descriptively, the Mandarin component totals ~14.7 hours (mean duration 17.3 minutes) and the English component ~15.1 hours (mean duration 17.8 minutes). We report token/type statistics for each language and document code-switching patterns (frequent in Mandarin sessions; more limited in English sessions). The corpus design supports within-/cross-speaker, within-/cross-language acoustic comparison and links…
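
As a rough sanity check on the figures quoted in the abstract, the sketch below estimates the uncompressed audio footprint and the number of sessions implied by the per-language totals. It rests on assumptions that the abstract does not confirm: that the audio is stored as uncompressed PCM, that the 29.8 hours refers to recorded audio duration, and that "mean duration" is a per-session mean. The function names are hypothetical and are not part of any MELI tooling.

```python
# Figures taken from the abstract.
SAMPLE_RATE_HZ = 44_100   # 44.1 kHz
BYTES_PER_SAMPLE = 2      # 16-bit
CHANNELS = 2              # stereo


def pcm_bytes(hours: float) -> int:
    """Size in bytes of `hours` of audio, ASSUMING uncompressed PCM storage."""
    seconds = hours * 3600
    return round(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


def implied_sessions(total_hours: float, mean_minutes: float) -> float:
    """Number of sessions implied by a total duration and a mean duration.

    ASSUMES the reported mean is a per-session mean.
    """
    return total_hours * 60 / mean_minutes


if __name__ == "__main__":
    mandarin_hours, english_hours = 14.7, 15.1
    total_hours = mandarin_hours + english_hours

    print(f"Total hours: {total_hours:.1f}")
    print(f"Estimated PCM size: {pcm_bytes(total_hours) / 1e9:.1f} GB")
    print(f"Implied Mandarin sessions: {implied_sessions(mandarin_hours, 17.3):.1f}")
    print(f"Implied English sessions: {implied_sessions(english_hours, 17.8):.1f}")
```

Under these assumptions the two components sum to the reported 29.8 hours, the raw audio comes to roughly 18.9 GB (decimal), and both implied session counts round to 51, which is consistent with one session per speaker in each language.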
