Introducing MELI: the Mandarin-English Language Interview Corpus
Suyuan Liu, Molly Babel

TL;DR
The MELI Corpus is an open-source dataset of Mandarin-English bilingual speech (29.8 hours from 51 speakers), including recordings, transcriptions, and metadata, designed for linguistic and acoustic analysis.
Contribution
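The paper introduces a new bilingual speech corpus with matched Mandarin and English sessions, transcribed and force-aligned at word and phone levels, supporting within- and cross-speaker as well as within- and cross-language acoustic comparison.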
Findings
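Contains 29.8 hours of speech data from 51 speakers (~14.7 hours of Mandarin, ~15.1 hours of English)
Includes full interview transcriptions, force-aligned at word and phone levels, for both languages
Documents code-switching patterns (frequent in Mandarin sessions, more limited in English sessions) and speaker language attitudes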
Abstract
We introduce the Mandarin-English Language Interview (MELI) Corpus, an open-source resource of 29.8 hours of speech from 51 Mandarin-English bilingual speakers. MELI combines matched sessions in Mandarin and English with two speaking styles: read sentences and spontaneous interviews about language varieties, standardness, and learning experiences. Audio was recorded at 44.1 kHz (16-bit, stereo). Interviews were fully transcribed, force-aligned at word and phone levels, and anonymized. Descriptively, the Mandarin component totals ~14.7 hours (mean duration 17.3 minutes) and the English component ~15.1 hours (mean duration 17.8 minutes). We report token/type statistics for each language and document code-switching patterns (frequent in Mandarin sessions; more limited in English sessions). The corpus design supports within-/cross-speaker and within-/cross-language acoustic comparison and links…
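The duration figures in the abstract can be cross-checked with a little arithmetic. The sketch below assumes one session per speaker per language, so that each reported mean is taken over 51 sessions; the abstract does not state this explicitly. The names `N_SPEAKERS`, `MEAN_MINUTES`, and `total_hours` are illustrative only and are not part of the corpus or any tooling released with it.

```python
# Sanity check of the durations quoted in the abstract.
# ASSUMPTION: each mean duration is averaged over one session per speaker
# (51 sessions per language). The abstract does not confirm this.

N_SPEAKERS = 51
MEAN_MINUTES = {"Mandarin": 17.3, "English": 17.8}  # values from the abstract


def total_hours(mean_minutes: float, n_sessions: int = N_SPEAKERS) -> float:
    """Convert a per-session mean in minutes into a total in hours."""
    return mean_minutes * n_sessions / 60.0


totals = {lang: total_hours(m) for lang, m in MEAN_MINUTES.items()}
corpus_total = sum(totals.values())

for lang, hours in totals.items():
    print(f"{lang}: {hours:.1f} h")
print(f"Total: {corpus_total:.1f} h")
```

Under that assumption the per-language totals round to 14.7 and 15.1 hours and the sum to 29.8 hours, which agrees with the figures reported in the abstract.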
