Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

Srija Anand; Ashwin Sankar; Ishvinder Sethi; Aaditya Pareek; Kartik Rajput; Gaurav Yadav; Nikhil Narasimhan; Adish Pandya; Deepon Halder; Mohammed Safi Ur Rahman Khan; Praveen S V; Shobhit Banga; Mitesh M Khapra

arXiv:2604.21481 · cs.CL · April 24, 2026

TL;DR

This paper introduces a multilingual TTS evaluation framework built on large-scale pairwise comparisons across 10 Indic languages, and analyzes human preferences along six perceptual dimensions.

Contribution

The paper presents a controlled, multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation, and it analyzes rater preferences using Bradley-Terry modeling and SHAP analysis.

Findings

01. Evaluated 7 state-of-the-art TTS systems using over 120K pairwise comparisons.

02. Collected perceptual judgments across 6 dimensions from over 1900 native raters.

03. Constructed a multilingual leaderboard and analyzed model trade-offs.

Abstract

Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text-to-Speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preference using SHAP analysis and…
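To make the leaderboard step concrete, here is a minimal sketch of how Bradley-Terry strengths can be estimated from pairwise wins. It is not the authors' implementation: the function name `fit_bradley_terry`, the `(winner, loser)` input format, and the system labels are assumptions for illustration, and the fitting routine is the standard minorization-maximization (MM) update rather than whatever estimator the paper uses. Ties and per-language or per-dimension breakdowns are ignored.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Under the model, P(i beats j) = p_i / (p_i + p_j).
    Returns a dict mapping each system to a strength; strengths sum to 1.
    Hypothetical helper for illustration; assumes no ties.
    """
    wins = defaultdict(float)         # total wins per system
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    systems = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        systems.update((winner, loser))
    systems = sorted(systems)

    # Start from uniform strengths.
    strength = {s: 1.0 / len(systems) for s in systems}
    for _ in range(n_iter):
        updated = {}
        for i in systems:
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = 0.0
            for j in systems:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        # Strengths are only identified up to scale, so renormalize.
        total = sum(updated.values())
        updated = {s: v / total for s, v in updated.items()}
        delta = max(abs(updated[s] - strength[s]) for s in systems)
        strength = updated
        if delta < tol:
            break
    return strength


def leaderboard(comparisons):
    """Return systems sorted from strongest to weakest."""
    strength = fit_bradley_terry(comparisons)
    return sorted(strength, key=strength.get, reverse=True)


# Toy data with made-up system names (not results from the paper).
toy = (
    [("sys_a", "sys_b")] * 7 + [("sys_b", "sys_a")] * 3
    + [("sys_b", "sys_c")] * 6 + [("sys_c", "sys_b")] * 4
    + [("sys_a", "sys_c")] * 8 + [("sys_c", "sys_a")] * 2
)
ranking = leaderboard(toy)
```

One caveat with this sketch: a system that never wins is driven to strength 0, and one that never loses has no finite maximum-likelihood estimate, so real pipelines usually add regularization or a prior before reporting scores.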
