Modeling Topics and Sociolinguistic Variation in Code-Switched Discourse: Insights from Spanish-English and Spanish-Guaraní
Nemika Tyagi, Nelvin Licona Guevara, Olga Kellert

TL;DR
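This paper introduces an LLM-assisted annotation pipeline for analyzing sociolinguistic and topical patterns in code-switched Spanish-English and Spanish-Guaraní discourse. The annotations reveal systematic links between gender, language dominance, and discourse function in the Miami data, and a diglossic division between formal Guaraní and informal Spanish in Paraguayan texts.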
Contribution
It presents a novel LLM-based method for automated sociolinguistic annotation in bilingual discourse, enabling large-scale analysis of sociolinguistic patterns in low-resource languages.
Findings
Gender, language dominance, and discourse function are systematically linked in the Miami (Spanish-English) data.
Paraguayan texts show a clear diglossic division between formal Guaraní and informal Spanish.
The corpus-scale distributions replicate and extend earlier interactional and sociolinguistic observations.
Abstract
This study presents an LLM-assisted annotation pipeline for the sociolinguistic and topical analysis of bilingual discourse in two typologically distinct contexts: Spanish-English and Spanish-Guaraní. Using large language models, we automatically labeled topic, genre, and discourse-pragmatic functions across a total of 3,691 code-switched sentences, integrated demographic metadata from the Miami Bilingual Corpus, and enriched the Spanish-Guaraní dataset with new topic annotations. The resulting distributions reveal systematic links between gender, language dominance, and discourse function in the Miami data, and a clear diglossic division between formal Guaraní and informal Spanish in Paraguayan texts. These findings replicate and extend earlier interactional and sociolinguistic observations with corpus-scale quantitative evidence. The study demonstrates that large language models…
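The "resulting distributions" mentioned in the abstract can be pictured as label proportions grouped by a metadata field. The following is only a minimal sketch under stated assumptions: the field names (`gender`, `function`), the label values, and the helper `label_distribution` are hypothetical illustrations, not the authors' schema or code, and the toy records are placeholders rather than corpus data.

```python
from collections import Counter, defaultdict


def label_distribution(records, group_key, label_key):
    """Return {group: {label: proportion}} over annotated sentences.

    `records` is a list of dicts, one per sentence (hypothetical format).
    """
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec[group_key]][rec[label_key]] += 1
    # Normalize each group's label counts so they sum to 1.
    return {
        group: {label: n / sum(c.values()) for label, n in c.items()}
        for group, c in counts.items()
    }


# Toy placeholder records; values are illustrative, not results from the paper.
records = [
    {"gender": "F", "function": "quotation"},
    {"gender": "F", "function": "emphasis"},
    {"gender": "F", "function": "quotation"},
    {"gender": "M", "function": "emphasis"},
]

dist = label_distribution(records, "gender", "function")
```

The same helper could be pointed at any pair of fields (for example, a language label against a topic label) to inspect the kind of grouped proportions the abstract describes.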
Taxonomy
Topics: Multilingual Education and Policy · Language and cultural evolution · Syntax, Semantics, Linguistic Variation
