IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation
Anjali Kantharuban, Aarohi Srivastava, Fahim Faisal, Orevaoghene Ahia, Antonios Anastasopoulos, David Chiang, Yulia Tsvetkov, Graham Neubig

TL;DR
This paper introduces IDIOLEX, a framework for learning continuous sentence representations that capture style and dialect independently of semantic content, for use in stylistic analysis and in stylistically aligning language models.
Contribution
It proposes a novel method for decoupling style and dialect from semantics in sentence representations, evaluated on Arabic and Spanish dialects.
Findings
The learned representations capture meaningful stylistic and dialectal variation.
The representations transfer across domains for both analysis and classification.
Joint modeling of individual and community variation benefits downstream tasks.
Abstract
Existing sentence representations primarily encode what a sentence says, rather than how it is expressed, even though the latter is important for many applications. In contrast, we develop sentence representations that capture style and dialect, decoupled from semantic content. We call this the task of idiolectal representation learning. We introduce IDIOLEX, a framework for training models that combines supervision from a sentence's provenance with linguistic features of a sentence's content, to learn a continuous representation of each sentence's style and dialect. We evaluate the approach on dialects of both Arabic and Spanish. The learned representations capture meaningful variation and transfer across domains for analysis and classification. We further explore the use of these representations as training objectives for stylistically aligning language models. Our results suggest…
