Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces

Kwanghee Choi; Eunjung Yeo; Cheol Jun Cho; David R. Mortensen; David Harwath

arXiv:2603.12642·eess.AS·March 16, 2026

Open Access

TL;DR

This paper reveals that self-supervised speech models encode phonetic context through orthogonal subspaces, capturing neighboring phone information and phonetic boundaries within single frame representations.

Contribution

The paper demonstrates that phonological information from neighboring phones is compositionally and orthogonally encoded in frame-level representations, clarifying what "contextualized" means for self-supervised speech models.

Findings

01 Orthogonal subspaces encode relative phonetic positions

02 Single frame representations contain contextual phonological information

03 Implicit phonetic boundaries emerge in model representations

Abstract

Transformer-based self-supervised speech models (S3Ms) are often described as contextualized, yet what this entails remains unclear. Here, we focus on how a single frame-level S3M representation can encode phones and their surrounding context. Prior work has shown that S3Ms represent phones compositionally; for example, phonological vectors such as voicing, bilabiality, and nasality vectors are superposed in the S3M representation of [m]. We extend this view by proposing that phonological information from a sequence of neighboring phones is also compositionally encoded in a single frame, such that vectors corresponding to previous, current, and next phones are superposed within a single frame-level representation. We show that this structure has several properties, including orthogonality between relative positions, and emergence of implicit phonetic boundaries. Together, our findings…
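The superposition idea described in the abstract can be illustrated with a toy sketch. The code below is not the authors' method and uses no real S3M features: the dimensions, the phone inventory, the random phone codes, and the helper names `encode` and `decode` are all hypothetical, chosen only to show how vectors for the previous, current, and next phone could share one frame when each relative position occupies its own orthogonal subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, SUB = 64, 8                 # hypothetical frame size and per-position subspace size
PHONES = ["m", "a", "t", "s"]    # toy phone inventory (assumption)
POSITIONS = ["prev", "cur", "next"]

# QR of a random matrix yields orthonormal columns; disjoint column blocks
# therefore span mutually orthogonal subspaces, one per relative position.
q, _ = np.linalg.qr(rng.standard_normal((DIM, DIM)))
basis = {pos: q[:, i * SUB:(i + 1) * SUB] for i, pos in enumerate(POSITIONS)}

# One random coefficient vector per phone, reused at every position.
codes = {p: rng.standard_normal(SUB) for p in PHONES}


def encode(prev, cur, nxt):
    """Superpose three position-specific phone vectors into a single frame."""
    return sum(basis[pos] @ codes[ph]
               for pos, ph in zip(POSITIONS, (prev, cur, nxt)))


def decode(frame):
    """Project onto each position's subspace and pick the nearest phone code."""
    out = []
    for pos in POSITIONS:
        coeffs = basis[pos].T @ frame  # contributions from other positions vanish
        out.append(min(PHONES, key=lambda p: np.linalg.norm(coeffs - codes[p])))
    return tuple(out)


frame = encode("m", "a", "t")
recovered = decode(frame)

# Largest absolute inner product between basis vectors of different positions.
cross = max(np.abs(basis[a].T @ basis[b]).max()
            for a in POSITIONS for b in POSITIONS if a != b)

print(recovered, cross)
```

Because the three subspaces are orthogonal by construction, projecting the frame onto one of them removes the other two phones' contributions, so each position can be read out independently. In a real model the subspaces would have to be estimated from representations rather than built by hand, and the orthogonality would be approximate at best.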

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Face recognition and analysis