Categorize Early, Integrate Late: Divergent Processing Strategies in Automatic Speech Recognition

Nathan Roll; Pranav Bhalerao; Martijn Bartelds; Arjun Pawar; Yuka Tatsumi; Tolulope Ogunremi; Chen Shani; Calbert Graham; Meghan Sumner; and Dan Jurafsky

arXiv:2601.06972 · cs.CL · January 13, 2026

Open Access

TL;DR

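This paper compares how Transformer and Conformer speech encoders process information across depth. Conformers categorize early, resolving phoneme categories and speaker gender in shallow layers, while Transformers integrate late, deferring phoneme, accent, and duration encoding to deep layers. The authors suggest that Conformers' front-loaded categorization may benefit low-latency streaming.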

Contribution

Introduces Architectural Fingerprinting, a probing framework to analyze and compare the processing hierarchies of Transformer and Conformer speech models.
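
As a rough illustration of what layer-wise probing involves, the sketch below fits a linear probe on each layer's activations and reports the relative depth at which probe accuracy first approaches its peak. This is not the paper's implementation: the ridge-regression probe, the 95%-of-peak threshold, the depth normalization, the synthetic activations, and every function name here are assumptions chosen for illustration.

```python
import numpy as np


def probe_accuracy(train_x, train_y, test_x, test_y, num_classes, ridge=1e-3):
    """Fit a ridge-regularized linear probe on one-hot targets; return test accuracy."""
    # Append a constant column so the probe is affine rather than purely linear.
    xb = np.hstack([train_x, np.ones((train_x.shape[0], 1))])
    targets = np.eye(num_classes)[train_y]
    gram = xb.T @ xb + ridge * np.eye(xb.shape[1])
    weights = np.linalg.solve(gram, xb.T @ targets)
    test_xb = np.hstack([test_x, np.ones((test_x.shape[0], 1))])
    preds = np.argmax(test_xb @ weights, axis=1)
    return float(np.mean(preds == test_y))


def emergence_depth(accuracies, fraction=0.95):
    """Relative depth in (0, 1] of the first layer reaching `fraction` of peak accuracy.

    The threshold and the (layer index + 1) / num_layers normalization are
    assumptions, not the paper's definition.
    """
    accuracies = np.asarray(accuracies, dtype=float)
    threshold = fraction * accuracies.max()
    first_layer = int(np.argmax(accuracies >= threshold))  # index of first True
    return (first_layer + 1) / len(accuracies)


def synthetic_layers(num_layers=6, num_classes=3, dim=8, n=300, seed=0):
    """Toy stand-in for encoder activations: class separation grows with layer index."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, num_classes, size=n)
    means = rng.normal(size=(num_classes, dim))
    layers = [scale * means[labels] + rng.normal(size=(n, dim))
              for scale in np.linspace(0.0, 4.0, num_layers)]
    return layers, labels


def layerwise_profile(layers, labels, num_classes, train_fraction=0.7):
    """Probe every layer with the same train/test split; return one accuracy per layer."""
    split = int(train_fraction * len(labels))
    return [probe_accuracy(x[:split], labels[:split], x[split:], labels[split:], num_classes)
            for x in layers]
```

For example, `emergence_depth([0.2, 0.5, 0.9, 0.92])` returns `0.75`, because the third of four layers is the first to reach 95% of the peak accuracy. With real models, the per-layer arrays would come from an encoder's hidden states rather than from `synthetic_layers`.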

Findings

01. Conformers resolve phoneme categories 29% earlier in depth.

02. Transformers defer phoneme, accent, and duration encoding to deep layers (49-57%).

03. The different hierarchies suggest distinct advantages for low-latency and context-rich tasks.

Abstract

In speech language modeling, two architectures dominate the frontier: the Transformer and the Conformer. However, it remains unknown whether their comparable performance stems from convergent processing strategies or distinct architectural inductive biases. We introduce Architectural Fingerprinting, a probing framework that isolates the effect of architecture on representation, and apply it to a controlled suite of 24 pre-trained encoders (39M-3.3B parameters). Our analysis reveals divergent hierarchies: Conformers implement a "Categorize Early" strategy, resolving phoneme categories 29% earlier in depth and speaker gender by 16% depth. In contrast, Transformers "Integrate Late," deferring phoneme, accent, and duration encoding to deep layers (49-57%). These fingerprints suggest design heuristics: Conformers' front-loaded categorization may benefit low-latency streaming, while…

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Face recognition and analysis