From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models
Asım Ersoy, Basel Mousi, Shammur Chowdhury, Firoj Alam, Fahim Dalvi, Nadir Durrani

TL;DR
This paper investigates how large models trained on speech, text, or both develop semantic concepts, using Latent Concept Analysis to compare their internal representations and understanding.
Contribution
It applies Latent Concept Analysis to compare concept formation in speech and text models, trained individually and jointly, highlighting their differences and the potential benefits of multimodal training.
Findings
Speech and text models develop distinct semantic structures.
Multimodal training leads to richer semantic representations.
Latent Concept Analysis effectively uncovers internal concept abstractions.
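As a rough illustration of what "uncovering latent concepts" can look like in practice, the sketch below groups token representations into clusters and reads each cluster as a candidate concept. This summary does not specify how Latent Concept Analysis forms its groups, so the average-linkage agglomerative clustering used here is an assumption chosen for illustration. The function name `cluster_tokens`, the tokens, and the 2-d vectors are hypothetical toy values, not data from the paper.

```python
from itertools import combinations
from math import dist


def cluster_tokens(vectors, num_clusters):
    """Greedy average-linkage agglomerative clustering (illustrative only)."""
    # Start with one singleton cluster per token occurrence.
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Average pairwise Euclidean distance between members of a and b.
        total = sum(dist(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > num_clusters:
        # Find and merge the closest pair of clusters (a < b by construction).
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical stand-ins for contextual representations taken from a model layer.
tokens = ["Paris", "London", "Berlin", "ran", "walked", "jumped"]
vectors = [(0.9, 0.1), (1.0, 0.2), (0.8, 0.0), (0.1, 0.9), (0.0, 1.0), (0.2, 0.8)]

# Each cluster is read as one candidate latent concept.
concepts = [sorted(tokens[i] for i in c) for c in cluster_tokens(vectors, 2)]
for concept in concepts:
    print(concept)
```

On this toy input the city names and the motion verbs fall into separate clusters. With real models, the vectors would come from a chosen layer of a speech or text encoder, and the resulting clusters would still need to be interpreted or labeled.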
Abstract
The emergence of large language models (LLMs) has demonstrated that systems trained solely on text can acquire extensive world knowledge, develop reasoning capabilities, and internalize abstract semantic concepts--showcasing properties that can be associated with general intelligence. This raises an intriguing question: Do such concepts emerge in models trained on other modalities, such as speech? Furthermore, when models are trained jointly on multiple modalities: Do they develop a richer, more structured semantic understanding? To explore this, we analyze the conceptual structures learned by speech and textual models both individually and jointly. We employ Latent Concept Analysis, an unsupervised method for uncovering and interpreting latent representations in neural networks, to examine how semantic abstractions form across modalities. For reproducibility we made scripts and other…
Taxonomy
Topics: Semantic Web and Ontologies
