Analyse spectrale des textes: détection automatique des frontières de langue et de discours (Spectral analysis of texts: automatic detection of language and discourse boundaries)
Pascal Vaillant, Richard Nock, Claudia Henry

TL;DR
This paper introduces a spectral analysis method for automatically detecting language and discourse boundaries in texts by clustering vocabulary based on syntagmatic and paradigmatic similarities, useful for multilingual and mixed-language corpora.
Contribution
It presents a novel spectral analysis framework for clustering vocabulary into sublanguages and linguistic classes, enabling automatic boundary detection in multilingual texts.
Findings
Spectral analysis of transition matrices reveals word distributions within clusters.
Words cluster into sublanguages and semantic classes based on similarity measures.
The method segments multilingual texts into linguistically homogeneous segments.
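To make the last finding concrete, here is a minimal sketch of how a text could be cut into homogeneous segments once each word has a cluster label. It is an illustration under stated assumptions, not the authors' procedure: the function `segment_by_cluster` and the mapping `word_cluster` are hypothetical names, and the labels are assumed to come from a prior clustering step.

```python
from itertools import groupby

def segment_by_cluster(tokens, word_cluster):
    """Group consecutive tokens that share the same cluster label.

    `word_cluster` is a hypothetical dict mapping each word to a
    sublanguage label; words missing from it get the label None.
    """
    segments = []
    for label, run in groupby(tokens, key=lambda w: word_cluster.get(w)):
        segments.append((label, list(run)))
    return segments

# Toy example with hand-assigned labels (assumed, not learned).
toy_clusters = {"the": "A", "cat": "A", "le": "B", "chat": "B"}
toy_segments = segment_by_cluster(["the", "cat", "le", "chat"], toy_clusters)
```

Each boundary between two consecutive segments is a candidate language boundary. A practical system would probably smooth the labels first, since a single word shared by two languages would otherwise create spurious segments.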
Abstract
We propose a theoretical framework within which information on the vocabulary of a given corpus can be inferred on the basis of statistical information gathered on that corpus. Inferences can be made on the categories of the words in the vocabulary, and on their syntactical properties within particular languages. Based on the same statistical data, it is possible to build matrices of syntagmatic similarity (bigram transition matrices) or paradigmatic similarity (probability for any pair of words to share common contexts). When clustered with respect to their syntagmatic similarity, words tend to group into sublanguage vocabularies, and when clustered with respect to their paradigmatic similarity, into syntactic or semantic classes. Experiments have explored the first of these two possibilities. Their results are interpreted in the frame of a Markov chain modelling of the corpus'…
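The syntagmatic side of the abstract can be sketched as follows: count bigrams, normalise the rows to obtain a Markov transition matrix, then split the vocabulary spectrally. This is a sketch under assumptions, not the paper's algorithm. The function names are hypothetical, the counts are symmetrised to obtain an undirected affinity, and the split uses the second eigenvector of the normalised graph Laplacian, a standard spectral clustering device that stands in for whatever decomposition the authors apply.

```python
import numpy as np

def bigram_transition_matrix(tokens):
    """Return (vocab, P, counts); P is the row-stochastic bigram matrix."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for a, b in zip(tokens, tokens[1:]):
        counts[index[a], index[b]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # words with no successor keep a zero row
    return vocab, counts / row_sums, counts

def spectral_bipartition(counts):
    """Split the vocabulary in two using the Fiedler vector (assumed method)."""
    W = counts + counts.T                    # symmetrised affinity (assumption)
    d = W.sum(axis=1)
    d[d == 0] = 1.0                          # guard against isolated words
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    return vecs[:, 1] >= 0                   # sign of the second eigenvector

# Toy mixed-language corpus (invented for illustration).
english = "the cat sees the dog the dog sees the cat".split()
french = "le chat voit le chien le chien voit le chat".split()
vocab, P, counts = bigram_transition_matrix(english + french)
labels = dict(zip(vocab, spectral_bipartition(counts)))
```

On this toy corpus the two vocabularies are linked by a single bigram at the language switch, so the sign of the second eigenvector separates them. Which side gets `True` is arbitrary, because an eigenvector is only defined up to sign.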
Taxonomy
Topics: Natural Language Processing Techniques · Image Retrieval and Classification Techniques · Rough Sets and Fuzzy Logic
