Topological quantification of ambiguity in semantic search
Thomas Roland Barillot, Alex De Castro

TL;DR
This paper explores how the topological structure of sentence embeddings encodes semantic ambiguity, using persistent homology metrics to distinguish ambiguous from unambiguous sentences in both simulated and real-world data.
Contribution
It introduces a formal topological framework for quantifying semantic ambiguity in sentence embeddings using persistent homology metrics, validated on a large corpus.
Findings
Ambiguous sentences separate from unambiguous ones in topological metrics.
Persistent homology provides a stable, model-agnostic signal of semantic discontinuities.
Results are consistent across different embedding models and data granularities.
Abstract
We studied how the local topological structure of sentence-embedding neighborhoods encodes semantic ambiguity. Extending ideas that link word-level polysemy to non-trivial persistent homology, we generalized the concept to full sentences and quantified the ambiguity of a query in a semantic search process with two persistent homology metrics: the 1-Wasserstein norm of $H_0$ and the maximum loop lifetime of $H_1$. We formalized the notion of ambiguity as the relative presence of semantic domains or topics in sentences. We then used this formalism to compute "ab-initio" simulations that encode datapoints as linear combinations of randomly generated single-topic vectors in an arbitrary embedding space, and demonstrated that ambiguous sentences separate from unambiguous ones in both metrics. Finally, we validated those findings on a real-world case by investigating a fully open corpus…
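The simulation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the dimension, the noise level, and the way the two neighborhoods are built are all assumptions made for illustration. Each datapoint is a weighted sum of random topic vectors plus noise. The H0 metric is approximated by total H0 persistence (the sum of component lifetimes), which matches the 1-Wasserstein norm only up to a convention-dependent constant factor. The H1 loop lifetime is left out because it needs a persistent homology library such as Ripser or GUDHI.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction on the unit sphere (hypothetical topic vector)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mix_topics(topics, weights, noise, rng):
    """Encode one datapoint as a weighted sum of topic vectors plus Gaussian noise."""
    dim = len(topics[0])
    return [
        sum(w * t[i] for w, t in zip(weights, topics)) + rng.gauss(0.0, noise)
        for i in range(dim)
    ]


def h0_lifetimes(points):
    """Finite H0 lifetimes of a Vietoris-Rips filtration.

    Every component is born at scale 0 and dies when it merges with another,
    so the finite lifetimes are the edge lengths of a minimum spanning tree
    (built here with Prim's algorithm).
    """
    n = len(points)
    best = [math.dist(points[0], p) for p in points]  # distance to the tree
    in_tree = [False] * n
    in_tree[0] = True
    lifetimes = []
    for _ in range(n - 1):
        j = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        lifetimes.append(best[j])
        in_tree[j] = True
        for k in range(n):
            if not in_tree[k]:
                best[k] = min(best[k], math.dist(points[j], points[k]))
    return lifetimes


def simulate(seed=0, dim=16, n_points=30, noise=0.05):
    """Return total H0 persistence for a one-topic and a two-topic neighborhood."""
    rng = random.Random(seed)
    topics = [random_unit_vector(dim, rng) for _ in range(2)]
    # Assumed "unambiguous" neighborhood: every point sits on topic 0.
    unambiguous = [mix_topics(topics, [1.0, 0.0], noise, rng) for _ in range(n_points)]
    # Assumed "ambiguous" neighborhood: points split evenly between both topics.
    ambiguous = [
        mix_topics(topics, [1.0, 0.0] if i % 2 == 0 else [0.0, 1.0], noise, rng)
        for i in range(n_points)
    ]
    return sum(h0_lifetimes(unambiguous)), sum(h0_lifetimes(ambiguous))


if __name__ == "__main__":
    unamb, amb = simulate()
    print(f"total H0 persistence, unambiguous: {unamb:.3f}")
    print(f"total H0 persistence, ambiguous:   {amb:.3f}")
```

In this toy setup the two-topic neighborhood should score higher, because joining the two clusters adds one long merge edge to the spanning tree. Whether the gap persists for real embeddings depends on the neighborhood size and the embedding model, which the sketch does not address.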
Taxonomy
Topics: Semantic Web and Ontologies · Advanced Text Analysis Techniques · Image Retrieval and Classification Techniques
