Investigating Distributions of Telecom Adapted Sentence Embeddings for Document Retrieval
Sujoy Roychowdhury, Sumit Soman, Ranjani Hosakere Gireesha, Vansh Chhabra, Neeraj Gunda, Subhadip Bandyopadhyay, Sai Krishna Bala

TL;DR
This study evaluates how telecom domain adaptation of sentence embeddings affects document retrieval accuracy, confidence intervals, and distributional properties, providing insights and metrics for optimizing embedding-based retrieval systems.
Contribution
It introduces a systematic method for threshold determination, new metrics for distribution overlap, and insights into the effects of fine-tuning and domain adaptation on embedding quality.
Findings
Fine-tuning improves mean bootstrapped accuracy and tightens the 95% confidence intervals.
Domain-specific embeddings differ significantly from domain-agnostic ones.
Embedding isotropy is poorly correlated with retrieval performance.
Abstract
A plethora of sentence embedding models makes it challenging to choose one, especially for technical domains rich with specialized vocabulary. In this work, we domain-adapt embeddings using telecom data for question answering. We evaluate embeddings obtained from publicly available models and their domain-adapted variants on both point retrieval accuracies and their (95%) confidence intervals. We establish a systematic method to obtain thresholds for similarity scores for different embeddings. As expected, we observe that fine-tuning improves mean bootstrapped accuracies. We also observe that it results in tighter confidence intervals, which further improve when pre-training is preceded by fine-tuning. We introduce metrics which measure the distributional overlaps of top-K, correct, and random document similarities with the question. Further, we show that these metrics are…
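The abstract reports bootstrapped mean accuracies with 95% confidence intervals but does not spell out the procedure here. A minimal sketch of one common approach, the percentile bootstrap, follows. The function name `bootstrap_accuracy_ci`, its parameters, and the sample data are hypothetical and chosen for illustration only; they are not taken from the paper, and the authors' resampling scheme may differ.

```python
import numpy as np

def bootstrap_accuracy_ci(hits, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap mean accuracy and (1 - alpha) confidence interval.

    hits: 1-D sequence of 0/1 flags, one per question
          (1 = the correct document was retrieved). Assumed input format.
    """
    hits = np.asarray(hits, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample questions with replacement; each row is one bootstrap replicate.
    idx = rng.integers(0, hits.size, size=(n_resamples, hits.size))
    accs = hits[idx].mean(axis=1)
    # Take the empirical quantiles of the replicate accuracies as the interval.
    lower, upper = np.quantile(accs, [alpha / 2, 1 - alpha / 2])
    return accs.mean(), lower, upper

# Hypothetical per-question retrieval outcomes, for illustration only.
hits = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
mean_acc, ci_low, ci_high = bootstrap_accuracy_ci(hits)
print(f"accuracy={mean_acc:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

Under this sketch, a tighter interval corresponds to a smaller gap between `ci_low` and `ci_high` for the same set of questions, which is one way the comparison between embedding variants could be read.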
