Utilizing Embeddings for Ad-hoc Retrieval by Document-to-document Similarity
Chenhao Yang, Ben He, Yanhua Ran

TL;DR
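This paper proposes ranking documents for ad-hoc retrieval by document-to-document (D2D) similarity of embeddings rather than document-to-query (D2Q) similarity. Comparing document content against document content is meant to avoid the "multiple degrees of similarity" problem that affects D2Q matching, and the approach is reported to outperform strong baselines on TREC test collections.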
Contribution
The paper proposes a semantic relevance score (SEM) based on document-to-document (D2D) similarity of embeddings, intended to improve retrieval effectiveness over the common document-to-query (D2Q) similarity approach.
Findings
Outperforms strong baselines on TREC test collections
Addresses the problem of multiple degrees of similarity in embeddings
Demonstrates the effectiveness of content-based similarity in IR
Abstract
Latent semantic representations of words or paragraphs, namely the embeddings, have been widely applied to information retrieval (IR). One of the common approaches of utilizing embeddings for IR is to estimate the document-to-query (D2Q) similarity in their embeddings. As words with similar syntactic usage are usually very close to each other in the embeddings space, although they are not semantically similar, the D2Q similarity approach may suffer from the problem of "multiple degrees of similarity". To this end, this paper proposes a novel approach that estimates a semantic relevance score (SEM) based on document-to-document (D2D) similarity of embeddings. As Word or Para2Vec generates embeddings by the context of words/paragraphs, the D2D similarity approach turns the task of document ranking into the estimation of similarity between content within different documents. Experimental…
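The contrast between D2Q and D2D scoring described in the abstract can be illustrated with a minimal sketch. The abstract excerpt does not say how SEM is actually computed, so everything below is an assumption made for illustration: the function names (`cosine`, `d2q_score`, `d2d_score`, `rank_by_d2d`) are hypothetical, documents are assumed to already have fixed-length embedding vectors, and the D2D score is taken to be the mean cosine similarity between a candidate document and a small set of reference documents (for example, top-ranked results from an initial retrieval run). It should not be read as the authors' method.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def d2q_score(doc_vec, query_vec):
    """D2Q: compare a document embedding directly with the query embedding."""
    return cosine(doc_vec, query_vec)


def d2d_score(doc_vec, reference_vecs):
    """D2D (illustrative assumption): mean cosine similarity between a
    candidate document and a set of reference document embeddings."""
    if len(reference_vecs) == 0:
        return 0.0
    return float(np.mean([cosine(doc_vec, ref) for ref in reference_vecs]))


def rank_by_d2d(doc_vecs, reference_vecs):
    """Return document ids sorted by descending D2D score.

    doc_vecs maps a document id to its embedding vector.
    """
    scores = {doc_id: d2d_score(vec, reference_vecs)
              for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

With toy two-dimensional vectors, `rank_by_d2d({"a": [1, 0], "b": [0, 1]}, [[1, 0], [1, 1]])` places `"a"` first, because it lies closer on average to the reference documents. In practice the embeddings would come from a model such as Word2Vec or Para2Vec, and how the reference set is chosen is a design decision that this sketch leaves open.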
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Information Retrieval and Search Behavior
