Extracting Document Relations from Search Corpus by Marginalizing over User Queries
Yuki Iwamoto, Kaoru Tsunoda, Ken Kaneiwa

TL;DR
This paper introduces EDR-MQ, a novel framework that discovers document relationships by analyzing co-occurrence patterns across diverse user queries, eliminating the need for manual annotations or predefined taxonomies.
Contribution
The paper presents a query marginalization approach, together with the MC-RAG method that enables it, for estimating document relationships from search results without labeled data.
Findings
Successfully identifies topical clusters and evidence chains.
Reveals cross-domain connections not found by traditional methods.
Adapts to different user perspectives and information needs.
Abstract
Understanding relationships between documents in large-scale corpora is essential for knowledge discovery and information organization. However, existing approaches rely heavily on manual annotation or predefined relationship taxonomies. We propose EDR-MQ (Extracting Document Relations by Marginalizing over User Queries), a novel framework that discovers document relationships through query marginalization. EDR-MQ is based on the insight that strongly related documents often co-occur in results across diverse user queries, enabling us to estimate joint probabilities between document pairs by marginalizing over a collection of queries. To enable this query marginalization approach, we develop Multiply Conditioned Retrieval-Augmented Generation (MC-RAG), which employs conditional retrieval where subsequent document retrievals depend on previously retrieved content. By observing…
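The marginalization idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name `estimate_joint`, the input layout, and the uniform weighting of queries are all assumptions made for illustration. It assumes that, for each query q, a conditional retriever supplies p(d1 | q) for a first document and p(d2 | q, d1) for a second document given the first, and it approximates the joint probability as the average over queries of p(d1 | q) * p(d2 | q, d1).

```python
from collections import defaultdict


def estimate_joint(queries):
    """Approximate p(d1, d2) by averaging over a collection of queries.

    `queries` is a list of (first_stage, second_stage) pairs, one per query:
      first_stage:  {d1: p(d1 | q)}
      second_stage: {d1: {d2: p(d2 | q, d1)}}
    Both the layout and the uniform query prior are illustrative assumptions.
    """
    if not queries:
        raise ValueError("at least one query is required")

    joint = defaultdict(float)
    query_weight = 1.0 / len(queries)  # assumed uniform p(q)

    for first_stage, second_stage in queries:
        for d1, p_first in first_stage.items():
            # Second-stage retrieval is conditioned on the first document.
            for d2, p_second in second_stage.get(d1, {}).items():
                joint[(d1, d2)] += query_weight * p_first * p_second

    return dict(joint)


# Toy retrieval scores for two hypothetical queries over documents A, B, C.
example_queries = [
    (
        {"A": 0.5, "B": 0.5},
        {"A": {"B": 1.0}, "B": {"A": 0.5, "C": 0.5}},
    ),
    (
        {"A": 1.0},
        {"A": {"B": 0.5, "C": 0.5}},
    ),
]

joint = estimate_joint(example_queries)

# Rank ordered document pairs by estimated joint probability.
for pair, prob in sorted(joint.items(), key=lambda item: -item[1]):
    print(pair, prob)
```

In this toy data the pair (A, B) receives the highest score because B is retrieved after A under both queries, which mirrors the intuition that strongly related documents co-occur across diverse queries.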
Taxonomy
Topics: Web Data Mining and Analysis · Semantic Web and Ontologies · Advanced Text Analysis Techniques
