MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels
Qi Chen, Xiubo Geng, Corby Rosset, Carolyn Buractaon, Jingwen Lu, Tao, Shen, Kun Zhou, Chenyan Xiong, Yeyun Gong, Paul Bennett, Nick Craswell, Xing, Xie, Fan Yang, Bryan Tower, Nikhil Rao, Anlei Dong, Wenqi Jiang, Zheng Liu,, Mingqin Li, Chuanjie Liu, Zengzhong Li

TL;DR
MS MARCO Web Search is a large-scale, real-world web dataset with millions of click labels, designed to advance research in neural indexing, embedding models, and information retrieval systems.
Contribution
It introduces the first large-scale, information-rich web dataset with real click labels, enabling new research directions in AI and information retrieval.
Findings
Provides a comprehensive retrieval benchmark with three challenge tasks.
Enables research in neural indexers, embeddings, and large language models.
Fosters advancements in AI and system research using real-world web data.
Abstract
Recent breakthroughs in large models have highlighted the critical significance of data scale, labels and modals. In this paper, we introduce MS MARCO Web Search, the first large-scale information-rich web dataset, featuring millions of real clicked query-document labels. This dataset closely mimics real-world web document and query distribution, provides rich information for various kinds of downstream tasks and encourages research in various areas, such as generic end-to-end neural indexer models, generic embedding models, and next generation information access system with large language models. MS MARCO Web Search offers a retrieval benchmark with three web retrieval challenge tasks that demand innovations in both machine learning and information retrieval system research domains. As the first dataset that meets large, real and rich data requirements, MS MARCO Web Search paves the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
