BigDataBench: a Big Data Benchmark Suite from Web Search Engines
Wanling Gao, Yuqing Zhu, Zhen Jia, Chunjie Luo, Lei Wang, Zhiguo Li, Jianfeng Zhan, Yong Qi, Yongqiang He, Shiming Gong, Xiaona Li, Shujie Zhang, and Bizhu Qiu

TL;DR
This paper introduces BigDataBench, a comprehensive big data benchmark suite based on search engine workloads, developed through industry collaboration, open data, and innovative data generation techniques to evaluate big data systems.
Contribution
The paper presents the first big data benchmark suite from search engines, including a scalable data generation method and case studies demonstrating its application.
Findings
Created a scalable, realistic big data benchmark suite from search engine workloads.
Developed a novel data generation tool preserving data semantics and locality.
Conducted case studies validating the benchmark's utility in system and architecture research.
Abstract
This paper presents our joint research efforts on big data benchmarking with several industrial partners. Considering the complexity, diversity, workload churn, and rapid evolution of big data systems, we take an incremental approach to big data benchmarking. As a first step, we focus on search engines, which are the most important domain in Internet services in terms of the number of page views and daily visitors. However, search engine service providers treat data, applications, and web access logs as confidential business information, which prevents us from building benchmarks. To overcome those difficulties, with several industry partners, we widely investigated the open source solutions in search engines, and obtained permission to use anonymous Web access logs. Moreover, after two years of effort, we created a semantic search engine named ProfSearch (available from…
Taxonomy
Topics: Advanced Database Systems and Queries · Data Management and Algorithms · Cloud Computing and Resource Management
