Real-time Log Query Interface for large datasets using Apache Spark

Sandeep Sandha; Xin Xu; Yue Xin; Zhehan Li

arXiv:1709.08001·cs.DC·September 26, 2017

Open Access

TL;DR

This paper presents a web-based log query interface that uses Apache Spark for fast, interactive querying of large datasets, cutting response times on 6GB of data from more than 60 seconds with MySQL to under 1 second.

Contribution

It introduces a real-time, web-based log query system that uses server-side Spark clusters to process large datasets in parallel, improving response times over a traditional MySQL database.

Findings

01. Query response time on 6GB datasets is under 1 second.

02. Traditional MySQL queries take over 60 seconds on the same data.

03. System supports complex SQL queries including joins on large tables.

Abstract

Log Query Interface is an interactive web application that lets users query the very large data logs of MobileInsight easily and efficiently. With this interface, users no longer need to talk to the database through command-line queries, nor install the MobileInsight client locally to fetch data. Users simply select or type a query in our web-based system, which queries the database efficiently and returns the result. In tests on 6GB of data, our system responds in less than 1 second, while similar queries on a traditional MySQL database take more than 60 seconds. Users can execute any query in SQL, including complex join operations on very large tables. The query response time is greatly improved by the server-side Spark clusters, which store the big datasets in a distributed…
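
As a rough illustration of the workflow the abstract describes (SQL text in, result rows out, with Spark doing the work), the following PySpark sketch registers two small tables as SQL views and runs a join over them. The table names, columns, and rows are hypothetical and are not MobileInsight's actual schema; the call to `cache()` assumes the data is kept in memory between queries, which the truncated abstract does not confirm; and `local[*]` stands in for a real cluster address. It is a sketch under those assumptions, not the authors' implementation.

```python
def register_cached_view(spark, rows, columns, view_name):
    """Build a DataFrame, mark it for in-memory caching, and expose it to SQL."""
    df = spark.createDataFrame(rows, columns)
    df.cache()  # assumption: keep the table in memory for repeated queries
    df.createOrReplaceTempView(view_name)
    return df


def run_query(spark, sql):
    """Run a user-supplied SQL string and return the rows as plain dicts."""
    return [row.asDict() for row in spark.sql(sql).collect()]


def demo():
    """End-to-end example; needs pyspark and a Java runtime installed."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")  # stand-in for a real cluster URL
        .appName("log-query-sketch")
        .getOrCreate()
    )
    try:
        # Hypothetical tables and values, not MobileInsight's real schema.
        register_cached_view(
            spark,
            [(1, "type_a", 100), (1, "type_b", 105), (2, "type_a", 110)],
            ["device_id", "msg_type", "ts"],
            "messages",
        )
        register_cached_view(
            spark,
            [(1, "carrier_a"), (2, "carrier_b")],
            ["device_id", "carrier"],
            "devices",
        )
        sql = """
            SELECT d.carrier, COUNT(*) AS n_messages
            FROM messages m
            JOIN devices d ON m.device_id = d.device_id
            GROUP BY d.carrier
        """
        return run_query(spark, sql)
    finally:
        spark.stop()
```

Calling `demo()` on a machine with PySpark and Java available returns one dict per carrier. A web front end along the lines the abstract describes would pass the user's query text to something like `run_query` and render the rows it returns.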

Taxonomy

Topics: Green IT and Sustainability · Caching and Content Delivery · Distributed systems and fault tolerance