DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling

Shicheng Liu; Yucheng Jiang; Sajid Farook; Camila Nicollier Sanchez; David Fernando Castro Pena; Monica S. Lam

arXiv:2604.06474 · cs.CL · April 9, 2026

TL;DR

DataSTORM is an LLM-based system that autonomously conducts deep research over large-scale structured databases and internet sources, emphasizing hypothesis generation and analytical narratives.

Contribution

DataSTORM introduces an LLM-based agentic framework for research over structured data, grounded in Exploratory Data Analysis and Data Storytelling.

Findings

01. Achieves a 19.4% improvement in insight-level recall on InsightBench.

02. Outperforms proprietary systems such as ChatGPT Deep Research.

03. Demonstrates effectiveness on complex real-world databases.

Abstract

Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis. However, existing approaches primarily focus on unstructured web data, while the challenges of conducting deep research over large-scale structured databases remain relatively underexplored. Unlike web-based research, effective data-centric research requires more than retrieval and summarization and demands iterative hypothesis generation, quantitative reasoning over structured schemas, and convergence toward a coherent analytical narrative. In this paper, we present DataSTORM, an LLM-based agentic system capable of autonomously conducting research across both large-scale structured databases and internet sources. Grounded in principles from Exploratory Data Analysis and Data Storytelling, DataSTORM reframes deep research over…
