jXBW: Fast Substructure Search for Large-Scale JSONL Datasets with LLM Applications

Yasuo Tabei

arXiv:2508.12536·cs.DB·September 19, 2025

Open Access

TL;DR

jXBW is a compressed index for fast, scalable substructure search over large JSONL datasets, with applications including LLM prompt collections, chemical compound records, and geospatial analytics.

Contribution

The paper introduces jXBW, a new compressed index with innovative tree representation and search algorithms, enabling efficient substructure search in large JSONL datasets.

Findings

01 Achieves up to 4,700× speedup over tree-based methods.

02 Over 6 million times faster than XML-based approaches.

03 Enables practical large-scale JSONL substructure search.

Abstract

JSON Lines (JSONL) is widely used for managing large collections of semi-structured data, ranging from large language model (LLM) prompts to chemical compound records and geospatial datasets. A key operation is substructure search, which identifies all JSON objects containing a query pattern. This task underpins applications such as drug discovery (querying compounds for functional groups), prompt engineering (extracting prompts with schema fragments), and geospatial analytics (finding entities with nested attributes). However, existing methods are inefficient: traversal requires exhaustive tree matching, succinct JSON representations save space but do not accelerate search, and XML-based approaches incur conversion overhead and semantic mismatches. We present jXBW, a compressed index for efficient substructure search over JSONL. jXBW introduces three innovations: (i) a merged tree…
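To make the substructure search operation concrete, the following is a minimal sketch of the naive traversal baseline that the abstract describes as requiring exhaustive tree matching. It is not jXBW and does not reflect the paper's index or algorithms. The containment semantics are an assumption chosen for illustration: a query object matches if each of its keys appears with a matching value, a query array matches if each of its elements matches some element of the data array, and a match may occur at any depth. The function names `matches_at`, `contains`, and `substructure_search` are hypothetical.

```python
import json


def matches_at(node, query):
    """Return True if `query` is contained in `node`, both rooted at the same point.

    Assumed semantics (illustrative only, not taken from the paper):
    - dict query: every key must exist in the node with a matching value
    - list query: every query element must match some element of the node
    - scalar query: type and value must be equal
    """
    if isinstance(query, dict):
        return isinstance(node, dict) and all(
            key in node and matches_at(node[key], value)
            for key, value in query.items()
        )
    if isinstance(query, list):
        return isinstance(node, list) and all(
            any(matches_at(item, q) for item in node) for q in query
        )
    return type(node) is type(query) and node == query


def contains(node, query):
    """Return True if `query` matches at `node` or anywhere beneath it."""
    if matches_at(node, query):
        return True
    if isinstance(node, dict):
        return any(contains(child, query) for child in node.values())
    if isinstance(node, list):
        return any(contains(child, query) for child in node)
    return False


def substructure_search(jsonl_lines, query):
    """Scan every JSONL record and return the indices of records containing `query`."""
    hits = []
    for index, line in enumerate(jsonl_lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if contains(json.loads(line), query):
            hits.append(index)
    return hits


# Made-up records, for illustration only.
records = [
    '{"name": "aspirin", "props": {"weight": 180, "groups": ["ester", "carboxyl"]}}',
    '{"name": "ethanol", "props": {"weight": 46, "groups": ["hydroxyl"]}}',
    '{"name": "note", "nested": {"compound": {"groups": ["ester"]}}}',
]
query = {"groups": ["ester"]}
print(substructure_search(records, query))
```

Every record is parsed and walked in full for each query, so the cost grows with the total size of the dataset. That per-query exhaustive scan is the inefficiency a precomputed index is meant to avoid.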

Taxonomy

Topics: Data Quality and Management · Advanced Database Systems and Queries · Computational Drug Discovery Methods