jXBW: Fast Substructure Search for Large-Scale JSONL Datasets with LLM Applications
Yasuo Tabei

TL;DR
jXBW is a compressed index that enables fast, scalable substructure search over large JSONL datasets, with applications ranging from LLM prompt collections to chemical compound records and geospatial data.
Contribution
The paper introduces jXBW, a compressed index that pairs a merged tree representation with dedicated search algorithms to support efficient substructure search over large JSONL datasets.
Findings
Achieves up to 4,700× speedup over tree-based methods.
More than 6,000,000× faster than XML-based approaches.
Enables practical large-scale JSONL substructure search.
Abstract
JSON Lines (JSONL) is widely used for managing large collections of semi-structured data, ranging from large language model (LLM) prompts to chemical compound records and geospatial datasets. A key operation is substructure search, which identifies all JSON objects containing a query pattern. This task underpins applications such as drug discovery (querying compounds for functional groups), prompt engineering (extracting prompts with schema fragments), and geospatial analytics (finding entities with nested attributes). However, existing methods are inefficient: traversal requires exhaustive tree matching, succinct JSON representations save space but do not accelerate search, and XML-based approaches incur conversion overhead and semantic mismatches. We present jXBW, a compressed index for efficient substructure search over JSONL. jXBW introduces three innovations: (i) a merged tree…
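To make the task concrete, the following is a minimal Python sketch of the naive baseline the abstract describes as inefficient: parse every line and exhaustively match the query against every subtree. It is not jXBW and not the paper's algorithm. The matching semantics are an assumption made for illustration (a query object matches a node that has all of its keys with matching values, a query array matches when each of its elements matches some element of the node array, and scalars must be equal), and the function names and sample records are hypothetical.

```python
import json


def matches(node, query):
    """Return True if `query` matches the subtree rooted exactly at `node`."""
    if isinstance(query, dict):
        # Assumed semantics: every query key must exist with a matching value.
        return isinstance(node, dict) and all(
            key in node and matches(node[key], value)
            for key, value in query.items()
        )
    if isinstance(query, list):
        # Assumed semantics: each query element matches some node element.
        return isinstance(node, list) and all(
            any(matches(item, q) for item in node) for q in query
        )
    # Scalars: require the same type so that True does not match 1.
    return type(node) is type(query) and node == query


def contains(node, query):
    """Return True if `query` matches at `node` or at any descendant."""
    if matches(node, query):
        return True
    if isinstance(node, dict):
        return any(contains(child, query) for child in node.values())
    if isinstance(node, list):
        return any(contains(child, query) for child in node)
    return False


def substructure_search(jsonl_lines, query):
    """Return the 0-based indices of the lines whose object contains `query`."""
    hits = []
    for index, line in enumerate(jsonl_lines):
        if line.strip() and contains(json.loads(line), query):
            hits.append(index)
    return hits


# Hypothetical sample records, one JSON object per line.
records = [
    '{"id": 1, "compound": {"name": "ethanol", "groups": ["hydroxyl"]}}',
    '{"id": 2, "compound": {"name": "benzene", "groups": []}}',
    '{"id": 3, "meta": {"compound": {"groups": ["hydroxyl", "methyl"]}}}',
]
query = {"groups": ["hydroxyl"]}
print(substructure_search(records, query))  # [0, 2]
```

Every query parses and walks every record in full, so the cost grows with the total size of the dataset. That per-query cost is what an index is meant to avoid.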
Taxonomy
Topics: Data Quality and Management · Advanced Database Systems and Queries · Computational Drug Discovery Methods
