Octopus: A Lightweight Entity-Aware System for Multi-Table Data Discovery and Cell-Level Retrieval
Wen-Zhi Li, Sainyam Galhotra

TL;DR
Octopus is a lightweight, entity-aware system that improves multi-table data discovery and cell-level retrieval by using an LLM parser for entity identification and a compact index, avoiding heavy offline preprocessing.
Contribution
Octopus introduces a training-free, entity-aware approach to multi-table data discovery and cell-level retrieval that outperforms existing systems in both accuracy and efficiency.
Findings
Outperforms existing systems in multi-table discovery tasks.
Achieves lower computational and token costs than existing systems.
Supports both independent and join-based discovery.
Abstract
Tabular data constitute a dominant form of information in modern data lakes and repositories, yet discovering the relevant tables to answer user questions remains challenging. Existing data discovery systems assume that each question can be answered by a single table and often rely on resource-intensive offline preprocessing, such as model training or large-scale content indexing. In practice, however, many questions require information spread across multiple tables -- either independently or through joins -- and users often seek specific cell values rather than entire tables. In this paper, we present Octopus, a lightweight, entity-aware, and training-free system for multi-table data discovery and cell-level value retrieval. Instead of embedding entire questions, Octopus identifies fine-grained entities (column mentions and value mentions) from natural-language queries using an LLM…
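The abstract is truncated here, so the following is only a minimal sketch of the general idea it describes: matching column mentions extracted from a question against a compact index of column names, rather than embedding the whole question. It is not the authors' implementation. The schemas, the function names (`normalize`, `build_column_index`, `discover_tables`), the scoring rule, and the hand-written mentions standing in for the LLM parser's output are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical schemas: table name -> column names (illustrative only).
SCHEMAS = {
    "employees": ["emp_id", "name", "dept_id"],
    "departments": ["dept_id", "dept_name", "budget"],
    "products": ["sku", "price"],
}


def normalize(name):
    """Lowercase and treat underscores as spaces so 'dept_id' matches 'dept id'."""
    return " ".join(name.lower().replace("_", " ").split())


def build_column_index(schemas):
    """Build a compact index from normalized column name to the tables containing it."""
    index = defaultdict(set)
    for table, columns in schemas.items():
        for column in columns:
            index[normalize(column)].add(table)
    return index


def discover_tables(column_mentions, index):
    """Rank tables by how many column mentions they cover (assumed scoring rule)."""
    scores = defaultdict(int)
    for mention in column_mentions:
        for table in index.get(normalize(mention), ()):
            scores[table] += 1
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Stand-in for the LLM parser's output on a question such as
# "What is the budget of the department where Alice works?"
column_mentions = ["name", "budget", "dept id"]
value_mentions = ["Alice"]  # would drive cell-level lookup; unused in this sketch

ranked = discover_tables(column_mentions, build_column_index(SCHEMAS))
print(ranked)
```

In this toy example both `employees` and `departments` cover two mentions and share `dept_id`, which is the kind of signal a join-based discovery step could use, while `products` matches nothing and is never returned.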
Taxonomy
Topics: Data Quality and Management · Web Data Mining and Analysis · Advanced Database Systems and Queries
