DW-Bench: Benchmarking LLMs on Data Warehouse Graph Topology Reasoning

Ahmed G.A.H Ahmed; C. Okan Sakar

arXiv:2604.18964 · cs.AI · April 22, 2026

TL;DR

DW-Bench is a new benchmark for evaluating LLMs on graph-topology reasoning in data warehouses, focusing on foreign-key and data-lineage edges, with experiments highlighting the benefits of tool augmentation.

Contribution

Introduces DW-Bench, a benchmark for LLM reasoning over data warehouse schemas, comprising 1,046 automatically generated, verifiably correct questions across five schemas, together with an analysis of tool-augmented methods.

Findings

01 Tool-augmented methods outperform static approaches.

02 Performance plateaus on hard compositional subtypes.

03 Benchmark includes 1,046 questions across five schemas.

Abstract

This paper introduces DW-Bench, a new benchmark that evaluates large language models (LLMs) on graph-topology reasoning over data warehouse schemas, explicitly integrating both foreign-key (FK) and data-lineage edges. The benchmark comprises 1,046 automatically generated, verifiably correct questions across five schemas. Experiments show that tool-augmented methods substantially outperform static approaches but plateau on hard compositional subtypes.
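
To make the abstract's notion of graph-topology reasoning concrete, here is a minimal Python sketch of the kind of question such a benchmark can pose and verify automatically. It is an illustration only, not the authors' generator: the toy schema, the table names, and the helper functions `build_adjacency`, `upstream_sources`, and `fk_hops` are all hypothetical. It assumes a schema can be modeled as a directed graph whose edges are labeled either `fk` (foreign key) or `lineage` (data flow), so that a ground-truth answer can be computed by plain graph traversal.

```python
from collections import deque

# Hypothetical toy schema (not from DW-Bench): (source, target, edge_type).
# "fk" edges point from the referencing table to the referenced table;
# "lineage" edges point from an upstream table to the table derived from it.
EDGES = [
    ("fact_sales", "dim_customer", "fk"),
    ("fact_sales", "dim_product", "fk"),
    ("dim_product", "dim_category", "fk"),
    ("raw_orders", "stg_orders", "lineage"),
    ("stg_orders", "fact_sales", "lineage"),
]


def build_adjacency(edges, edge_type, reverse=False):
    """Adjacency map restricted to one edge type, optionally reversed."""
    adj = {}
    for src, dst, kind in edges:
        if kind != edge_type:
            continue
        a, b = (dst, src) if reverse else (src, dst)
        adj.setdefault(a, set()).add(b)
    return adj


def upstream_sources(table):
    """All tables that feed `table` through one or more lineage edges."""
    adj = build_adjacency(EDGES, "lineage", reverse=True)
    seen, queue = set(), deque([table])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def fk_hops(src, dst):
    """Fewest FK edges to follow from `src` to `dst`, or None if unreachable."""
    adj = build_adjacency(EDGES, "fk")
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None
```

Because each answer comes from a deterministic traversal, a question such as "which tables are upstream of `fact_sales`?" or "how many FK hops separate `fact_sales` from `dim_category`?" has a single checkable answer. A question that mixes both edge types would combine the two traversals, which may be one plausible reading of the "compositional" subtypes the abstract mentions.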
