TensorBank: Tensor Lakehouse for Foundation Model Training
Romeo Kienzler, Leonardo Pondian Tizzei, Benedikt Blumenstiel, Zoltan Arnold Nagy, S. Karthik Mukkavilli, Johannes Schmude, Marcus Freitag, Michael Behrendt, Daniel Salles Civitarese, Naomi Simumba, Daiki Kimura, Hendrik Hamann

TL;DR
TensorBank introduces a scalable tensor lakehouse architecture that enables efficient streaming and querying of high-dimensional data for foundation model training, leveraging hierarchical indices and open standards.
Contribution
The paper presents TensorBank, a novel petabyte-scale tensor lakehouse architecture that streamlines data streaming, querying, and transformation for foundation models using hierarchical statistical indices.
Findings
Supports wire-speed tensor streaming from cloud storage to GPU memory.
Enables efficient querying with hierarchical statistical indices to skip irrelevant data.
Generalizes to various data types beyond geospatial-temporal data.
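The index-based skipping described in the findings can be illustrated with a small sketch. It is not the paper's implementation: it assumes each index node stores only a minimum and maximum of the values beneath it, and the names `IndexNode` and `blocks_to_read` are hypothetical. The point is that a query range that does not overlap a node's statistics rules out the whole subtree without touching any data block.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IndexNode:
    """One node of a hypothetical hierarchical statistical index."""
    lo: float                      # minimum value stored beneath this node
    hi: float                      # maximum value stored beneath this node
    block_id: int = -1             # set on leaf nodes only
    children: List["IndexNode"] = field(default_factory=list)


def blocks_to_read(node: IndexNode, q_lo: float, q_hi: float) -> List[int]:
    """Return ids of leaf blocks whose [lo, hi] range overlaps [q_lo, q_hi]."""
    if node.hi < q_lo or node.lo > q_hi:
        return []                  # statistics rule out the entire subtree
    if not node.children:
        return [node.block_id]     # leaf: this block may hold matching data
    selected: List[int] = []
    for child in node.children:
        selected.extend(blocks_to_read(child, q_lo, q_hi))
    return selected


# Assumed two-level index over four blocks (values are made up).
root = IndexNode(0, 100, children=[
    IndexNode(0, 40, children=[
        IndexNode(0, 20, block_id=0),
        IndexNode(21, 40, block_id=1),
    ]),
    IndexNode(41, 100, children=[
        IndexNode(41, 70, block_id=2),
        IndexNode(71, 100, block_id=3),
    ]),
])

selected_blocks = blocks_to_read(root, 30, 50)
```

With the query range 30 to 50, only blocks 1 and 2 are selected; blocks 0 and 3 are never read.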
Abstract
Storing and streaming high-dimensional data for foundation model training has become a critical requirement with the rise of foundation models beyond natural language. In this paper we introduce TensorBank, a petabyte-scale tensor lakehouse capable of streaming tensors from Cloud Object Store (COS) to GPU memory at wire speed based on complex relational queries. We use Hierarchical Statistical Indices (HSI) for query acceleration. Our architecture allows tensors to be addressed directly at block level using HTTP range reads. Once in GPU memory, data can be transformed using PyTorch transforms. We provide a generic PyTorch dataset type with a corresponding dataset factory that translates relational queries and requested transformations into a dataset instance. By making use of the HSI, irrelevant blocks can be skipped without reading them, as those indices contain statistics on their content at different…
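Block-level addressing via HTTP range reads, as mentioned in the abstract, can be sketched as follows. This assumes fixed-size blocks laid out contiguously in one object, which the abstract does not confirm. The helper names and the URL are hypothetical, and the request is only constructed, not sent.

```python
import urllib.request


def range_header(block_index: int, block_bytes: int) -> dict:
    """Build the Range header for one fixed-size block (assumed layout)."""
    start = block_index * block_bytes
    end = start + block_bytes - 1  # HTTP byte ranges are inclusive on both ends
    return {"Range": f"bytes={start}-{end}"}


def build_request(url: str, block_index: int, block_bytes: int) -> urllib.request.Request:
    """Prepare (but do not send) a GET request for a single block."""
    return urllib.request.Request(url, headers=range_header(block_index, block_bytes))


# Hypothetical object URL; block 2 with 1024-byte blocks covers bytes 2048-3071.
request = build_request("https://example.invalid/tensor.bin", 2, 1024)
```

A server that honors the header answers with status 206 (Partial Content) and only the requested bytes, so a block can be fetched without downloading the rest of the object.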
Taxonomy
Topics: Computational Physics and Python Applications · Tensor decomposition and applications · Traffic Prediction and Management Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
