DataJoint 2.0: A Computational Substrate for Agentic Scientific Workflows
Dimitri Yatsenko, Thinh T. Nguyen (DataJoint Inc., Houston, USA)

TL;DR
DataJoint 2.0 introduces a relational workflow model that unifies data structure, provenance, and computational dependencies, enabling reliable, agentic scientific workflows with transactional guarantees and extensibility.
Contribution
The paper presents a formal system for scientific workflows that integrates data structure, computational dependencies, and integrity constraints, and extends it with object storage, semantic matching, and distributed coordination.
Findings
Unified schema for data and dependencies
Enhanced data integrity and provenance tracking
Scalable, extensible workflow management
Abstract
Operational rigor determines whether human-agent collaboration succeeds or fails. Scientific data pipelines need the equivalent of DevOps -- SciOps -- yet common approaches fragment provenance across disconnected systems without transactional guarantees. DataJoint 2.0 addresses this gap through the relational workflow model: tables represent workflow steps, rows represent artifacts, foreign keys prescribe execution order. The schema specifies not only what data exists but how it is derived -- a single formal system where data structure, computational dependencies, and integrity constraints are all queryable, enforceable, and machine-readable. Four technical innovations extend this foundation: object-augmented schemas integrating relational metadata with scalable object storage, semantic matching using attribute lineage to prevent erroneous joins, an extensible type system for…
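The core idea in the abstract (tables as workflow steps, rows as artifacts, foreign keys prescribing execution order) can be illustrated with a minimal sketch. It deliberately does not use the DataJoint API; it relies only on Python's standard library (sqlite3 and graphlib), and the table names session, recording, and spike_sorting, as well as the helper execution_order, are hypothetical names chosen for illustration.

```python
import sqlite3
from graphlib import TopologicalSorter

# Hypothetical three-step pipeline; each table stands for one workflow step.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE session (session_id INTEGER PRIMARY KEY);
CREATE TABLE recording (
    session_id INTEGER PRIMARY KEY REFERENCES session(session_id)
);
CREATE TABLE spike_sorting (
    session_id INTEGER PRIMARY KEY REFERENCES recording(session_id)
);
""")


def execution_order(conn):
    """Derive a valid step order from the schema's foreign keys alone."""
    tables = [
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    ]
    # Column 2 of PRAGMA foreign_key_list is the referenced (parent) table.
    deps = {
        table: {fk[2] for fk in conn.execute(f"PRAGMA foreign_key_list({table})")}
        for table in tables
    }
    # Parents must be populated before children, so sort topologically.
    return list(TopologicalSorter(deps).static_order())


print(execution_order(conn))

# A row (artifact) cannot exist without its upstream artifact.
try:
    conn.execute("INSERT INTO recording VALUES (1)")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```

In this sketch the execution order is read back from the schema itself rather than from a separate workflow definition, which is the property the abstract describes as data structure and computational dependencies living in one queryable system.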
Taxonomy
Topics: Scientific Computing and Data Management · Semantic Web and Ontologies · Research Data Management Practices
