TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes

Chao Zhang; Shaolei Zhang; Quehuan Liu; Sibei Chen; Tong Li; Ju Fan

arXiv:2505.11270·cs.DB·May 19, 2025

Open Access

TL;DR

This paper introduces TAIJI, a multi-modal data analytics system that uses a Model Context Protocol (MCP) architecture, specialized foundation models, and knowledge-updating techniques to improve the accuracy, efficiency, and freshness of analytics over data lakes.

Contribution

It proposes a novel MCP-based architecture with semantic operators, an AI-agent NL2Operator translator, and a data updating mechanism for multi-modal data analytics.
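
The idea of routing semantic operators to modality-specific models can be illustrated with a minimal sketch. Everything named here (SemanticOperator, table_server, text_server, SERVERS, run_plan) is hypothetical and chosen for illustration only; it is not TAIJI's actual interface, and the string-returning functions merely stand in for specialized model servers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SemanticOperator:
    """Hypothetical operator, as a translator might emit from an NL query."""
    name: str       # e.g. "filter", "summarize"
    modality: str   # which kind of data the operator targets
    argument: str   # operator-specific payload


# Stand-ins for modality-specific model servers (assumed, not from the paper).
def table_server(op: SemanticOperator) -> str:
    return f"table-model handled {op.name}({op.argument})"


def text_server(op: SemanticOperator) -> str:
    return f"text-model handled {op.name}({op.argument})"


# Registry mapping each modality to the server responsible for it.
SERVERS: Dict[str, Callable[[SemanticOperator], str]] = {
    "table": table_server,
    "text": text_server,
}


def run_plan(plan: List[SemanticOperator]) -> List[str]:
    """Dispatch each operator to the server registered for its modality."""
    results = []
    for op in plan:
        if op.modality not in SERVERS:
            raise ValueError(f"no server registered for modality {op.modality!r}")
        results.append(SERVERS[op.modality](op))
    return results


plan = [
    SemanticOperator("filter", "table", "year = 2024"),
    SemanticOperator("summarize", "text", "customer reviews"),
]
results = run_plan(plan)
for line in results:
    print(line)
```

In this sketch, supporting a new modality only requires registering another entry in SERVERS, which is the kind of modular deployment the summary describes.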

Findings

01. Enhanced accuracy and efficiency in multi-modal data processing
02. Scalable modular deployment of specialized foundation models
03. Effective data freshness management using machine unlearning techniques

Abstract

The variety of data in data lakes presents significant challenges for data analytics, as data scientists must simultaneously analyze multi-modal data, including structured, semi-structured, and unstructured data. While Large Language Models (LLMs) have demonstrated promising capabilities, they remain inadequate for multi-modal data analytics in terms of accuracy, efficiency, and freshness. First, current natural language (NL) or SQL-like query languages may struggle to precisely and comprehensively capture users' analytical intent. Second, relying on a single unified LLM to process diverse data modalities often leads to substantial inference overhead. Third, data stored in data lakes may be incomplete or outdated, making it essential to integrate external open-domain knowledge to generate timely and relevant analytics results. In this paper, we envision a new multi-modal data…

Taxonomy

Topics: Data Quality and Management · Semantic Web and Ontologies · Research Data Management Practices