TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes
Chao Zhang, Shaolei Zhang, Quehuan Liu, Sibei Chen, Tong Li, Ju Fan

TL;DR
This paper introduces TAIJI, a multi-modal data analytics system for data lakes. It combines a Model Context Protocol (MCP) architecture, specialized models per modality, and knowledge updating techniques to improve accuracy, efficiency, and data freshness.
Contribution
It proposes a novel MCP-based architecture with semantic operators, an AI-agent NL2Operator translator, and a data updating mechanism for multi-modal data analytics.
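To make the idea of semantic operators dispatched to modality-specific components more concrete, here is a minimal sketch in plain Python. It is not TAIJI's implementation and does not use any MCP library: the names `SemanticOperator`, `table_server`, `text_server`, `SERVERS`, and `run_plan` are hypothetical, the "servers" are ordinary functions standing in for specialized models, and the plan is written by hand rather than produced by an NL2Operator translator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SemanticOperator:
    """Hypothetical operator: what to do, on which modality, with which argument."""
    name: str
    modality: str
    argument: str


Handler = Callable[[SemanticOperator, List[dict]], List[dict]]


def table_server(op: SemanticOperator, rows: List[dict]) -> List[dict]:
    # Stand-in for a structured-data component: keep rows whose "year" matches.
    return [r for r in rows if str(r.get("year")) == op.argument]


def text_server(op: SemanticOperator, rows: List[dict]) -> List[dict]:
    # Stand-in for a text component: keep rows whose "text" mentions the keyword.
    return [r for r in rows if op.argument.lower() in r.get("text", "").lower()]


# Assumed registry: one handler per modality, so each operator reaches
# a specialized component instead of a single unified model.
SERVERS: Dict[str, Handler] = {"table": table_server, "text": text_server}


def run_plan(plan: List[SemanticOperator], rows: List[dict]) -> List[dict]:
    """Apply each operator in order, routing it by modality."""
    for op in plan:
        handler = SERVERS.get(op.modality)
        if handler is None:
            raise KeyError(f"no server registered for modality {op.modality!r}")
        rows = handler(op, rows)
    return rows


# Toy records mixing a structured field and free text (illustrative data only).
LAKE = [
    {"id": 1, "year": 2023, "text": "Quarterly revenue grew"},
    {"id": 2, "year": 2024, "text": "Revenue declined slightly"},
    {"id": 3, "year": 2024, "text": "New office opened"},
]

# Hand-written plan; in the paper this step is attributed to the NL2Operator translator.
PLAN = [
    SemanticOperator("filter", "table", "2024"),
    SemanticOperator("filter", "text", "revenue"),
]

RESULT = run_plan(PLAN, LAKE)
```

The registry is the design point of the sketch: adding a modality means registering one more handler, which mirrors the modular deployment the summary describes without claiming anything about how TAIJI wires its components.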
Findings
Enhanced accuracy and efficiency in multi-modal data processing
Scalable modular deployment of specialized foundation models
Effective data freshness management using machine unlearning techniques
Abstract
The variety of data in data lakes presents significant challenges for data analytics, as data scientists must simultaneously analyze multi-modal data, including structured, semi-structured, and unstructured data. While Large Language Models (LLMs) have demonstrated promising capabilities, they still remain inadequate for multi-modal data analytics in terms of accuracy, efficiency, and freshness. First, current natural language (NL) or SQL-like query languages may struggle to precisely and comprehensively capture users' analytical intent. Second, relying on a single unified LLM to process diverse data modalities often leads to substantial inference overhead. Third, data stored in data lakes may be incomplete or outdated, making it essential to integrate external open-domain knowledge to generate timely and relevant analytics results. In this paper, we envision a new multi-modal data…
Taxonomy
Topics: Data Quality and Management · Semantic Web and Ontologies · Research Data Management Practices
