DataCross: A Unified Benchmark and Agent Framework for Cross-Modal Heterogeneous Data Analysis
Ruyi Qi, Zhou Liu, Wentao Zhang

TL;DR
DataCross introduces a benchmark and agent framework for unified analysis across heterogeneous data sources: directly queryable structured data (e.g., SQL, CSV) and unstructured visual documents (e.g., scanned reports, invoice images). It targets a gap left by existing data analytics agents, which are predominantly limited to structured data.
Contribution
The paper presents a new benchmark, DataCrossBench, with realistic multi-domain tasks, and a specialized agent framework that uses expert sub-agents and a novel reReAct mechanism to improve cross-modal data analysis.
Findings
29.7% improvement in factuality over GPT-4o
Superior robustness on high-difficulty tasks
Effective activation of fragmented 'zombie data'
Abstract
In real-world data science and enterprise decision-making, critical information is often fragmented across directly queryable structured sources (e.g., SQL, CSV) and "zombie data" locked in unstructured visual documents (e.g., scanned reports, invoice images). Existing data analytics agents are predominantly limited to processing structured data, failing to activate and correlate this high-value visual information, thus creating a significant gap with industrial needs. To bridge this gap, we introduce DataCross, a novel benchmark and collaborative agent framework for unified, insight-driven analysis across heterogeneous data modalities. DataCrossBench comprises 200 end-to-end analysis tasks across finance, healthcare, and other domains. It is constructed via a human-in-the-loop reverse-synthesis pipeline, ensuring realistic complexity, cross-source dependency, and verifiable ground…
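To make the idea of correlating a structured source with "zombie data" concrete, the following is a minimal illustrative sketch, not the DataCross framework or its reReAct mechanism. It assumes a hypothetical ledger exported as CSV and a hypothetical list of fields already extracted from scanned invoices by some upstream OCR or vision step; all names (`LEDGER_CSV`, `EXTRACTED_INVOICES`, `load_ledger`, `cross_check`) and all values are invented for illustration.

```python
import csv
import io

# Hypothetical structured source: ledger rows as they might come from a CSV export.
LEDGER_CSV = """invoice_id,vendor,amount
INV-001,Acme,1200.00
INV-002,Globex,450.50
INV-003,Initech,980.00
"""

# Hypothetical output of an upstream OCR / vision step over scanned invoices.
# The values are hard-coded here; no real extraction is performed.
EXTRACTED_INVOICES = [
    {"invoice_id": "INV-001", "amount": "1200.00"},
    {"invoice_id": "INV-002", "amount": "405.50"},
    {"invoice_id": "INV-004", "amount": "75.00"},
]


def load_ledger(text):
    """Parse the CSV text into a mapping of invoice_id -> amount."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["invoice_id"]: float(row["amount"]) for row in reader}


def cross_check(ledger, extracted, tolerance=0.01):
    """Compare document-derived amounts against the ledger.

    Returns a list of (invoice_id, status) tuples, documents first,
    followed by ledger entries that have no matching document.
    """
    findings = []
    seen = set()
    for doc in extracted:
        invoice_id = doc["invoice_id"]
        seen.add(invoice_id)
        doc_amount = float(doc["amount"])
        if invoice_id not in ledger:
            findings.append((invoice_id, "missing_in_ledger"))
        elif abs(ledger[invoice_id] - doc_amount) > tolerance:
            findings.append((invoice_id, "amount_mismatch"))
        else:
            findings.append((invoice_id, "consistent"))
    for invoice_id in ledger:
        if invoice_id not in seen:
            findings.append((invoice_id, "no_document"))
    return findings


if __name__ == "__main__":
    for invoice_id, status in cross_check(load_ledger(LEDGER_CSV), EXTRACTED_INVOICES):
        print(invoice_id, status)
```

The point of the sketch is only that insight can depend on both sources at once: neither the ledger nor the scanned documents alone reveals the mismatch or the missing entries. A real system would also have to handle extraction errors and ambiguous keys, which this toy example ignores.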
Taxonomy
Topics: Data Quality and Management · Web Data Mining and Analysis · Handwritten Text Recognition Techniques
