CT2C-QA: Multimodal Question Answering over Chinese Text, Table and Chart
Bowen Zhao, Tianhao Cheng, Yuejie Zhang, Ying Cheng, Rui Feng, Xiaobo Zhang

TL;DR
This paper introduces CT2C-QA, a comprehensive Chinese multimodal QA dataset combining text, tables, and charts, along with a multi-agent system called AED for reasoning, highlighting current models' limitations in handling such complex data.
Contribution
The paper presents the first Chinese multimodal QA dataset with text, tables, and charts, and proposes a multi-agent reasoning system, AED, to improve analysis and decision-making.
Findings
Current models, including GPT-4, underperform on the dataset.
The AED system outperforms existing models in multimodal reasoning tasks.
The dataset effectively tests models' ability to analyze diverse data modalities.
Abstract
Multimodal Question Answering (MMQA) is crucial as it enables comprehensive understanding and accurate responses by integrating insights from diverse data representations such as tables, charts, and text. Most existing MMQA research focuses on only two modalities, such as image-text QA, table-text QA, and chart-text QA, and studies that jointly analyze text, tables, and charts remain notably scarce. In this paper, we present CT2C-QA, a pioneering Chinese reasoning-based QA dataset that includes an extensive collection of text, tables, and charts, meticulously compiled from 200 selectively sourced webpages. Our dataset simulates real webpages and serves as a great test for the capability of the model to analyze and reason with multimodal data, because the answer to a question could appear in various modalities, or even potentially not exist at…
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Layer Normalization · Residual Connection · Byte Pair Encoding · Absolute Position Encodings · Multi-Head Attention · Softmax
