BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data

Xuwu Wang; Qiwen Cui; Yunzhe Tao; Yiran Wang; Ziwei Chai; Xiaotian Han; Boyi Liu; Jianbo Yuan; Jing Su; Guoyin Wang; Tingkai Liu; Liyu Chen; Tianyi Liu; Tao Sun; Yufeng Zhang; Sirui Zheng; Quanzeng You; Yang Yang; Hongxia Yang

arXiv:2410.00773 · cs.AI · October 2, 2024

TL;DR

BabelBench is a comprehensive benchmark framework designed to evaluate large language models' abilities in handling multimodal, multistructured data with code execution, revealing significant room for improvement even in state-of-the-art models.

Contribution

This paper introduces BabelBench, a novel unified benchmark with 247 problems for assessing LLMs on multimodal, multistructured data processing and reasoning tasks.

Findings

01. ChatGPT-4 shows substantial room for improvement.

02. BabelBench covers perception, reasoning, and debugging tasks.

03. Provides guidance for future research in multimodal data handling.

Abstract

Large language models (LLMs) have become increasingly pivotal across various domains, especially in handling complex data types. This includes structured data processing, as exemplified by ChartQA and ChatGPT-Ada, and multimodal unstructured data processing as seen in Visual Question Answering (VQA). These areas have attracted significant attention from both industry and academia. Despite this, there remains a lack of unified evaluation methodologies for these diverse data handling scenarios. In response, we introduce BabelBench, an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution. BabelBench incorporates a dataset comprising 247 meticulously curated problems that challenge the models with tasks in perception, commonsense reasoning, logical reasoning, and so on. Besides the basic capabilities of…

Code & Models

Repositories

ffd8ffe/babelbench (Official)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need