A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

Yejin Bang; Samuel Cahyawijaya; Nayeon Lee; Wenliang Dai; Dan Su; Bryan Wilie; Holy Lovenia; Ziwei Ji; Tiezheng Yu; Willy Chung; Quyet V. Do; Yan Xu; Pascale Fung

arXiv:2302.04023·cs.CL·November 29, 2023·352 cites

Open Access · 1 Repo

TL;DR

This paper presents a comprehensive evaluation framework for ChatGPT across multiple tasks, languages, and modalities, revealing its strengths and limitations in reasoning, hallucination, and interactivity.

Contribution

It introduces a framework for quantitatively evaluating interactive LLMs such as ChatGPT on publicly available datasets and a newly designed multimodal dataset, covering multitask, multilingual, and multimodal performance and highlighting comparative strengths and weaknesses.

Findings

01. ChatGPT outperforms zero-shot LLMs on most tasks, and even fine-tuned models on some.

02. It is better at understanding non-Latin script languages than at generating them.

03. It achieves 63.41% average accuracy across 10 reasoning categories.

Abstract

This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. We find that ChatGPT outperforms LLMs with zero-shot learning on most tasks and even outperforms fine-tuned models on some tasks. We find that it is better at understanding non-Latin script languages than generating them. It is able to generate multimodal content from textual prompts, via an intermediate code generation step. Moreover, we find that ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense…

Code & Models

Repositories

hltchkust/chatgpt-evaluation (Official)

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques