Multi-TW: Benchmarking Multimodal Models on Traditional Chinese Question Answering in Taiwan

Jui-Ming Yao; Bing-Cheng Xie; Sheng-Wei Peng; Hao-Yuan Chen; He-Rong Zheng; Bing-Jia Tan; Peter Shaojui Wang; Shun-Feng Su

arXiv:2508.01274 · cs.AI · August 5, 2025

TL;DR

Multi-TW is a comprehensive benchmark for evaluating multimodal models on Traditional Chinese question answering, considering performance and inference latency across various modalities and model types.

Contribution

This paper introduces Multi-TW, the first benchmark for Traditional Chinese multimodal question answering, including latency evaluation and diverse model assessments.

Findings

01. Closed-source models outperform open-source ones in most modalities.

02. Open-source models perform well in audio tasks.

03. End-to-end any-to-any pipelines have lower latency than VLMs paired with a separate audio transcription step.

Abstract

Multimodal Large Language Models (MLLMs) process visual, acoustic, and textual inputs, addressing the limitations of single-modality LLMs. However, existing benchmarks often overlook tri-modal evaluation in Traditional Chinese and do not consider inference latency. To address this, we introduce Multi-TW, the first Traditional Chinese benchmark for evaluating the performance and latency of any-to-any multimodal models. Multi-TW includes 900 multiple-choice questions (image and text, audio and text pairs) sourced from official proficiency tests developed with the Steering Committee for the Test of Proficiency-Huayu (SC-TOP). We evaluated various any-to-any models and vision-language models (VLMs) with audio transcription. Our results show that closed-source models generally outperform open-source ones across modalities, although open-source models can perform well in audio tasks.…
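
To make the two reported quantities concrete, here is a minimal sketch of scoring a model on multiple-choice items for both accuracy and per-question latency. It is not the authors' evaluation harness: the `Question` fields, the `evaluate` function, and the convention that a model is a callable returning an option label are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Question:
    # All field names are hypothetical; the benchmark's real schema may differ.
    prompt: str              # question text
    media_path: str          # path to the paired image or audio file
    choices: Sequence[str]   # option labels, e.g. ("A", "B", "C", "D")
    answer: str              # gold option label


def evaluate(model: Callable[[Question], str],
             questions: Sequence[Question]) -> dict:
    """Return accuracy and mean wall-clock latency (seconds) per question."""
    if not questions:
        raise ValueError("questions must be non-empty")
    correct = 0
    total_seconds = 0.0
    for q in questions:
        start = time.perf_counter()
        predicted = model(q)  # assumed to return an option label such as "A"
        total_seconds += time.perf_counter() - start
        # Compare labels case-insensitively, ignoring surrounding whitespace.
        correct += int(predicted.strip().upper() == q.answer.strip().upper())
    n = len(questions)
    return {"accuracy": correct / n, "mean_latency_s": total_seconds / n}
```

Under this sketch, a cascaded system (audio transcription followed by a VLM) would be wrapped as a single callable so that its measured latency includes the transcription step, which is what makes it comparable to an end-to-end model.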

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning