ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models

Ziyue Wang; Chi Chen; Fuwen Luo; Yurui Dong; Yuanchi Zhang; Yuzhuang Xu; Xiaolong Wang; Peng Li; Yang Liu

arXiv:2410.04659·cs.CV·April 10, 2025

TL;DR

ActiView introduces a new benchmark to evaluate active perception in multimodal large language models, revealing significant gaps and emphasizing the importance of perceptual control for improved multimodal understanding.

Contribution

This paper presents the first benchmark for active perception in MLLMs, focusing on a specialized VQA task with perceptual restrictions to assess models' active reasoning abilities.

Findings

01 Active perception is crucial but underexplored in MLLMs.

02 Restricted perceptual fields significantly impact model performance.

03 There is a notable gap in active perception capabilities among current MLLMs.

Abstract

Active perception, a crucial human capability, involves setting a goal based on the current understanding of the environment and performing actions to achieve that goal. Despite significant efforts in evaluating Multimodal Large Language Models (MLLMs), active perception has been largely overlooked. To address this gap, we propose a novel benchmark named ActiView to evaluate active perception in MLLMs. We focus on a specialized form of Visual Question Answering (VQA) that eases and quantifies the evaluation yet remains challenging for existing MLLMs. We also discuss the intermediate reasoning behaviors of models. Given an image, we restrict the perceptual field of a model, requiring it to actively zoom or shift its perceptual field based on reasoning to answer the question successfully. We conduct extensive evaluation over 30 models, including proprietary and open-source models, and…
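To make the idea of a restricted perceptual field concrete, here is a minimal sketch of how an image might be partitioned into views that a model sees one at a time. It is an illustration only, not the benchmark's actual implementation: the 2x2 grid, the `Box` type, and the `split_into_views` and `shift` helpers are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Pixel rectangle (hypothetical helper type, not from the paper)."""
    left: int
    top: int
    right: int
    bottom: int


def split_into_views(width, height, rows=2, cols=2):
    """Partition a width x height image into rows * cols rectangular views.

    The grid size is an assumption made for this sketch.
    """
    views = {}
    for r in range(rows):
        for c in range(cols):
            views[(r, c)] = Box(
                left=c * width // cols,
                top=r * height // rows,
                right=(c + 1) * width // cols,
                bottom=(r + 1) * height // rows,
            )
    return views


def shift(cell, direction, rows=2, cols=2):
    """Move the perceptual field to a neighbouring cell, clamped to the grid."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    dr, dc = moves[direction]
    r = min(max(cell[0] + dr, 0), rows - 1)
    c = min(max(cell[1] + dc, 0), cols - 1)
    return (r, c)


# Start with only the top-left view visible, then shift right and
# "zoom" by cropping the full-resolution image to the selected box.
views = split_into_views(640, 480)
current = shift((0, 0), "right")
zoom_box = views[current]
```

In this sketch, shifting changes which grid cell is visible, while zooming corresponds to cropping the original image to the selected box so the model receives that region at higher effective resolution.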

Code & Models

Repositories

THUNLP-MT/ActiView
PyTorch · Official

Datasets

wangphoebe/ActiView
dataset · 21 dl

Videos

ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Focus