InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback

Henry Hengyuan Zhao; Wenqi Pei; Yifei Tao; Haiyang Mei; Mike Zheng Shou

arXiv:2502.15027 · cs.CL · November 10, 2025

TL;DR

This paper introduces InterFeedback, an interactive evaluation framework, and InterFeedback-Bench, a benchmark built on it, to measure how well large multimodal models (LMMs) refine their responses when given human feedback. The results show that current models are limited in this ability.

Contribution

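The paper proposes InterFeedback, an interactive evaluation framework that can be applied to any LMM and dataset; introduces the InterFeedback-Bench benchmark and the 120-case InterFeedback-Human dataset for manual testing; and assesses the performance of state-of-the-art models in interactive scenarios.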

Findings

01 State-of-the-art models perform poorly at refining their responses based on feedback.

02 InterFeedback-Bench evaluates models' interactive capabilities on two representative datasets, MMMU-Pro and MathVerse.

03 Even OpenAI-o1 achieves an average score of less than 50% when refining its responses based on human feedback.

Abstract

Existing benchmarks do not test Large Multimodal Models (LMMs) on their interactive intelligence with human users, which is vital for developing general-purpose AI assistants. We design InterFeedback, an interactive framework, which can be applied to any LMM and dataset to assess this ability autonomously. On top of this, we introduce InterFeedback-Bench which evaluates interactive intelligence using two representative datasets, MMMU-Pro and MathVerse, to test 10 different open-source LMMs. Additionally, we present InterFeedback-Human, a newly collected dataset of 120 cases designed for manually testing interactive performance in leading models such as OpenAI-o1 and Claude-Sonnet-4. Our evaluation results indicate that even the state-of-the-art LMM, OpenAI-o1, struggles to refine its responses based on human feedback, achieving an average score of less than 50%. Our findings point to…
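The kind of feedback loop the abstract describes can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Item`, `run_interactive_eval`, `model`, and `give_feedback`, the `max_rounds` cap, exact-match scoring, and the two reported metrics are all hypothetical and introduced here only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Item:
    question: str
    answer: str  # ground-truth label


# Hypothetical interfaces (assumptions, not from the paper):
# a model maps (question, feedback history) to an answer string, and
# a feedback provider maps (item, wrong answer) to a feedback string.
Model = Callable[[str, List[str]], str]
FeedbackProvider = Callable[[Item, str], str]


def run_interactive_eval(
    items: List[Item],
    model: Model,
    give_feedback: FeedbackProvider,
    max_rounds: int = 3,  # assumed cap on feedback rounds per item
) -> Dict[str, float]:
    """Answer each item once, then retry with feedback while the answer is wrong."""
    initially_correct = 0
    corrected = 0
    for item in items:
        history: List[str] = []
        prediction = model(item.question, history)
        if prediction == item.answer:  # exact match is an assumed scoring rule
            initially_correct += 1
            continue
        for _ in range(max_rounds):
            # Feedback is generated from the latest wrong answer and accumulated.
            history.append(give_feedback(item, prediction))
            prediction = model(item.question, history)
            if prediction == item.answer:
                corrected += 1
                break
    total = len(items)
    initially_wrong = total - initially_correct
    return {
        "initial_accuracy": initially_correct / total if total else 0.0,
        # Share of initially wrong items that were fixed after feedback.
        "correction_rate": corrected / initially_wrong if initially_wrong else 0.0,
    }
```

Separating initial accuracy from the share of errors fixed after feedback keeps a model's baseline problem-solving ability apart from its ability to use feedback, which is the capability the benchmark targets.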

Code & Models

Datasets

hhenryz/InterFeedBack-Human
dataset · 15 dl
