EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning

Zhenghao Xing; Xiaowei Hu; Chi-Wing Fu; Wenhai Wang; Jifeng Dai; Pheng-Ann Heng

arXiv:2505.04623 · cs.CV · June 20, 2025

TL;DR

EchoInk-R1 is a reinforcement learning framework that significantly improves audio-visual reasoning in multimodal large language models, achieving higher accuracy and reflective reasoning capabilities on cross-modal tasks.

Contribution

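The paper introduces what it describes as the first unified audio-visual reasoning framework trained with reinforcement learning. The framework is built on Qwen2.5-Omni-7B, optimized with Group Relative Policy Optimization (GRPO), and accompanied by a new dataset, AVQA-R1-6K.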

Findings

01

Achieves 85.77% accuracy on the AVQA-R1-6K validation set.

02

Outperforms the base model (80.53%) after only 562 RL steps.

03

Demonstrates reflective reasoning when handling multimodal inputs.
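
The gap between the two reported scores can be made concrete with a little arithmetic. The sketch below uses only the two validation accuracies quoted on this page (85.77% for EchoInk-R1-7B and 80.53% for the base model); the function and key names are illustrative and are not taken from the paper or its repository.

```python
def accuracy_gains(tuned_acc: float, base_acc: float) -> dict:
    """Compare two accuracies given in percent (0-100)."""
    absolute = tuned_acc - base_acc                    # percentage points
    relative = absolute / base_acc * 100               # gain relative to the base score
    error_reduction = absolute / (100 - base_acc) * 100  # share of base errors removed
    return {
        "absolute_points": round(absolute, 2),
        "relative_percent": round(relative, 2),
        "error_reduction_percent": round(error_reduction, 2),
    }


# Scores reported for the AVQA-R1-6K validation set.
gains = accuracy_gains(tuned_acc=85.77, base_acc=80.53)
print(gains)
```

Read this way, the improvement is 5.24 percentage points, which is roughly a 6.5% relative gain and removes a little over a quarter of the base model's errors.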

Abstract

Multimodal large language models (MLLMs) have advanced perception across text, vision, and audio, yet they often struggle with structured cross-modal reasoning, particularly when integrating audio and visual signals. We introduce EchoInk-R1, a reinforcement learning framework that enhances such reasoning in MLLMs. Built upon the Qwen2.5-Omni-7B foundation and optimized with Group Relative Policy Optimization (GRPO), EchoInk-R1 tackles multiple-choice question answering over synchronized audio-image pairs. To enable this, we curate AVQA-R1-6K, a dataset pairing such audio-image inputs with multiple-choice questions derived from OmniInstruct-v1. EchoInk-R1-7B achieves 85.77% accuracy on the validation set, outperforming the base model, which scores 80.53%, using only 562 reinforcement learning steps. Beyond accuracy, EchoInk-R1 demonstrates reflective reasoning by revisiting initial…
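
To illustrate the core idea behind GRPO as applied to multiple-choice answers, here is a minimal sketch of the group-relative advantage computation. It is not the authors' implementation. The `<answer>` tag format, the A-D choice letters, the 0/1 accuracy reward, and the use of population standard deviation are all assumptions made for illustration; the page does not specify the reward design.

```python
import re
from statistics import mean, pstdev


def accuracy_reward(response: str, correct_choice: str) -> float:
    """Hypothetical reward: 1.0 if the letter inside <answer>...</answer>
    matches the correct choice, else 0.0. The tag format is an assumption."""
    match = re.search(r"<answer>\s*([A-D])\s*</answer>", response)
    return 1.0 if match and match.group(1) == correct_choice else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and standard deviation of its
    own sampled group, so no separate value network is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of responses sampled for one audio-image question (made-up text).
responses = [
    "<think>The bark matches the dog in the image.</think><answer>B</answer>",
    "<answer>A</answer>",
    "The answer is B",  # no answer tag, so it earns no reward here
    "<answer> B </answer>",
]
rewards = [accuracy_reward(r, correct_choice="B") for r in responses]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Responses that beat their group's average receive a positive advantage and the rest a negative one. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.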

Code & Models

Repositories

harryhsing/echoink
PyTorch · Official

Models

🤗
harryhsing/EchoInk-R1-7B
model · 18 dl · ♡ 3

Datasets

harryhsing/AVQA-R1-6K
dataset · 59 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems

Methods: Balanced Selection