AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

Lidong Lu; Guo Chen; Zhiqi Li; Yicheng Liu; Tong Lu

arXiv:2506.05328·cs.CV·July 23, 2025

TL;DR

This paper introduces CG-AV-Counting, a manually annotated, clue-grounded audio-visual counting benchmark, and proposes AV-Reasoner, a model trained with reinforcement learning (GRPO) and curriculum learning that achieves state-of-the-art results across multiple benchmarks.

Contribution

The paper presents a new clue-grounded counting benchmark (1,027 multimodal questions and 5,845 annotated clues over 497 long videos) and a model, AV-Reasoner, that improves counting in multimodal video understanding.

Findings

01. AV-Reasoner achieves state-of-the-art results on multiple benchmarks.

02. Reinforcement learning improves counting performance.

03. Language-based reasoning struggles on out-of-domain data.

Abstract

Despite progress in video understanding, current MLLMs struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, a lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually annotated, clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve models' counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks,…
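
The abstract names GRPO as the training method without giving details on this page. As a rough illustration of the group-relative idea behind GRPO, the sketch below scores several sampled answers to one counting question and normalizes their rewards within the group. The reward function `count_reward`, the example predictions, and the use of the population standard deviation are assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean, pstdev


def count_reward(predicted: int, target: int) -> float:
    """Hypothetical reward (an assumption, not the paper's): 1.0 for an
    exact count, decreasing linearly with relative error, floored at 0."""
    if target == 0:
        return 1.0 if predicted == 0 else 0.0
    return max(0.0, 1.0 - abs(predicted - target) / target)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples for the same prompt:
    subtract the group mean and divide by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled counts for one question whose true count is 5 (made-up data).
predictions = [4, 5, 5, 7]
rewards = [count_reward(p, 5) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Answers that beat the group average receive positive advantages and the rest receive negative ones, so no separate value model is needed to provide a baseline. If every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.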

Code & Models

Models

lulidong/AV-Reasoner-7B
model · 2 downloads · 5 likes

Datasets

CG-Bench/CG-AV-Counting
dataset · 41 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning