Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation

Jinxing Zhou; Yanghao Zhou; Yaoting Wang; Zongyan Han; Jiaqi Ma; Henghui Ding; Rao Muhammad Anwer; Hisham Cholakkal

arXiv:2602.03892·cs.CV·February 5, 2026

Open Access

TL;DR

This paper introduces a new reference-free method for assessing the quality of segmentation masks in language-referred audio-visual segmentation, enabling error detection and quality control without ground-truth annotations.

Contribution

It proposes a novel task, MQA-RefAVS, and a benchmark, MQ-RAVSBench, for reference-free mask quality assessment in Ref-AVS, along with MQ-Auditor, a multimodal large language model-based auditor.

Findings

01. MQ-Auditor outperforms existing models in mask quality assessment.

02. The method can detect segmentation failures effectively.

03. It supports downstream segmentation improvement.

Abstract

Language-referred audio-visual segmentation (Ref-AVS) aims to segment target objects described by natural language by jointly reasoning over video, audio, and text. Beyond generating segmentation masks, providing rich and interpretable diagnoses of mask quality remains largely underexplored. In this work, we introduce Mask Quality Assessment in the Ref-AVS context (MQA-RefAVS), a new task that evaluates the quality of candidate segmentation masks without relying on ground-truth annotations as references at inference time. Given audio-visual-language inputs and each provided segmentation mask, the task requires estimating its IoU with the unobserved ground truth, identifying the corresponding error type, and recommending an actionable quality-control decision. To support this task, we construct MQ-RAVSBench, a benchmark featuring diverse and representative mask error modes that span both…
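
To make the task definition above concrete, the following minimal sketch shows the quantity an auditor is asked to estimate (the IoU between a candidate mask and the ground truth) and one possible shape for the three outputs the abstract lists. It is an illustration under stated assumptions, not the authors' implementation: the `AuditResult` container, its field names, and the example labels "under_segmentation" and "re_segment" are hypothetical and do not come from the paper. The ground-truth mask appears here only to score the estimate offline; at inference time it is not available to the auditor.

```python
from dataclasses import dataclass
from typing import Sequence

# A binary mask as rows of 0/1 values (an assumed representation).
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    if union == 0:
        # Both masks empty: scored as a perfect match (a convention assumed here).
        return 1.0
    return inter / union


@dataclass
class AuditResult:
    """Hypothetical container for the three outputs named in the abstract."""
    estimated_iou: float  # auditor's estimate of IoU with the unseen ground truth
    error_type: str       # illustrative label, not the paper's taxonomy
    decision: str         # illustrative quality-control action


def iou_estimation_error(result: AuditResult, pred: Mask, gt: Mask) -> float:
    """Absolute gap between the estimated IoU and the true IoU (offline scoring)."""
    return abs(result.estimated_iou - mask_iou(pred, gt))


# Candidate mask covers the left half of the object; ground truth covers all of it.
candidate = [[1, 1, 0, 0],
             [1, 1, 0, 0]]
ground_truth = [[1, 1, 1, 1],
                [1, 1, 1, 1]]

audit = AuditResult(estimated_iou=0.4,
                    error_type="under_segmentation",
                    decision="re_segment")

true_iou = mask_iou(candidate, ground_truth)  # 4 overlapping pixels / 8 in the union
gap = iou_estimation_error(audit, candidate, ground_truth)
```

In this toy case the true IoU is 0.5, so an estimate of 0.4 is off by 0.1. Scoring an auditor by this absolute gap is one simple choice for the sketch; the metrics actually used in the benchmark are not stated in the excerpt above.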

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and Audio Processing · Advanced Neural Network Applications