Toward Scalable Audio Description Quality Control: A Workflow for Evaluating Human and VLM Raters

Lana Do; Gio Jung; Juvenal Francisco Barajas; Andrew Taylor Scott; Shasta Ihorn; Alexander Mario Blum; Vassilis Athitsos; Ilmi Yoon

arXiv:2602.01390·cs.HC·May 8, 2026

TL;DR

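This paper introduces a scalable workflow that uses Item Response Theory to measure how well vision-language models (VLMs) and human raters reproduce expert ground-truth ratings of audio description (AD) quality. Top-performing VLMs approximate the ground truth about as well as human raters, but their reasoning is less reliable and less actionable.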

Contribution

The paper presents a methodological workflow for assessing long-form audio description quality at scale: VLM and human raters are scored against expert-established ground truth using a six-dimensional framework grounded in professional guidelines.

Findings

01. Top-performing VLMs can approximate ground-truth ratings at levels comparable to human raters.

02. VLM reasoning is less reliable and less actionable than human judgment.

03. The workflow supports scalable quality control for audio description.

Abstract

Digital video is central to communication, education, and entertainment, but without audio description (AD), blind and low-vision users are excluded. While crowdsourced platforms and vision-language models (VLMs) expand AD production, quality is rarely checked systematically. Existing evaluations rely on NLP metrics and short-clip guidelines, leaving open the question of how to assess long-form AD quality at scale. To address this, we developed a methodological workflow using Item Response Theory to evaluate VLM and human rater proficiency against expert-established ground truth. Evaluations were based on a six-dimensional framework, grounded in professional guidelines and shaped by insights from our accessibility experts and blind consultants. Findings suggest that top-performing VLMs can approximate ground-truth ratings at levels comparable to human raters. However, qualitative…
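The abstract does not say which Item Response Theory model the authors fit, so the following is only an illustrative sketch under stated assumptions: the simplest one-parameter (Rasch) model, binary outcomes recording whether a rater's score matched the expert ground truth, and a small L2 penalty to keep estimates finite. The names `fit_rasch` and `TOY_MATCHES`, and the toy data itself, are hypothetical and do not come from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(responses, steps=2000, lr=0.05, l2=0.01):
    """Estimate rater proficiency (theta) and item difficulty (b).

    responses[j][i] is 1 if rater j matched the ground-truth rating on
    item i, else 0. Assumed model: P(match) = sigmoid(theta[j] - b[i]).
    Fitted by gradient ascent on the L2-penalized log-likelihood.
    """
    n_raters = len(responses)
    n_items = len(responses[0])
    theta = [0.0] * n_raters
    b = [0.0] * n_items
    for _ in range(steps):
        # Start each gradient from the penalty term, which pulls
        # parameters toward zero and pins down the scale's origin.
        g_theta = [-l2 * t for t in theta]
        g_b = [-l2 * d for d in b]
        for j in range(n_raters):
            for i in range(n_items):
                resid = responses[j][i] - sigmoid(theta[j] - b[i])
                g_theta[j] += resid  # more matches -> higher proficiency
                g_b[i] -= resid      # more matches -> lower difficulty
        theta = [t + lr * g for t, g in zip(theta, g_theta)]
        b = [d + lr * g for d, g in zip(b, g_b)]
    return theta, b


# Hypothetical toy data: 3 raters (rows) x 5 rated items (columns).
TOY_MATCHES = [
    [1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0],
]
```

Calling `fit_rasch(TOY_MATCHES)` returns one proficiency estimate per rater and one difficulty estimate per item. Under this model a rater who matches the ground truth more often receives a higher estimate, which is the sense in which human and VLM raters can be placed on a common scale.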
