Toward Scalable Audio Description Quality Control: A Workflow for Evaluating Human and VLM Raters
Lana Do, Gio Jung, Juvenal Francisco Barajas, Andrew Taylor Scott, Shasta Ihorn, Alexander Mario Blum, Vassilis Athitsos, Ilmi Yoon

TL;DR
This paper introduces a scalable workflow that uses Item Response Theory to measure how well vision-language models (VLMs) and human raters assess audio description quality relative to expert-established ground truth, highlighting both the potential and the limitations of VLMs.
Contribution
It presents a methodological framework for assessing long-form audio description quality at scale, evaluating VLM and human raters against expert-established ground truth along six dimensions grounded in professional guidelines.
Findings
Top-performing VLMs can approximate ground-truth ratings at levels comparable to human raters.
VLM reasoning is less reliable and less actionable than human judgment.
The workflow supports scalable quality control for audio description.
Abstract
Digital video is central to communication, education, and entertainment, but without audio description (AD), blind and low-vision users are excluded. While crowdsourced platforms and vision-language models (VLMs) expand AD production, quality is rarely checked systematically. Existing evaluations rely on NLP metrics and short-clip guidelines, leaving open the question of how to assess long-form AD quality at scale. To address this, we developed a methodological workflow using Item Response Theory to evaluate VLM and human rater proficiency against expert-established ground truth. Evaluations were based on a six-dimensional framework, grounded in professional guidelines and shaped by insights from our accessibility experts and blind consultants. Findings suggest that top-performing VLMs can approximate ground-truth ratings at levels comparable to human raters. However, qualitative…
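To make the idea of estimating rater proficiency with Item Response Theory concrete, here is a minimal sketch using the Rasch (one-parameter logistic) model. The abstract does not state which IRT model the authors use, so the model choice, the dichotomous "matched ground truth" scoring, the item difficulties, and the response vectors below are all illustrative assumptions, not the paper's method or data.

```python
import math

def p_match(theta, b):
    """Rasch (1PL) probability that a rater with proficiency theta
    matches the ground-truth rating on an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_proficiency(responses, difficulties, iters=50):
    """Maximum-likelihood estimate of theta via Newton-Raphson.

    responses: 1 if the rater matched ground truth on the item, else 0.
    Assumes item difficulties are already known and that the responses
    are not all 0 or all 1 (the estimate would diverge otherwise).
    """
    theta = 0.0
    for _ in range(iters):
        probs = [p_match(theta, b) for b in difficulties]
        grad = sum(x - p for x, p in zip(responses, probs))  # score function
        info = sum(p * (1.0 - p) for p in probs)             # Fisher information
        step = grad / info
        theta += step
        if abs(step) < 1e-8:
            break
    return theta

# Hypothetical item difficulties and rater responses (made up for illustration).
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
human_responses = [1, 1, 1, 0, 1]
vlm_responses = [1, 1, 0, 0, 1]

theta_human = estimate_proficiency(human_responses, difficulties)
theta_vlm = estimate_proficiency(vlm_responses, difficulties)
print(f"human proficiency: {theta_human:.2f}, VLM proficiency: {theta_vlm:.2f}")
```

Placing human and VLM raters on the same proficiency scale is what allows a direct comparison of the kind the abstract reports. A full analysis would typically estimate item difficulties jointly with proficiencies and might use a polytomous model for ordinal ratings.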
