VISOR: A Vision-Language Model-based Test Oracle for Testing Robots

Prasun Saurabh; Pablo Valle; Aitor Arrieta; Shaukat Ali; Paolo Arcaini

arXiv:2605.10408·cs.SE·May 19, 2026

TL;DR

VISOR introduces a vision-language model-based automated test oracle for robots, improving the efficiency and objectivity of task correctness and quality assessments without human intervention.

Contribution

This work presents VISOR, a novel approach leveraging VLMs to automate robot testing and explicitly quantify assessment uncertainty, addressing limitations of traditional symbolic oracles.

Findings

1. Gemini achieves higher recall than GPT.

2. GPT achieves higher precision than Gemini.

3. For both models, uncertainty correlates only weakly with the correctness of the assessment.
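
The findings above rest on three standard quantities: precision, recall, and a correlation between the oracle's uncertainty and whether its verdict was correct. The following is a minimal sketch of how such quantities could be computed, not the paper's evaluation code. The data are invented toy values, the names `precision_recall`, `pearson`, `predicted`, `actual`, and `uncertainty` are hypothetical, and the choice of "oracle reports a failure" as the positive class and of Pearson correlation are assumptions made only for illustration.

```python
from math import sqrt


def precision_recall(predicted, actual):
    """Precision and recall, treating True as the positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)          # oracle positive, truly positive
    fp = sum(p and not a for p, a in pairs)      # oracle positive, truly negative
    fn = sum(a and not p for p, a in pairs)      # oracle negative, truly positive
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


# Toy data (invented): True means "the test is judged a failure".
predicted = [True, True, False, True, False, False]   # hypothetical oracle verdicts
actual = [True, False, False, True, True, False]      # hypothetical ground truth
uncertainty = [0.1, 0.4, 0.3, 0.2, 0.5, 0.6]          # hypothetical self-reported uncertainty

precision, recall = precision_recall(predicted, actual)

# 1 where the oracle agreed with the ground truth, 0 otherwise.
correct = [1 if p == a else 0 for p, a in zip(predicted, actual)]
r = pearson(uncertainty, correct)

print(f"precision={precision:.2f} recall={recall:.2f} corr={r:.2f}")
```

A useful uncertainty estimate would show a clearly negative correlation here (higher uncertainty on wrong verdicts); a value near zero corresponds to the weak relationship reported in the third finding.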

Abstract

Testing robots requires assessing whether they perform their intended tasks correctly, dependably, and with high quality, a challenge known as the test oracle problem in software testing. Traditionally, this assessment relies on task-specific symbolic oracles for task correctness and on human manual evaluation of robot behavior, which is time-consuming, subjective, and error-prone. To address this, we propose VISOR, a Vision-Language Model (VLM)-based approach for automated test oracle assessment that eliminates the need for expensive human evaluations. VISOR performs automated evaluation of task correctness and quality, addressing the limitations of existing symbolic test oracles, which are task-specific and provide pass/fail judgments without explicitly quantifying task quality. Given the inherent uncertainty in VLMs, VISOR also explicitly quantifies its own uncertainty during test…
