KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

Mehdi Hosseinzadeh; King Hang Wong; Feras Dayoub

arXiv:2604.07034 · cs.RO · April 9, 2026

TL;DR

KITE is a training-free, keyframe-based visual front-end that converts robot videos into compact, interpretable tokens for failure analysis with vision-language models, improving performance on failure detection and explanation tasks.

Contribution

KITE introduces a novel, training-free approach to convert robot videos into structured tokens for failure analysis, enhancing interpretability and performance across multiple tasks.

Findings

01. Significant improvement in failure detection, identification, and localization on the RoboFAC benchmark.

02. Effective application to real robot videos, demonstrating practical utility.

03. Fine-tuning with QLoRA enhances explanation and correction capabilities.

Abstract

We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird's-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialized with robot-profile and scene-context tokens into a unified prompt, allowing the same front-end to support failure detection, identification, localization, explanation, and correction with an off-the-shelf VLM. On the RoboFAC benchmark, KITE with Qwen2.5-VL substantially improves over vanilla Qwen2.5-VL in the training-free setting, with especially large gains on simulation failure detection,…
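
The pipeline described in the abstract (pick motion-salient keyframes, attach detections with layout information, serialize everything into one prompt) can be illustrated with a minimal sketch. This is not the authors' implementation: the saliency measure (mean absolute difference between consecutive frames), the `Detection` fields, and the `[ROBOT]`, `[SCENE]`, and `[KF ...]` token format are all assumptions made here for illustration, since this page does not specify them.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical frame type: a grayscale image as rows of pixel intensities.
Frame = Sequence[Sequence[float]]


@dataclass
class Detection:
    """Hypothetical detection record; field names are illustrative only."""
    label: str
    confidence: float
    x: float  # assumed bird's-eye-view coordinate
    y: float  # assumed bird's-eye-view coordinate


def motion_scores(frames: Sequence[Frame]) -> List[float]:
    """Mean absolute pixel difference between each frame and its predecessor.

    The first frame has no predecessor and gets a score of 0.0.
    """
    scores = [0.0] if frames else []
    for prev, cur in zip(frames, frames[1:]):
        total, count = 0.0, 0
        for row_prev, row_cur in zip(prev, cur):
            for p, c in zip(row_prev, row_cur):
                total += abs(c - p)
                count += 1
        scores.append(total / count if count else 0.0)
    return scores


def select_keyframes(frames: Sequence[Frame], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames, in temporal order."""
    scores = motion_scores(frames)
    top = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def serialize_prompt(
    robot_profile: str,
    scene_context: str,
    keyframes: Sequence[Tuple[float, Sequence[Detection]]],
) -> str:
    """Serialize profile, scene, and per-keyframe detections into one text prompt.

    The bracketed tags are an assumed format, not the paper's token scheme.
    """
    lines = [f"[ROBOT] {robot_profile}", f"[SCENE] {scene_context}"]
    for timestamp, detections in keyframes:
        objs = "; ".join(
            f"{d.label}({d.confidence:.2f})@({d.x:.2f},{d.y:.2f})"
            for d in detections
        )
        lines.append(f"[KF t={timestamp:.2f}s] {objs}")
    return "\n".join(lines)
```

In this sketch, `select_keyframes` would be run on the decoded video first, an open-vocabulary detector (not shown) would supply the `Detection` records for each chosen frame, and the string returned by `serialize_prompt` would accompany the keyframe images in the VLM request.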
