TL;DR
KITE is a training-free, keyframe-based visual front-end that converts robot videos into compact, interpretable tokens that vision-language models can use for failure analysis, improving failure detection and explanation.
Contribution
KITE introduces a training-free approach that converts robot videos into structured tokens for failure analysis, improving interpretability and performance across multiple tasks.
Findings
On the RoboFAC benchmark, KITE with Qwen2.5-VL substantially improves failure detection, identification, and localization over vanilla Qwen2.5-VL in the training-free setting.
KITE is also applied effectively to real robot videos, demonstrating practical utility.
Fine-tuning with QLoRA further improves explanation and correction.
Abstract
We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird's-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialized with robot-profile and scene-context tokens into a unified prompt, allowing the same front-end to support failure detection, identification, localization, explanation, and correction with an off-the-shelf VLM. On the RoboFAC benchmark, KITE with Qwen2.5-VL substantially improves over vanilla Qwen2.5-VL in the training-free setting, with especially large gains on simulation failure detection,…
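To make the pipeline described in the abstract concrete, here is a minimal sketch of two of its steps: picking motion-salient keyframes and serializing them with robot-profile and scene-context tokens into a text prompt. It is not the authors' implementation. The frame-difference motion score, the token tags (`[ROBOT]`, `[SCENE]`, `[KEYFRAME]`), and the function names `motion_scores`, `select_keyframes`, and `serialize_prompt` are all assumptions made for illustration; the BEV rendering and the open-vocabulary detector are left out, and detections are passed in as plain tuples.

```python
def motion_scores(frames):
    """Score each frame by mean absolute pixel change from the previous frame.

    `frames` is a list of equal-length flat pixel lists (a stand-in for images).
    The first frame has no predecessor, so its score is 0.0.
    """
    if not frames:
        return []
    scores = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        scores.append(sum(abs(a - b) for a, b in zip(cur, prev)) / len(cur))
    return scores


def select_keyframes(frames, k=4):
    """Return the indices of the k highest-motion frames, in temporal order."""
    scores = motion_scores(frames)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def serialize_prompt(keyframes, detections, fps, robot_profile, scene_context):
    """Serialize keyframes and their detections into one text prompt.

    `detections` maps a frame index to (label, x, y, confidence) tuples, where
    x and y are hypothetical relative layout coordinates.
    """
    lines = [f"[ROBOT] {robot_profile}", f"[SCENE] {scene_context}"]
    for idx in keyframes:
        objs = "; ".join(
            f"{label}@({x:.2f},{y:.2f}) conf={conf:.2f}"
            for label, x, y, conf in detections.get(idx, [])
        )
        lines.append(f"[KEYFRAME t={idx / fps:.2f}s] {objs}")
    return "\n".join(lines)


# Toy video: six 4-pixel frames where only frame 3 differs from its neighbours.
frames = [[0, 0, 0, 0] for _ in range(6)]
frames[3] = [1, 1, 1, 1]
keyframes = select_keyframes(frames, k=2)
prompt = serialize_prompt(
    keyframes,
    {3: [("cup", 0.5, 0.25, 0.9)]},
    fps=10,
    robot_profile="single-arm gripper",
    scene_context="tabletop pick-and-place",
)
```

In this toy input the two selected indices are the frame where the change appears and the frame where it disappears. A real front-end would compute motion on actual images and attach the BEV schematic to each keyframe before handing the prompt to the VLM.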
