SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion

Xiang Li; Heqian Qiu; Lanxiao Wang; Benliu Qiu; Fanman Meng; Linfeng Xu; Hongliang Li

arXiv:2603.12764·cs.CV·March 16, 2026

Open Access

TL;DR

SAVA-X introduces a novel framework for detecting imitation errors across egocentric and exocentric videos, effectively addressing cross-view challenges in industrial and healthcare settings.

Contribution

The paper proposes a unified approach that combines adaptive sampling, scene-aware embeddings, and bidirectional fusion to improve error detection on asynchronous, length-mismatched ego and exo videos.

Findings

01

SAVA-X outperforms baselines on the EgoMe benchmark.

02

Component ablations show the effectiveness of each module.

03

The method handles cross-view domain shifts effectively.

Abstract

Error detection is crucial in industrial training, healthcare, and assembly quality control. Most existing work assumes a single-view setting and cannot handle the practical case where a third-person (exo) demonstration is used to assess a first-person (ego) imitation. We formalize Ego $\to$ Exo Imitation Error Detection: given asynchronous, length-mismatched ego and exo videos, the model must localize procedural steps on the ego timeline and decide whether each is erroneous. This setting introduces cross-view domain shift, temporal misalignment, and heavy redundancy. Under a unified protocol, we adapt strong baselines from dense video captioning and temporal action detection and show that they struggle in this cross-view regime. We then propose SAVA-X, an Align-Fuse-Detect framework with (i) view-conditioned adaptive sampling, (ii) scene-adaptive view embeddings, and (iii)…
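
The task output described in the abstract (steps localized on the ego timeline, each with an error decision) can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `StepPrediction` schema, the use of temporal IoU for localization, the 0.5 threshold, and the greedy matching are hypothetical choices, not the authors' evaluation protocol.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepPrediction:
    """One procedural step on the ego timeline (hypothetical schema)."""
    start: float    # start time in seconds on the ego video
    end: float      # end time in seconds on the ego video
    is_error: bool  # True if the imitation of this step is judged erroneous


def temporal_iou(a: StepPrediction, b: StepPrediction) -> float:
    """Intersection-over-union of two time segments."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def count_correct(preds: List[StepPrediction],
                  truths: List[StepPrediction],
                  iou_threshold: float = 0.5) -> int:
    """Greedy one-to-one matching (an assumed, illustrative protocol).

    A prediction counts as correct when it overlaps an unmatched
    ground-truth step with IoU >= iou_threshold and the error
    labels agree.
    """
    used = set()
    correct = 0
    for p in preds:
        best_j, best_iou = -1, 0.0
        for j, t in enumerate(truths):
            if j in used:
                continue
            iou = temporal_iou(p, t)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j >= 0 and best_iou >= iou_threshold:
            used.add(best_j)
            if p.is_error == truths[best_j].is_error:
                correct += 1
    return correct


# Toy data, invented for this example only.
truths = [StepPrediction(0.0, 4.0, False), StepPrediction(4.0, 10.0, True)]
preds = [StepPrediction(0.5, 4.0, False), StepPrediction(5.0, 10.0, False)]
print(count_correct(preds, truths))  # second step is localized but mislabeled
```

In this toy case both predicted steps are localized well enough, but only the first carries the right error label, which shows why localization and error classification have to be scored jointly.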

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis