View-aware Cross-modal Distillation for Multi-view Action Recognition

Trung Thanh Nguyen; Yasutomo Kawanishi; Vijay John; Takahiro Komamizu; Ichiro Ide

arXiv:2511.12870 · cs.CV · December 25, 2025

TL;DR

This paper introduces ViCoKD, a framework for multi-view action recognition that distills knowledge from a fully supervised multi-modal teacher to a student limited in both modalities and annotations, targeting view misalignment and partial observations in real-world scenarios.

Contribution

It proposes a view-aware cross-modal distillation method with a cross-modal adapter and a view consistency module, improving recognition accuracy under limited-modality and partial-view conditions.

Findings

01. Outperforms existing distillation methods on the MultiSensor-Home dataset.

02. Achieves significant accuracy gains with limited modalities.

03. Surpasses teacher performance in constrained settings.

Abstract

The widespread use of multi-sensor systems has increased research in multi-view action recognition. While existing approaches in multi-view setups with fully overlapping sensors benefit from consistent view coverage, partially overlapping settings where actions are visible in only a subset of views remain underexplored. This challenge becomes more severe in real-world scenarios, as many systems provide only limited input modalities and rely on sequence-level annotations instead of dense frame-level labels. In this study, we propose View-aware Cross-modal Knowledge Distillation (ViCoKD), a framework that distills knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student. ViCoKD employs a cross-modal adapter with cross-modal attention, allowing the student to exploit multi-modal correlations while operating with incomplete modalities. Moreover, we…
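The teacher-to-student transfer described in the abstract can be illustrated with a generic logit-distillation loss. The sketch below is not ViCoKD's actual objective, which the truncated abstract does not specify; it is the standard temperature-softened KL divergence between teacher and student class distributions, written under stated assumptions. The function names, the temperature value, and the example logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of class logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic KD loss: T^2 * KL(teacher || student) on softened outputs.

    Assumption: this is the common logit-matching formulation, used here
    only for illustration; it is not taken from the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    eps = 1e-12  # guards against log(0) if a probability underflows
    kl = sum(
        pt * (math.log(max(pt, eps)) - math.log(max(ps, eps)))
        for pt, ps in zip(p_teacher, p_student)
    )
    return (temperature ** 2) * kl


# Hypothetical logits for three action classes.
teacher = [4.0, 1.0, 0.5]   # e.g. from a multi-modal teacher
student = [2.0, 1.5, 1.0]   # e.g. from a modality-limited student
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge; in practice such a term is usually combined with a supervised loss on whatever labels the student has.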

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Context-Aware Activity Recognition Systems