Adversarially Masked Video Consistency for Unsupervised Domain Adaptation
Xiaoyu Zhu, Junwei Liang, Po-Yao Huang, Alex Hauptmann

TL;DR
This paper introduces a transformer-based approach to unsupervised domain adaptation for egocentric videos. It combines adversarial domain alignment with masked consistency learning to learn class-discriminative, domain-invariant features, and it is evaluated on Epic-Kitchen and a new, more challenging benchmark, U-Ego4D.
Contribution
It proposes an adversarial domain alignment network that trains a mask generator against a domain-invariant encoder, a masked consistency learning module for egocentric video adaptation, and a new benchmark dataset.
Findings
Achieves state-of-the-art results on Epic-Kitchen.
Introduces U-Ego4D, a new and challenging egocentric video benchmark.
Demonstrates the effectiveness of combining adversarial alignment with consistency learning.
Abstract
We study the problem of unsupervised domain adaptation for egocentric videos. We propose a transformer-based model to learn class-discriminative and domain-invariant feature representations. It consists of two novel designs. The first is the Generative Adversarial Domain Alignment Network, which aims to learn domain-invariant representations. It simultaneously learns a mask generator and a domain-invariant encoder in an adversarial way. The domain-invariant encoder is trained to minimize the distance between the source and target domains. The mask generator, conversely, aims to produce challenging masks by maximizing the domain distance. The second is a Masked Consistency Learning module, which learns class-discriminative representations. It enforces prediction consistency between the masked target videos and their full forms. To better evaluate the effectiveness of…
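The two objectives described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the domain distance, the consistency loss, or the architecture, so everything below is an assumption made for illustration. A squared distance between mean features stands in for the domain distance, cross-entropy against hard pseudo-labels stands in for the consistency term, a random mask stands in for the learned mask generator, and the names encode, apply_mask, domain_distance, and masked_consistency_loss are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def apply_mask(tokens, keep):
    # tokens: (N, T, D) video tokens; keep: (N, T) with 1 = visible, 0 = masked.
    return tokens * keep[..., None]

def domain_distance(src_feats, tgt_feats):
    # Assumed stand-in: squared distance between the mean features of each domain.
    diff = src_feats.mean(axis=0) - tgt_feats.mean(axis=0)
    return float(diff @ diff)

def masked_consistency_loss(logits_full, logits_masked):
    # Assumed form: hard pseudo-labels from the full view supervise the masked view.
    pseudo = softmax(logits_full).argmax(axis=-1)
    log_p = np.log(softmax(logits_masked) + 1e-12)
    return float(-log_p[np.arange(len(pseudo)), pseudo].mean())

rng = np.random.default_rng(0)
N, T, D, C = 4, 8, 16, 5                      # videos, tokens, feature dim, classes
src = rng.normal(size=(N, T, D))              # toy source-domain tokens
tgt = rng.normal(loc=0.5, size=(N, T, D))     # toy target-domain tokens (shifted)
W_enc = rng.normal(size=(D, D)) * 0.1         # toy encoder weights
W_cls = rng.normal(size=(D, C)) * 0.1         # toy classifier weights

def encode(tokens):
    # Toy encoder: mean-pool tokens, then one nonlinear projection.
    return np.tanh(tokens.mean(axis=1) @ W_enc)

# Random masks here; in the paper the masks come from a learned generator.
keep_src = (rng.random((N, T)) > 0.5).astype(float)
keep_tgt = (rng.random((N, T)) > 0.5).astype(float)

dist = domain_distance(encode(apply_mask(src, keep_src)),
                       encode(apply_mask(tgt, keep_tgt)))
consistency = masked_consistency_loss(encode(tgt) @ W_cls,
                                      encode(apply_mask(tgt, keep_tgt)) @ W_cls)

encoder_objective = dist + consistency   # the encoder would minimize this
generator_objective = -dist              # the generator minimizes this, i.e. maximizes distance
```

The opposite signs on dist are what make the setup adversarial: a step that lowers encoder_objective pulls the masked source and target features together, while a step that lowers generator_objective selects masks that push them apart. A real system would also include a supervised classification loss on labeled source videos, which is omitted here.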
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition · Multimodal Machine Learning Applications
