Recur, Attend or Convolve? On Whether Temporal Modeling Matters for Cross-Domain Robustness in Action Recognition
Sofia Broomé, Ernest Pokropek, Boyu Li, Hedvig Kjellström

TL;DR
This paper investigates whether the choice of temporal modeling (recurrence, attention, or convolution) affects cross-domain robustness in action recognition, and highlights the value of physical inductive biases over heavy parameterization.
Contribution
It introduces the Temporal Shape dataset and modified Diving48 domains to systematically assess the impact of temporal modeling on texture bias and robustness.
Findings
Recurrence may enhance domain shift robustness in action recognition.
Temporal shape cues are crucial for generalization across domains.
Models with physical inductive biases appear more robust across domains than models that rely on texture cues.
Abstract
Most action recognition models today are highly parameterized, and evaluated on datasets with appearance-wise distinct classes. It has also been shown that 2D Convolutional Neural Networks (CNNs) tend to be biased toward texture rather than shape in still image recognition tasks, in contrast to humans. Taken together, this raises suspicion that large video models partly learn spurious spatial texture correlations rather than to track relevant shapes over time to infer generalizable semantics from their movement. A natural way to avoid parameter explosion when learning visual patterns over time is to make use of recurrence. Biological vision consists of abundant recurrent circuitry, and is superior to computer vision in terms of domain shift generalization. In this article, we empirically study whether the choice of low-level temporal modeling has consequences for texture bias and…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Spatio-temporal stability analysis
