Human-AI Divergence in Ego-centric Action Recognition under Spatial and Spatiotemporal Manipulations

Sadegh Rahmaniboldaji; Filip Rybansky; Quoc C. Vuong; Anya C. Hurlbert; Frank Guerin; Andrew Gilbert

arXiv:2603.08317·cs.CV·March 10, 2026

TL;DR

This study compares human and AI performance in egocentric action recognition under spatial and temporal manipulations. It finds that humans rely on sparse hand-object cues, whereas models depend on context and low-level features, which gives the two very different robustness profiles.

Contribution

The paper introduces a large-scale comparative analysis of human and AI egocentric action recognition using Minimal Identifiable Recognition Crops (MIRCs), revealing that humans and models rely on different visual cues and differ in their sensitivity to spatial and temporal disruptions.

Findings

01

Human recognition declines sharply under spatial reduction, reflecting a reliance on hand-object cues.

02

Models degrade gradually, often relying on context and low-level features.

03

Humans are robust to temporal scrambling if spatial cues are preserved.

Abstract

Humans consistently outperform state-of-the-art AI models in action recognition, particularly in challenging real-world conditions involving low resolution, occlusion, and visual clutter. Understanding the sources of this performance gap is essential for developing more robust and human-aligned models. In this paper, we present a large-scale human-AI comparative study of egocentric action recognition using Minimal Identifiable Recognition Crops (MIRCs), defined as the smallest spatial or spatiotemporal regions sufficient for reliable human recognition. We use our previously introduced Epic ReduAct, a systematically spatially reduced and temporally scrambled dataset derived from 36 EPIC KITCHENS videos, spanning multiple spatial reduction levels and temporal conditions. Recognition performance is evaluated using over 3,000 human participants and the Side4Video model. Our analysis…
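
To make the two manipulations named in the abstract concrete, here is a minimal sketch of a spatial reduction and a temporal scramble applied to a toy clip. It is an illustration under stated assumptions, not the Epic ReduAct pipeline: the centred-crop rule, the `scale` parameter, the seeded shuffle, and the function names `crop_region`, `reduce_clip`, and `scramble_clip` are all hypothetical, and the actual dataset may select crop regions and frame orders differently.

```python
import random


def crop_region(frame, top, left, height, width):
    """Return a rectangular sub-window of a frame (a list of pixel rows)."""
    return [row[left:left + width] for row in frame[top:top + height]]


def reduce_clip(frames, scale):
    """Spatial reduction (assumed rule): keep a centred window whose sides
    are `scale` times the original frame sides, for every frame."""
    h, w = len(frames[0]), len(frames[0][0])
    new_h, new_w = max(1, int(h * scale)), max(1, int(w * scale))
    top, left = (h - new_h) // 2, (w - new_w) // 2
    return [crop_region(f, top, left, new_h, new_w) for f in frames]


def scramble_clip(frames, seed=0):
    """Temporal scrambling (assumed rule): same frames, shuffled order.
    Spatial content of each frame is left untouched."""
    order = list(range(len(frames)))
    random.Random(seed).shuffle(order)
    return [frames[i] for i in order]


# Toy clip: 6 frames of 8x8 "pixels", each value encoding (time, row, col).
clip = [[[t * 100 + r * 8 + c for c in range(8)] for r in range(8)]
        for t in range(6)]

reduced = reduce_clip(clip, 0.5)          # 6 frames of 4x4
scrambled = scramble_clip(reduced, seed=1)  # same frames, new order
```

In this sketch the scramble preserves every frame's spatial content and changes only the order, which is the kind of condition under which the findings report that humans remain robust.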

Taxonomy

Topics: Human Pose and Action Recognition · Emotion and Mood Recognition · Action Observation and Synchronization